Note: This is a long issue. We would like to collect early feedback on the proposed directions before diving too deeply into any specific area of improvement.
Background
We ingested a table with one secondary index into a single standard Neon cluster:
- Row count: 110,728,822,784 rows
- Size: 16TB
- Secondary index size: 13TB
We are running a single tenant on a Pageserver with 42TB of SSD storage and 24 cores. Resources on the Pageserver are never saturated.
Configuration
We are running a single-shard Pageserver.
…
disk_usage_based_eviction = { max_usage_pct = 90, min_avail_bytes = 4200000000000, period = '1m' }
tenant_config = { pitr_interval = '1 hour' }
…
Status
After ingestion and index creation, the L1 layers have accumulated to a degree that makes reads extremely slow.
Specifically, we enabled a workload that generated about 1200 GetPage@LSN requests per second, and collected the following stats:
- P99 GetPage@LSN: 2.5s
- P99 number of deltas per read: > 256 (largest bucket)
- Layer count
- L1: 175K (40TB)
- Images: 133K (10TB)
Investigation
Our hypotheses:
- A large number of L1 layers causes high tail read latency.
- A prolonged ingestion period led to this accumulation of L1 layers.
- Auto-compaction failed to compact these layers effectively.
To explore this further, we captured a snapshot of layer files stored locally in the Pageserver. One key range stood out with an unusually large number of LSN ranges (L1):
000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002
There are 2408 L1 files for this key range:
000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00002D2B06FE7E81-00002D2BB7E3EA59-v1-00000001
…
000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00003E79B55FEA09-00003E7A53EDE611-v1-00000001
000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00003E7A53EDE611-00003E7AF27BFD19-v1-00000001
Each file is about 9MB. After looking into the key format, we found that this is a metadata key range that contains relsize and aux files.
We don’t see a significant skew on other ranges but are happy to provide more detail on them if needed.
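For context, the per-key-range breakdown above can be reproduced with something like the following; this is a minimal sketch (not taken from the Pageserver code), assuming layer files live under `<workdir>/tenants/<tenant_id>/timelines/<timeline_id>/` and follow the `<key range>__<LSN or LSN range>[-v1-<generation>]` naming shown above. The path layout and the delta-vs-image heuristic are assumptions.

```rust
use std::collections::HashMap;
use std::fs;

fn main() -> std::io::Result<()> {
    // Path to a local timeline directory, e.g. <workdir>/tenants/<tenant>/timelines/<timeline>/
    let timeline_dir = std::env::args()
        .nth(1)
        .expect("usage: layer-histogram <timeline dir>");

    // key range -> (delta layer count, image layer count)
    let mut per_key_range: HashMap<String, (u64, u64)> = HashMap::new();

    for entry in fs::read_dir(&timeline_dir)? {
        let name = entry?.file_name().into_string().unwrap_or_default();
        // Layer file names look like `<key range>__<lsn part>`; skip everything else.
        let Some((key_range, lsn_part)) = name.split_once("__") else {
            continue;
        };
        let counts = per_key_range.entry(key_range.to_string()).or_default();
        // Heuristic: delta layers carry an LSN *range* (two 16-hex-digit LSNs joined
        // by '-'); image layers carry a single LSN.
        if lsn_part.split('-').take(2).filter(|p| p.len() == 16).count() == 2 {
            counts.0 += 1; // delta layer
        } else {
            counts.1 += 1; // image layer
        }
    }

    // Print the 20 key ranges with the most delta layers.
    let mut rows: Vec<_> = per_key_range.into_iter().collect();
    rows.sort_by_key(|(_, (deltas, _))| std::cmp::Reverse(*deltas));
    for (key_range, (deltas, images)) in rows.into_iter().take(20) {
        println!("{deltas:>8} deltas  {images:>8} images  {key_range}");
    }
    Ok(())
}
```

Sorting by delta count makes skewed ranges like the one above easy to spot.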
Actions
To reduce the number of L1 files in this range, we attempted a manual compaction:
curl -X PUT 'http://hostname:9898/v1/tenant/cae15f273a43f591dc980aab10ca7c21/timeline/bf84cf04d98d17798586d6a322630578/compact' -H 'Content-Type: application/json' -d '{"compact_key_range": {"start": "000000067F00004049000040500004D1DF1A", "end": "030000000000000000000000000000000002"}}'
Since the legacy compaction does not support range-based compaction, we switched to the new tiered_compaction algorithm. The tiered_compaction job is now running: it is processing ~200K jobs, with each job taking about 7 seconds to complete, so the estimated time to completion is around 16 days (200,000 × 7 s ≈ 1.4M seconds).
We hope this will reduce the number of L1 files to a manageable level so that we can re-enable read traffic on the cluster.
Questions
- Would a metadata range with a long history cause high tail latency on GetPage@LSN on the data path?
- What could have led to this state, where one key range accumulated such a long history without being compacted?
- Are there any key metrics that we should have paid attention to during the ingestion?
Proposals
Compaction
We have identified several points in compaction paths that can be parallelized:
- tiered_compaction job execution: would help reduce the overall compaction time;
- image creation: would help reduce the image creation time from its current P99 of ~16 minutes.

While the code itself is parallelizable, the compaction job is so CPU-heavy that it might starve other tasks under the current execution model (even with FuturesUnordered). We propose:
- Introducing a dedicated thread pool (or runtime) for compaction jobs to isolate and parallelize them better (see the sketch after this list).
- Making compaction jobs resumable. Currently, if the Pageserver restarts mid-run, compaction state is lost. This is especially painful for long-running jobs like ours.
- In legacy compaction, the different compaction stages and even GC always run in sequence. In theory these steps are independent and could be split into individual steps to improve efficiency.
- Improving the eviction algorithm: pageserver: thrashing on compaction and eviction loop #12123
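To make the dedicated thread pool / runtime idea concrete, here is a minimal sketch, assuming Tokio with the rt-multi-thread and macros features; CompactionJob, run_compaction_job, and the thread counts are hypothetical placeholders, not actual Pageserver types. The idea is to hand CPU-heavy compaction work to a separate runtime's blocking pool so it cannot starve latency-sensitive tasks such as GetPage@LSN handling on the main runtime.

```rust
use std::sync::OnceLock;
use tokio::runtime::{Builder, Runtime};

/// Hypothetical unit of compaction work (key range, LSN range, input layers, ...).
struct CompactionJob {}

/// Lazily-built runtime reserved for compaction, sized independently of the main
/// request-serving runtime.
fn compaction_runtime() -> &'static Runtime {
    static RT: OnceLock<Runtime> = OnceLock::new();
    RT.get_or_init(|| {
        Builder::new_multi_thread()
            .worker_threads(2) // just enough threads to drive the runtime itself
            .max_blocking_threads(8) // cap on concurrently running compaction jobs
            .thread_name("compaction-worker")
            .enable_all()
            .build()
            .expect("failed to build compaction runtime")
    })
}

/// Placeholder for the CPU-heavy work: merging delta layers into new L1/image layers.
fn run_compaction_job(_job: CompactionJob) {}

/// Called from the main runtime: hand the job to the dedicated pool and await its
/// completion without tying up any of the main runtime's worker threads.
async fn submit_compaction(job: CompactionJob) {
    compaction_runtime()
        .spawn_blocking(move || run_compaction_job(job))
        .await
        .expect("compaction job panicked");
}

#[tokio::main]
async fn main() {
    submit_compaction(CompactionJob {}).await;
}
```

Here the max_blocking_threads cap is what bounds compaction parallelism; the small worker_threads value only drives the dedicated runtime itself.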
Storage broker
- Not related to this particular large table, but one more thing we noticed is that the Pageserver has a hard dependency on storage-broker in recent commits. Can we make this a soft dependency? The behavior we saw is that the Pageserver cannot advance the persistent_lsn when it fails to talk to storage-broker. This in turn made it impossible to ack WAL from the Safekeeper, so the Safekeeper could not delete its WAL and eventually ran out of disk.