Note: This is a long issue. We would like to collect early feedback on the proposed directions before diving too deeply into any specific area of improvement.
Background
We ingested a table with one secondary index into a single standard Neon cluster:
- Row count: 110,728,822,784 rows
- Size: 16TB
- Secondary index size: 13TB
We are running a single tenant on a Pageserver with 42TB of SSD storage and 24 cores. Resources on the Pageserver are never saturated.
Configuration
We are running a single-shard Pageserver.
…
disk_usage_based_eviction = { max_usage_pct = 90, min_avail_bytes = 4200000000000, period = '1m' }
tenant_config = { pitr_interval = '1 hour' }
…
Status
After ingestion and index creation, the L1 layers have accumulated to a degree that makes reads extremely slow.
Specifically, we enabled a workload that generated about 1200 GetPage@LSN requests per second, and collected the following stats:
- P99 GetPage@LSN: 2.5s
- P99 number of deltas per read: > 256 (largest bucket)
- Layer count
- L1: 175K (40TB)
- Images: 133K (10TB)
Investigation
Our hypotheses:
- A large number of L1 layers causes high tail read latency.
- A prolonged ingestion period led to this accumulation of L1 layers.
- Auto-compaction failed to compact these layers effectively.
To explore this further, we captured a snapshot of layer files stored locally in the Pageserver. One key range stood out with an unusually large number of LSN ranges (L1):
000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002
There are 2408 L1 files for this key range:
000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00002D2B06FE7E81-00002D2BB7E3EA59-v1-00000001
…
000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00003E79B55FEA09-00003E7A53EDE611-v1-00000001
000000067F000040490002800000FFFFFFFF-030000000000000000000000000000000002__00003E7A53EDE611-00003E7AF27BFD19-v1-00000001
Each file is about 9MB. After looking into the key format, we found that this is a metadata key range that contains relsize and aux files.
We don’t see a significant skew on other ranges but are happy to provide more detail on them if needed.
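For context, the per-key-range breakdown above can be reproduced with something like the following; this is a minimal sketch (not taken from the Pageserver code), assuming layer files live under `<workdir>/tenants/<tenant_id>/timelines/<timeline_id>/` and follow the `<key range>__<LSN or LSN range>[-v1-<generation>]` naming shown above. The path layout and the delta-vs-image heuristic are assumptions.

```rust
use std::collections::HashMap;
use std::fs;

fn main() -> std::io::Result<()> {
    // Path to a local timeline directory, e.g. <workdir>/tenants/<tenant>/timelines/<timeline>/
    let timeline_dir = std::env::args()
        .nth(1)
        .expect("usage: layer-histogram <timeline dir>");

    // key range -> (delta layer count, image layer count)
    let mut per_key_range: HashMap<String, (u64, u64)> = HashMap::new();

    for entry in fs::read_dir(&timeline_dir)? {
        let name = entry?.file_name().into_string().unwrap_or_default();
        // Layer file names look like `<key range>__<lsn part>`; skip everything else.
        let Some((key_range, lsn_part)) = name.split_once("__") else {
            continue;
        };
        let counts = per_key_range.entry(key_range.to_string()).or_default();
        // Heuristic: delta layers carry an LSN *range* (two 16-hex-digit LSNs joined
        // by '-'); image layers carry a single LSN.
        if lsn_part.split('-').take(2).filter(|p| p.len() == 16).count() == 2 {
            counts.0 += 1; // delta layer
        } else {
            counts.1 += 1; // image layer
        }
    }

    // Print the 20 key ranges with the most delta layers.
    let mut rows: Vec<_> = per_key_range.into_iter().collect();
    rows.sort_by_key(|(_, (deltas, _))| std::cmp::Reverse(*deltas));
    for (key_range, (deltas, images)) in rows.into_iter().take(20) {
        println!("{deltas:>8} deltas  {images:>8} images  {key_range}");
    }
    Ok(())
}
```

Sorting by delta count makes skewed ranges like the one above easy to spot.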
Actions
To reduce the number of L1 files in this range, we attempted a manual compaction:
curl -X PUT 'http://hostname:9898/v1/tenant/cae15f273a43f591dc980aab10ca7c21/timeline/bf84cf04d98d17798586d6a322630578/compact' -H 'Content-Type: application/json' -d '{"compact_key_range": {"start": "000000067F00004049000040500004D1DF1A", "end": "030000000000000000000000000000000002"}}'
Since the legacy compaction does not support range-based compaction, we switched to the new tiered_compaction algorithm. The tiered_compaction job is now running: it is processing ~200K jobs, with each job taking about 7 seconds to complete, so the estimated time to completion is around 16 days (200,000 × 7 s ≈ 1.4M seconds).
We hope this will reduce the number of L1 files to a manageable level so that we can re-enable read traffic on the cluster.
Questions
- Would a metadata range with a long history cause high tail latency on GetPage@LSN on the data path?
- What could have led to this state, where one key range accumulated such a long history without being compacted?
- Are there any key metrics that we should have paid attention to during the ingestion?
Proposals
Compaction
We have identified several points in compaction paths that can be parallelized:
- tiered_compaction job execution: would help reduce the overall compaction time;
- image creation: would help reduce the image creation time from its current P99 of ~16 minutes.

While the code itself is parallelizable, the compaction job is so CPU-heavy that it might starve other tasks under the current execution model (even with FuturesUnordered). We propose:
- Introducing a dedicated thread pool (or runtime) for compaction jobs to isolate and parallelize them better (see the sketch after this list).
- Making compaction jobs resumable. Currently, if the Pageserver restarts mid-run, compaction state is lost. This is especially painful for long-running jobs like ours.
- In legacy compaction, the different compaction stages and even GC always run in sequence. In theory these steps are independent and could be split into individual steps to improve efficiency.
- Improving the eviction algorithm: pageserver: thrashing on compaction and eviction loop #12123
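To make the dedicated thread pool / runtime idea concrete, here is a minimal sketch, assuming Tokio with the rt-multi-thread and macros features; CompactionJob, run_compaction_job, and the thread counts are hypothetical placeholders, not actual Pageserver types. The idea is to hand CPU-heavy compaction work to a separate runtime's blocking pool so it cannot starve latency-sensitive tasks such as GetPage@LSN handling on the main runtime.

```rust
use std::sync::OnceLock;
use tokio::runtime::{Builder, Runtime};

/// Hypothetical unit of compaction work (key range, LSN range, input layers, ...).
struct CompactionJob {}

/// Lazily-built runtime reserved for compaction, sized independently of the main
/// request-serving runtime.
fn compaction_runtime() -> &'static Runtime {
    static RT: OnceLock<Runtime> = OnceLock::new();
    RT.get_or_init(|| {
        Builder::new_multi_thread()
            .worker_threads(2) // just enough threads to drive the runtime itself
            .max_blocking_threads(8) // cap on concurrently running compaction jobs
            .thread_name("compaction-worker")
            .enable_all()
            .build()
            .expect("failed to build compaction runtime")
    })
}

/// Placeholder for the CPU-heavy work: merging delta layers into new L1/image layers.
fn run_compaction_job(_job: CompactionJob) {}

/// Called from the main runtime: hand the job to the dedicated pool and await its
/// completion without tying up any of the main runtime's worker threads.
async fn submit_compaction(job: CompactionJob) {
    compaction_runtime()
        .spawn_blocking(move || run_compaction_job(job))
        .await
        .expect("compaction job panicked");
}

#[tokio::main]
async fn main() {
    submit_compaction(CompactionJob {}).await;
}
```

Here the max_blocking_threads cap is what bounds compaction parallelism; the small worker_threads value only drives the dedicated runtime itself.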
Storage broker
- Not related to this particular large table, but one more thing we noticed is that the Pageserver has a hard dependency on storage-broker in recent commits. Can we make this a soft dependency? The behavior we saw is that the Pageserver cannot advance the persistent_lsn when it fails to talk to storage-broker. This in turn made it impossible to ack WAL from the Safekeeper, so the Safekeeper could not delete its WAL and eventually ran out of disk.