docs/sql_reference.md
7 additions & 3 deletions
@@ -494,7 +494,11 @@ ORDER BY request_count DESC
 
 **Why Results Are Approximate** <br>
 
-Results are approximate because some globally significant IPs might not appear in individual nodes' top 10 lists due to uneven data distribution across nodes. For example, an IP with moderate traffic across all nodes might have a high global total but not rank in any single node's top 10.
+The approx_topk function returns approximate results because it relies on each query node sending only its local top N entries to the leader. The leader combines these partial lists to produce the final result.
+
+If a value appears frequently across all nodes but never ranks in the top N on any individual node, it is excluded. This can cause high-frequency values to be missed globally.
+
+For example, if an IP receives 400, 450, and 500 requests across three nodes but ranks 11th on each, it will not appear in any node’s top 10. Even though the global total is 1,350, it will be missed.
 
 **Limitations** <br>
 
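The merge behavior described in the added lines of this hunk can be reproduced with a small simulation. This is only an illustrative sketch, not the query engine's implementation: the function names `approx_topk_merge` and `exact_topk`, the node layout, and the IP addresses are all assumptions made up for the example. It mirrors the 400/450/500 case from the text, where the shared IP ranks 11th on every node.

```python
from collections import Counter

def approx_topk_merge(node_counts, k):
    """Hypothetical leader-side merge: sum only each node's local top-k entries."""
    merged = Counter()
    for counts in node_counts:
        for key, n in Counter(counts).most_common(k):
            merged[key] += n
    return merged.most_common(k)

def exact_topk(node_counts, k):
    """Reference answer: sum every count from every node, then take the top k."""
    total = Counter()
    for counts in node_counts:
        total.update(counts)
    return total.most_common(k)

# Invented data: each node has 10 node-local IPs with 600..609 requests,
# plus one shared IP that ranks 11th locally (400, 450, 500 requests).
shared_ip = "10.0.0.99"
nodes = []
for node_id, shared_count in enumerate([400, 450, 500]):
    counts = {f"192.168.{node_id}.{i}": 600 + i for i in range(10)}
    counts[shared_ip] = shared_count
    nodes.append(counts)

approx = approx_topk_merge(nodes, 10)
exact = exact_topk(nodes, 10)

print(shared_ip in dict(approx))  # False: dropped before reaching the leader
print(exact[0])                   # ('10.0.0.99', 1350): the true global leader
```

In this setup the shared IP has the highest global total, yet the merged result never sees it because no node included it in its local list.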
@@ -515,7 +519,7 @@ ORDER BY request_count DESC
 -**field2**: The field to count distinct values of.
 -**k**: Number of top results to return.
 
-- Uses HyperLogLog algorithm for efficient distinct counting and Space-Saving algorithm for top-K selection on high-cardinality data.
+- Uses [**HyperLogLog** algorithm] for efficient distinct counting and Space-Saving algorithm for top-K selection on high-cardinality data.
 - Results are approximate due to the probabilistic nature of both algorithms and distributed processing across partitions.
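To make the "probabilistic nature" of the distinct counts concrete, here is a minimal HyperLogLog sketch. It is a textbook-style illustration under stated assumptions, not the code the engine runs: the function name `hll_estimate`, the choice of SHA-1 as the hash, and the precision `p=10` (1,024 registers) are all picked for the example.

```python
import hashlib
import math

def hll_estimate(values, p=10):
    """Estimate the number of distinct values with m = 2**p registers (assumed p)."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # Take 64 bits of a SHA-1 digest as the hash (an arbitrary choice here).
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                    # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1 bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias correction, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)         # small-range (linear counting) correction
    return raw

values = [f"user-{i}" for i in range(10000)]
estimate = hll_estimate(values)
repeat_estimate = hll_estimate(values + values)  # duplicates do not change the registers
print(round(estimate), round(repeat_estimate))
```

The estimate lands near the true count of 10,000 rather than exactly on it, while memory stays fixed at 1,024 small registers regardless of cardinality. That is the trade-off the bullet above refers to.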
 
 **Example:**
@@ -568,7 +572,7 @@ ORDER BY distinct_count DESC
 {"clientip":"172.16.0.30","distinct_count":790}
 {"clientip":"192.168.1.150","distinct_count":690}
 ```
-??? info "The HyperLogLog Algorithm Explained:"
+??? info "The HyperLogLog Algorithm Explained:"
 **Problem Statement**
 
 Traditional `GROUP BY` operations with `DISTINCT` counts on high-cardinality fields can cause memory exhaustion in distributed systems. Consider this query: