docs/sql_reference.md
Lines changed: 273 additions & 2 deletions
@@ -332,6 +332,277 @@ Each row in the result shows:
332
332
!!! note
333
333
To avoid unexpected bucket sizes based on internal defaults, always specify the bucket duration explicitly using units.
334
334
335
-
!!! info "Datafusion SQL reference"
336
-
OpenObserve uses [Apache DataFusion](https://datafusion.apache.org/user-guide/sql/index.html) as its query engine. All supported SQL syntax and functions are available through DataFusion.
335
+
---
336
+
337
+
### `approx_topk(field, k)`
338
+
339
+
**Description:**
340
+
341
+
- Returns the top `K` most frequent values for a specified field using the **Space-Saving algorithm** optimized for high-cardinality data.
342
+
- Results are approximate due to distributed processing. Globally significant values may be missed if they do not appear in enough partitions' local top-K results.
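As a rough illustration of how a Space-Saving summary bounds memory, here is a minimal Python sketch of the textbook algorithm. It is not OpenObserve's implementation; the names `space_saving`, `top_k`, and `capacity` are hypothetical and exist only for this example.

```python
def space_saving(stream, capacity):
    """Track approximate counts using at most `capacity` counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # Evict the item with the smallest count. The newcomer inherits
            # that count plus one, so tracked counts can only overestimate.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters


def top_k(counters, k):
    """Return the k tracked items with the highest (approximate) counts."""
    return sorted(counters.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Two frequent IPs followed by 20 IPs that each appear once.
stream = (
    ["10.0.0.1"] * 50
    + ["10.0.0.2"] * 30
    + [f"192.168.0.{i}" for i in range(20)]
)
summary = space_saving(stream, capacity=3)
print(top_k(summary, 2))
```

The summary never holds more than `capacity` entries, regardless of how many distinct values the stream contains, which is the property that keeps memory bounded on high-cardinality fields.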
343
+
344
+
**Example:**
345
+
```sql
346
+
SELECT approx_topk(clientip, 10) FROM "default"
347
+
```
348
+
It returns the `10` most frequently occurring client IP addresses from the `default` stream.
349
+
??? info "The Space-Saving Algorithm Explained:"
350
+
The Space-Saving algorithm enables efficient top-K queries on high-cardinality data by limiting memory usage during distributed query execution. This approach trades exact precision for system stability and performance. <br>
351
+
**Problem Statement** <br>
352
+
353
+
Traditional GROUP BY operations on high-cardinality fields can cause memory exhaustion in distributed systems. Consider this query:
354
+
355
+
```sql
356
+
SELECT clientip, count(*) as cnt
357
+
FROM default
358
+
GROUP BY clientip
359
+
ORDER BY cnt DESC
360
+
LIMIT 10
361
+
```
362
+
363
+
**Challenges:**
364
+
365
+
- Dataset contains 3 million unique client IP addresses
366
+
- Query executes across 60 CPU cores with 60 data partitions
367
+
- Each core maintains hash tables during aggregation across all partitions
Resources exhausted: Failed to allocate additional 63232256 bytes for GroupedHashAggregateStream[20] with 0 bytes already allocated for this reservation - 51510301 bytes remain available for the total pool
373
+
```
374
+
375
+
**Solution: Space-Saving Mechanism** <br>
376
+
377
+
```sql
378
+
SELECT approx_topk(clientip, 10) FROM default
379
+
```
380
+
381
+
Instead of returning all unique values from each partition, each partition returns only its top 10 results. The leader node then aggregates these partial results to compute the final top 10.
382
+
383
+
**Example: Web Server Log Analysis** <br>
384
+
385
+
**Scenario** <br>
386
+
Find the top 10 client IPs by request count from web server logs distributed across 3 follower query nodes.
Results are approximate because some globally significant IPs might not appear in individual nodes' top 10 lists due to uneven data distribution across nodes. For example, an IP with moderate traffic across all nodes might have a high global total but not rank in any single node's top 10.
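The effect described above can be reproduced with a small simulation. This is a sketch under stated assumptions: three hypothetical nodes, made-up IPs and counts, `k = 2` instead of 10 to keep the data small, and helper names (`local_top_k`, `merged_top_k`, `exact_top_k`) invented for illustration. It models only the "each partition returns its local top K, the leader merges" step, not OpenObserve's actual code.

```python
from collections import Counter


def local_top_k(counts, k):
    """Top-k (value, count) pairs that a single node would return."""
    return Counter(counts).most_common(k)


def merged_top_k(partitions, k):
    """Leader-side merge of every node's local top-k list."""
    merged = Counter()
    for counts in partitions:
        for value, n in local_top_k(counts, k):
            merged[value] += n
    return merged.most_common(k)


def exact_top_k(partitions, k):
    """Exact answer: sum the full counts from every node, then rank."""
    total = Counter()
    for counts in partitions:
        total.update(counts)
    return total.most_common(k)


# Hypothetical per-node request counts. 10.0.0.9 ranks third on every node.
partitions = [
    {"10.0.0.1": 100, "10.0.0.2": 90, "10.0.0.9": 80},
    {"10.0.0.3": 100, "10.0.0.4": 90, "10.0.0.9": 80},
    {"10.0.0.5": 100, "10.0.0.6": 90, "10.0.0.9": 80},
]

approx = merged_top_k(partitions, k=2)
exact = exact_top_k(partitions, k=2)
print("approximate:", approx)
print("exact:      ", exact)
```

In this data, `10.0.0.9` has the highest global total (240 requests) but never enters any node's local top 2, so the merged result omits it entirely.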
454
+
455
+
**Limitations** <br>
456
+
457
+
- Results are approximate, not exact.
458
+
- Accuracy depends on data distribution across partitions.
459
+
- Filter clauses are not currently supported with `approx_topk`.
460
+
461
+
---
462
+
463
+
### `approx_topk_distinct(field1, field2, k)`
464
+
465
+
**Description:**
466
+
467
+
- Returns the top `K` values from `field1` that have the most unique values in `field2`.
468
+
- Here:
469
+
470
+
- **field1**: The field to group by and return top results for.
471
+
- **field2**: The field to count distinct values of.
472
+
- **k**: Number of top results to return.
473
+
474
+
- Uses the HyperLogLog algorithm for efficient distinct counting and the Space-Saving algorithm for top-K selection on high-cardinality data.
475
+
- Results are approximate due to the probabilistic nature of both algorithms and distributed processing across partitions.
476
+
477
+
**Example:**
478
+
479
+
```sql
480
+
SELECT approx_topk_distinct(clientip, clientas, 3) FROM "default" ORDER BY _timestamp DESC
481
+
```
482
+
It returns the top 3 client IP addresses that have the most unique user agents.
483
+
484
+
??? info "The HyperLogLog Algorithm Explained:"
485
+
**Problem Statement**
486
+
487
+
Traditional `GROUP BY` operations with `DISTINCT` counts on high-cardinality fields can cause memory exhaustion in distributed systems. Consider this query:
488
+
489
+
```sql
490
+
SELECT clientip, count(distinct clientas) as cnt
491
+
FROM default
492
+
GROUP BY clientip
493
+
ORDER BY cnt DESC
494
+
LIMIT 10
495
+
```
496
+
497
+
**Challenges:**
498
+
499
+
- Dataset contains 3 million unique client IP addresses.
500
+
- Each client IP can have thousands of unique user agents (`clientas`).
501
+
- Total unique user agents: 100 million values.
502
+
- Query executes across 60 CPU cores with 60 data partitions.
503
+
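- Memory usage for distinct counting: exact counting must store every unique value seen, so memory grows without bound as cardinality increases.
504
+
- Combined with grouping: each group needs its own set of distinct values, so memory requirements multiply with the number of groups.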
505
+
506
+
**Typical Error Message:**
507
+
```
508
+
Resources exhausted: Failed to allocate additional 63232256 bytes for GroupedHashAggregateStream[20] with 0 bytes already allocated for this reservation - 51510301 bytes remain available for the total pool
1. **HyperLogLog approximation:** Distinct counts are estimated, not exact.
595
+
2. **Space-Saving distribution:** Some globally significant client IPs might not appear in individual nodes' top 10 lists due to uneven data distribution.
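To make the first source of approximation concrete, here is a minimal, self-contained HyperLogLog sketch in Python. It is an illustration of the general technique, not OpenObserve's or DataFusion's implementation; the class name, the choice of SHA-1 as the hash, and the precision `p = 12` are assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Fixed-size distinct-count estimator with 2**p small registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p  # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Take 64 bits of a SHA-1 digest as the hash (an arbitrary choice).
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)  # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Merging sketches from two partitions is a register-wise max.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Two hypothetical partitions with overlapping values; 20,000 distinct overall.
part_a, part_b, full = HyperLogLog(), HyperLogLog(), HyperLogLog()
for i in range(12_000):
    part_a.add(f"agent-{i}")
for i in range(8_000, 20_000):
    part_b.add(f"agent-{i}")
for i in range(20_000):
    full.add(f"agent-{i}")

part_a.merge(part_b)
print(round(part_a.estimate()))
```

Each sketch uses a fixed 4,096 registers no matter how many values it sees, and merging two partitions' sketches yields the same registers as sketching the combined data directly. The price is an estimate with a small relative error instead of an exact count.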
596
+
597
+
**Limitations**
598
+
599
+
- Results are approximate, not exact.
600
+
- Distinct count accuracy depends on HyperLogLog algorithm precision.
601
+
- Filter clauses are not currently supported with `approx_topk_distinct`.
602
+
603
+
604
+
---
605
+
606
+
## Related Links
607
+
OpenObserve uses [Apache DataFusion](https://datafusion.apache.org/user-guide/sql/index.html) as its query engine. All supported SQL syntax and functions are available through DataFusion.
docs/user-guide/pipelines/use-pipelines.md
Lines changed: 1 addition & 1 deletion
@@ -154,7 +154,7 @@ The above example illustrates a basic pipeline setup. However, pipelines can bec
154
154
<br>
155
155
156
156
## FAQ
157
-
**Q**: If I set the frequency to 5 minutes and the current time is 23:03, when will the next runs happen?
157
+
**Q**: If I set the frequency to 5 minutes and the current time is 23:03, when will the next runs happen? <br>
158
158
**A**: OpenObserve aligns the next run to the nearest upcoming time that is divisible by the frequency, starting from the top of the hour in the configured timezone. This ensures that all runs occur at consistent and predictable intervals.
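The alignment rule can be sketched in a few lines of Python. This is an illustration only, with several assumptions: the function name `next_run` is hypothetical, the frequency is given in whole minutes, the `datetime` passed in is already expressed in the configured timezone, and a time that falls exactly on a boundary is scheduled for the following boundary.

```python
from datetime import datetime, timedelta


def next_run(now, frequency_minutes):
    """Next run time aligned to multiples of the frequency from the top of the hour."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    step = timedelta(minutes=frequency_minutes)
    # Number of whole intervals already elapsed in the current hour.
    intervals_done = (now - top_of_hour) // step
    return top_of_hour + (intervals_done + 1) * step


now = datetime(2025, 1, 1, 23, 3)
for freq in (5, 15, 30):
    print(freq, next_run(now, freq).strftime("%H:%M"))
```

Under these assumptions, a 5-minute frequency at 23:03 yields 23:05, because 23:05 is the first upcoming time whose minute offset from the top of the hour is a multiple of 5.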
159
159
**Example**<br>
160
160
If the current time is 23:03, here is when the next run will occur for different frequencies: