
Commit e02eb08

Merge pull request #85 from openobserve/topk-documentation
documented approx_topk and approx_topk_distinct
2 parents 2fa3a1a + b78b0fa commit e02eb08

File tree

2 files changed: 274 additions, 3 deletions

docs/sql_reference.md

Lines changed: 273 additions & 2 deletions
@@ -332,6 +332,277 @@ Each row in the result shows:
!!! note
    To avoid unexpected bucket sizes based on internal defaults, always specify the bucket duration explicitly using units.

---

### `approx_topk(field, k)`

**Description:**

- Returns the top `K` most frequent values for a specified field using the **Space-Saving algorithm**, optimized for high-cardinality data.
- Results are approximate due to distributed processing. Globally significant values may be missed if they do not appear in enough partitions' local top-K results.

**Example:**

```sql
SELECT approx_topk(clientip, 10) FROM "default"
```

This returns the 10 most frequently occurring client IP addresses from the `default` stream.
??? info "The Space-Saving Algorithm Explained"
The Space-Saving algorithm enables efficient top-K queries on high-cardinality data by limiting memory usage during distributed query execution. This approach trades exact precision for system stability and performance. <br>

**Problem Statement** <br>

Traditional GROUP BY operations on high-cardinality fields can cause memory exhaustion in distributed systems. Consider this query:

```sql
SELECT clientip, count(*) as cnt
FROM default
GROUP BY clientip
ORDER BY cnt DESC
LIMIT 10
```

**Challenges:**

- The dataset contains 3 million unique client IP addresses.
- The query executes across 60 CPU cores with 60 data partitions.
- Each core maintains hash tables during aggregation across all partitions.
- Memory usage: 3M entries × 60 cores × 60 partitions = 10.8 billion hash table entries.

**Typical Error Message:**
```
Resources exhausted: Failed to allocate additional 63232256 bytes for GroupedHashAggregateStream[20] with 0 bytes already allocated for this reservation - 51510301 bytes remain available for the total pool
```

**Solution: Space-Saving Mechanism** <br>

```sql
SELECT approx_topk(clientip, 10) FROM default
```

Instead of returning all unique values from each partition, each partition returns only its top 10 results. The leader node then aggregates these partial results to compute the final top 10.
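
To make the per-partition step concrete, here is a minimal Space-Saving counter sketched in Python. This is a hypothetical illustration of the algorithm's counter-replacement idea, not OpenObserve's implementation:

```python
# Minimal Space-Saving counter (illustrative sketch, not OpenObserve's code).
# It keeps at most `capacity` counters; when a new value arrives and the table
# is full, the smallest counter is evicted and its count is inherited, which
# bounds memory while possibly overestimating some frequencies.
class SpaceSaving:
    def __init__(self, capacity):
        self.capacity = capacity
        self.counters = {}

    def offer(self, item):
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.capacity:
            self.counters[item] = 1
        else:
            # Evict the current minimum; the newcomer inherits min_count + 1.
            min_item = min(self.counters, key=self.counters.get)
            min_count = self.counters.pop(min_item)
            self.counters[item] = min_count + 1

    def topk(self, k):
        return sorted(self.counters.items(), key=lambda kv: -kv[1])[:k]

# Each partition would feed its rows through such a counter and ship only
# counter.topk(10) to the leader.
partition = SpaceSaving(capacity=100)
for ip in ["10.0.0.5", "10.0.0.5", "192.168.1.100"]:
    partition.offer(ip)
print(partition.topk(2))   # [('10.0.0.5', 2), ('192.168.1.100', 1)]
```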

**Example: Web Server Log Analysis** <br>

**Scenario** <br>
Find the top 10 client IPs by request count from web server logs distributed across 3 follower query nodes.

**Raw Data Distribution** <br>

| Rank | Node 1 | Requests | Node 2 | Requests | Node 3 | Requests |
|------|---------|----------|---------|----------|---------|----------|
| 1 | 192.168.1.100 | 850 | 192.168.1.100 | 920 | 192.168.1.100 | 880 |
| 2 | 10.0.0.5 | 720 | 203.0.113.50 | 780 | 10.0.0.5 | 810 |
| 3 | 203.0.113.50 | 680 | 198.51.100.75 | 740 | 203.0.113.50 | 750 |
| 4 | 172.16.0.10 | 620 | 10.0.0.5 | 700 | 198.51.100.75 | 690 |
| 5 | 192.168.1.200 | 580 | 172.16.0.10 | 660 | 172.16.0.10 | 650 |
| 6 | 198.51.100.75 | 540 | 192.168.1.200 | 640 | 192.168.1.200 | 610 |
| 7 | 10.0.0.25 | 500 | 203.0.113.80 | 600 | 203.0.113.80 | 570 |
| 8 | 172.16.0.30 | 480 | 172.16.0.30 | 580 | 10.0.0.25 | 530 |
| 9 | 203.0.113.80 | 460 | 10.0.0.25 | 560 | 172.16.0.30 | 490 |
| 10 | 192.168.1.150 | 440 | 192.168.1.150 | 520 | 192.168.1.150 | 450 |

**Follower Query Nodes Process Data** <br>

Each follower node executes the query locally and returns only its top 10 results (identical to the raw distribution above, since each node in this example holds exactly 10 IPs):

| Rank | Node 1 → Leader | Requests | Node 2 → Leader | Requests | Node 3 → Leader | Requests |
|------|-----------------|----------|-----------------|----------|-----------------|----------|
| 1 | 192.168.1.100 | 850 | 192.168.1.100 | 920 | 192.168.1.100 | 880 |
| 2 | 10.0.0.5 | 720 | 203.0.113.50 | 780 | 10.0.0.5 | 810 |
| 3 | 203.0.113.50 | 680 | 198.51.100.75 | 740 | 203.0.113.50 | 750 |
| 4 | 172.16.0.10 | 620 | 10.0.0.5 | 700 | 198.51.100.75 | 690 |
| 5 | 192.168.1.200 | 580 | 172.16.0.10 | 660 | 172.16.0.10 | 650 |
| 6 | 198.51.100.75 | 540 | 192.168.1.200 | 640 | 192.168.1.200 | 610 |
| 7 | 10.0.0.25 | 500 | 203.0.113.80 | 600 | 203.0.113.80 | 570 |
| 8 | 172.16.0.30 | 480 | 172.16.0.30 | 580 | 10.0.0.25 | 530 |
| 9 | 203.0.113.80 | 460 | 10.0.0.25 | 560 | 172.16.0.30 | 490 |
| 10 | 192.168.1.150 | 440 | 192.168.1.150 | 520 | 192.168.1.150 | 450 |

**Leader Query Node Aggregates Results** <br>

| Client IP | Node 1 | Node 2 | Node 3 | Total Requests |
|-----------|---------|---------|---------|----------------|
| 192.168.1.100 | 850 | 920 | 880 | **2,650** |
| 10.0.0.5 | 720 | 700 | 810 | **2,230** |
| 203.0.113.50 | 680 | 780 | 750 | **2,210** |
| 198.51.100.75 | 540 | 740 | 690 | **1,970** |
| 172.16.0.10 | 620 | 660 | 650 | **1,930** |
| 192.168.1.200 | 580 | 640 | 610 | **1,830** |
| 203.0.113.80 | 460 | 600 | 570 | **1,630** |
| 10.0.0.25 | 500 | 560 | 530 | **1,590** |
| 172.16.0.30 | 480 | 580 | 490 | **1,550** |
| 192.168.1.150 | 440 | 520 | 450 | **1,410** |

**Final Top 10 Results:**

| Rank | Client IP | Total Requests |
|------|-----------|----------------|
| 1 | 192.168.1.100 | 2,650 |
| 2 | 10.0.0.5 | 2,230 |
| 3 | 203.0.113.50 | 2,210 |
| 4 | 198.51.100.75 | 1,970 |
| 5 | 172.16.0.10 | 1,930 |
| 6 | 192.168.1.200 | 1,830 |
| 7 | 203.0.113.80 | 1,630 |
| 8 | 10.0.0.25 | 1,590 |
| 9 | 172.16.0.30 | 1,550 |
| 10 | 192.168.1.150 | 1,410 |

**Why Results Are Approximate** <br>

Results are approximate because some globally significant IPs might not appear in individual nodes' top 10 lists due to uneven data distribution across nodes. For example, an IP with moderate traffic on every node might have a high global total yet not rank in any single node's top 10, as the snippet below illustrates.
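
A small, hypothetical illustration of this effect (the values and counts are made up): value `x` is the true global leader with a total count of 240, but because it ranks third on every node, it never reaches the leader when each node ships only its top 2:

```python
# Hypothetical per-node counts. "x" totals 240 globally (the true #1),
# yet ranks third on every node, so a per-node top-2 never reports it.
node_counts = [
    {"a": 100, "b": 90, "x": 80},   # node 1
    {"c": 100, "d": 90, "x": 80},   # node 2
    {"e": 100, "f": 90, "x": 80},   # node 3
]
k = 2
merged = {}
for counts in node_counts:
    local_topk = sorted(counts.items(), key=lambda kv: -kv[1])[:k]
    for item, cnt in local_topk:            # the leader sees only these
        merged[item] = merged.get(item, 0) + cnt

print(sorted(merged.items(), key=lambda kv: -kv[1])[:k])
# [('a', 100), ('c', 100)] -- "x" (global count 240) is missing entirely.
```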

**Limitations** <br>

- Results are approximate, not exact.
- Accuracy depends on data distribution across partitions.
- Filter clauses are not currently supported with `approx_topk`.

---
### `approx_topk_distinct(field1, field2, k)`

**Description:**

- Returns the top `K` values of `field1` that have the most unique values of `field2`, where:

    - **field1**: The field to group by and return top results for.
    - **field2**: The field whose distinct values are counted.
    - **k**: The number of top results to return.

- Uses the HyperLogLog algorithm for efficient distinct counting and the Space-Saving algorithm for top-K selection on high-cardinality data.
- Results are approximate due to the probabilistic nature of both algorithms and distributed processing across partitions.

**Example:**

```sql
SELECT approx_topk_distinct(clientip, clientas, 3) FROM "default" ORDER BY _timestamp DESC
```

This returns the top 3 client IP addresses with the most unique user agents.

??? info "The HyperLogLog Algorithm Explained"
**Problem Statement**

Traditional `GROUP BY` operations with `DISTINCT` counts on high-cardinality fields can cause memory exhaustion in distributed systems. Consider this query:

```sql
SELECT clientip, count(distinct clientas) as cnt
FROM default
GROUP BY clientip
ORDER BY cnt DESC
LIMIT 10
```

**Challenges:**

- The dataset contains 3 million unique client IP addresses.
- Each client IP can have thousands of unique user agents (`clientas`).
- Total unique user agents: 100 million values.
- The query executes across 60 CPU cores with 60 data partitions.
- Memory usage for distinct counting: potentially unbounded storage for tracking unique values.
- Combined with grouping, memory requirements grow multiplicatively.

**Typical Error Message:**
```
Resources exhausted: Failed to allocate additional 63232256 bytes for GroupedHashAggregateStream[20] with 0 bytes already allocated for this reservation - 51510301 bytes remain available for the total pool
```

**Solution: HyperLogLog + Space-Saving Mechanism**

```sql
SELECT approx_topk_distinct(clientip, clientas, 10) FROM default
```

**Combined Approach:**

- **HyperLogLog**: Handles distinct counting using a fixed 16 KB data structure per group (see the sketch after this list).
- **Space-Saving**: Limits the number of groups returned from each partition to the top K.
- Each partition returns only its top 10 client IPs (by distinct user agent count) to the leader.
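
The following is a minimal, hypothetical Python sketch of HyperLogLog with 2^14 single-byte registers, matching the fixed ~16 KB per-group footprint mentioned above. Bias corrections for small cardinalities are omitted, and this is not OpenObserve's implementation:

```python
import hashlib

# Minimal HyperLogLog sketch (illustrative; small-range corrections omitted).
# 2^14 one-byte registers ~= 16 KB per group, regardless of how many distinct
# values are observed.
P = 14
M = 1 << P                                  # 16384 registers

class HyperLogLog:
    def __init__(self):
        self.registers = bytearray(M)

    def add(self, value):
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h & (M - 1)                   # low 14 bits choose a register
        rest = h >> P                       # remaining 50 bits
        rank = 51 - rest.bit_length()       # leading-zero count + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / M)
        return alpha * M * M / sum(2.0 ** -r for r in self.registers)

    def merge(self, other):
        # Register-wise max merges two sketches without double counting,
        # which is what makes per-partition results combinable at the leader.
        for i, r in enumerate(other.registers):
            if r > self.registers[i]:
                self.registers[i] = r

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-agent-{i}")
print(round(hll.count()))                   # close to 100000 (~0.8% typical error)
```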

**Example: Web Server User Agent Analysis**

Find the top 10 client IPs by unique user agent count from web server logs in the `default` stream.

**Raw Data Distribution**

| Node 1 | Distinct User Agents | Node 2 | Distinct User Agents | Node 3 | Distinct User Agents |
|---------|---------------------|---------|---------------------|---------|---------------------|
| 192.168.1.100 | 450 | 192.168.1.100 | 520 | 192.168.1.100 | 480 |
| 10.0.0.5 | 380 | 203.0.113.50 | 420 | 10.0.0.5 | 410 |
| 203.0.113.50 | 350 | 198.51.100.75 | 390 | 203.0.113.50 | 400 |
| 172.16.0.10 | 320 | 10.0.0.5 | 370 | 198.51.100.75 | 370 |
| 192.168.1.200 | 300 | 172.16.0.10 | 350 | 172.16.0.10 | 340 |
| 198.51.100.75 | 280 | 192.168.1.200 | 330 | 192.168.1.200 | 320 |
| 10.0.0.25 | 260 | 203.0.113.80 | 310 | 203.0.113.80 | 300 |
| 172.16.0.30 | 240 | 172.16.0.30 | 290 | 10.0.0.25 | 280 |
| 203.0.113.80 | 220 | 10.0.0.25 | 270 | 172.16.0.30 | 260 |
| 192.168.1.150 | 200 | 192.168.1.150 | 250 | 192.168.1.150 | 240 |

**Note**: Each distinct count is computed using HyperLogLog's 16 KB data structure per client IP.

**Follower Query Nodes Process Data**

Each follower node executes the query locally and returns only its top 10 results:

| Rank | Node 1 → Leader | Distinct Count | Node 2 → Leader | Distinct Count | Node 3 → Leader | Distinct Count |
|------|-----------------|----------------|-----------------|----------------|-----------------|----------------|
| 1 | 192.168.1.100 | 450 | 192.168.1.100 | 520 | 192.168.1.100 | 480 |
| 2 | 10.0.0.5 | 380 | 203.0.113.50 | 420 | 10.0.0.5 | 410 |
| 3 | 203.0.113.50 | 350 | 198.51.100.75 | 390 | 203.0.113.50 | 400 |
| 4 | 172.16.0.10 | 320 | 10.0.0.5 | 370 | 198.51.100.75 | 370 |
| 5 | 192.168.1.200 | 300 | 172.16.0.10 | 350 | 172.16.0.10 | 340 |
| 6 | 198.51.100.75 | 280 | 192.168.1.200 | 330 | 192.168.1.200 | 320 |
| 7 | 10.0.0.25 | 260 | 203.0.113.80 | 310 | 203.0.113.80 | 300 |
| 8 | 172.16.0.30 | 240 | 172.16.0.30 | 290 | 10.0.0.25 | 280 |
| 9 | 203.0.113.80 | 220 | 10.0.0.25 | 270 | 172.16.0.30 | 260 |
| 10 | 192.168.1.150 | 200 | 192.168.1.150 | 250 | 192.168.1.150 | 240 |

**Leader Query Node Aggregates Results**

| Client IP | Node 1 | Node 2 | Node 3 | Total Distinct User Agents |
|-----------|---------|---------|---------|---------------------------|
| 192.168.1.100 | 450 | 520 | 480 | **1,450** |
| 10.0.0.5 | 380 | 370 | 410 | **1,160** |
| 203.0.113.50 | 350 | 420 | 400 | **1,170** |
| 198.51.100.75 | 280 | 390 | 370 | **1,040** |
| 172.16.0.10 | 320 | 350 | 340 | **1,010** |
| 192.168.1.200 | 300 | 330 | 320 | **950** |
| 203.0.113.80 | 220 | 310 | 300 | **830** |
| 10.0.0.25 | 260 | 270 | 280 | **810** |
| 172.16.0.30 | 240 | 290 | 260 | **790** |
| 192.168.1.150 | 200 | 250 | 240 | **690** |

**Final Top 10 Results:**

| Rank | Client IP | Total Distinct User Agents |
|------|-----------|---------------------------|
| 1 | 192.168.1.100 | 1,450 |
| 2 | 203.0.113.50 | 1,170 |
| 3 | 10.0.0.5 | 1,160 |
| 4 | 198.51.100.75 | 1,040 |
| 5 | 172.16.0.10 | 1,010 |
| 6 | 192.168.1.200 | 950 |
| 7 | 203.0.113.80 | 830 |
| 8 | 10.0.0.25 | 810 |
| 9 | 172.16.0.30 | 790 |
| 10 | 192.168.1.150 | 690 |

**Why Results Are Approximate**

Results are approximate due to two factors:

1. **HyperLogLog approximation:** Distinct counts are estimated, not exact (see the illustration after this list).
2. **Space-Saving distribution:** Some globally significant client IPs might not appear in individual nodes' top 10 lists due to uneven data distribution.
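
On the first point, note that the aggregation tables above add per-node counts for simplicity; when the same user agent appears on several nodes, a plain sum would overcount the true union, which is exactly what a mergeable sketch such as HyperLogLog (register-wise max, as in the sketch above) can avoid. A hypothetical set-based illustration:

```python
# Hypothetical per-node user-agent sets; "UA-1" is seen on all three nodes.
node_agents = [
    {"UA-1", "UA-2", "UA-3"},
    {"UA-1", "UA-4"},
    {"UA-1", "UA-5", "UA-6"},
]
summed = sum(len(s) for s in node_agents)        # 8: UA-1 counted three times
true_union = len(set().union(*node_agents))      # 6: each agent counted once
print(summed, true_union)
# A HyperLogLog sketch per node, merged by register-wise max, estimates the
# union (6) directly instead of the inflated sum (8).
```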

**Limitations**

- Results are approximate, not exact.
- Distinct count accuracy depends on the HyperLogLog algorithm's precision.
- Filter clauses are not currently supported with `approx_topk_distinct`.

---

## Related Links

OpenObserve uses [Apache DataFusion](https://datafusion.apache.org/user-guide/sql/index.html) as its query engine. All supported SQL syntax and functions are available through DataFusion.

docs/user-guide/pipelines/use-pipelines.md

Lines changed: 1 addition & 1 deletion
@@ -154,7 +154,7 @@ The above example illustrates a basic pipeline setup. However, pipelines can bec
<br>

## FAQ
**Q**: If I set the frequency to 5 minutes and the current time is 23:03, when will the next runs happen? <br>
**A**: OpenObserve aligns the next run to the nearest upcoming time that is divisible by the frequency, starting from the top of the hour in the configured timezone. This ensures that all runs occur at consistent and predictable intervals.

**Example**<br>
If the current time is 23:03, here is when the next run will occur for different frequencies:
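
As a sketch of this alignment rule (hypothetical Python, assuming runs land on whole multiples of the frequency counted from the top of the hour, with timezone handling omitted):

```python
from datetime import datetime, timedelta

# Next run = first time after `now` that is a whole multiple of
# `freq_minutes` past the top of the hour.
def next_run(now, freq_minutes):
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    elapsed_min = (now - top_of_hour).total_seconds() / 60
    k = int(elapsed_min // freq_minutes) + 1
    return top_of_hour + timedelta(minutes=k * freq_minutes)

now = datetime(2025, 1, 1, 23, 3)
for freq in (5, 10, 15):
    print(freq, next_run(now, freq).strftime("%H:%M"))
# 5 -> 23:05, 10 -> 23:10, 15 -> 23:15
```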
