|
| 1 | +This page provides instructions on using the `approx_topk()` function and explains its performance benefits compared to the traditional `GROUP BY` method. |
| 2 | + |
| 3 | +## What is `approx_topk`? |
| 4 | +The `approx_topk()` function returns an approximate list of the top K most frequently occurring values in a specified field. It uses the Space-Saving algorithm, a memory-efficient approach designed for high-cardinality data and distributed processing, providing significant [performance benefits](#performance-comparison). |
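
The Space-Saving idea mentioned above can be illustrated with a short Python sketch. This is a simplified, single-process illustration and not OpenObserve's implementation; the function name `space_saving_topk` and the `capacity` parameter are hypothetical names chosen for this example.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def space_saving_topk(
    stream: Iterable[Hashable], k: int, capacity: int
) -> List[Tuple[Hashable, int]]:
    """Approximate top-k using at most `capacity` counters (illustrative sketch)."""
    if capacity < k:
        raise ValueError("capacity must be at least k")
    counters: Dict[Hashable, int] = {}
    for value in stream:
        if value in counters:
            # Already tracked: increment its counter.
            counters[value] += 1
        elif len(counters) < capacity:
            # Free slot available: start tracking the new value.
            counters[value] = 1
        else:
            # No free slot: evict the value with the smallest count and let
            # the newcomer inherit that count plus one. Counts can therefore
            # overestimate the true frequency, which is why results are approximate.
            victim = min(counters, key=counters.get)
            counters[value] = counters.pop(victim) + 1
    ranked = sorted(counters.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


stream = ["a"] * 5 + ["b"] * 3 + ["c", "d", "e"]
result = space_saving_topk(stream, k=2, capacity=3)
```

Memory stays bounded by `capacity` regardless of how many distinct values the stream contains, which is the property that makes this family of algorithms suitable for high-cardinality fields.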
| 5 | + |
| 6 | +> To find the top K values based on the number of distinct values in another field, use the [approx_topk_distinct() function](../approx-topk-distinct/). |
| 7 | +
|
| 8 | +--- |
| 9 | + |
| 10 | +## Query Syntax |
| 11 | +```sql |
| 12 | + |
| 13 | +SELECT approx_topk(field_name, K) FROM "stream_name" |
| 14 | +``` |
| 15 | +Here: |
| 16 | + |
| 17 | +- `field_name`: The field for which top values should be retrieved. |
| 18 | +- `K`: The number of top values to return. |
| 19 | +- `stream_name`: The stream containing the data. |
| 20 | + |
| 21 | +**Example** |
| 22 | +```sql |
| 23 | +SELECT approx_topk(clientip, 10) FROM "demo1" |
| 24 | +``` |
| 25 | +This query returns an approximate list of the 10 most frequently occurring values in the `clientip` field from the `demo1` stream. |
| 26 | + |
| 27 | +**Result of `approx_topk`** <br> |
| 28 | +The result is returned as an array of objects, where each object includes the value and its corresponding count. For example: |
| 29 | + |
| 30 | +```json |
| 31 | +{ |
| 32 | + "item": [ { "value": "192.168.1.100", "count": 2650 }, { "value": "10.0.0.5", "count": 2230 }, { "value": "203.0.113.50", "count": 2210 }, { "value": "198.51.100.75", "count": 1979 }, { "value": "172.16.0.10", "count": 1939 } ] |
| 33 | +} |
| 34 | +``` |
| 35 | + |
| 36 | +### Use `approx_topk` With `unnest` |
| 37 | +To convert these nested results into individual rows, use the `unnest()` function. |
| 38 | + |
| 39 | +```sql |
| 40 | +SELECT item.value as clientip, item.count as request_count |
| 41 | +FROM ( |
| 42 | + SELECT unnest(approx_topk(clientip, 20)) as item |
| 43 | + FROM "demo1" |
| 44 | + ) |
| 45 | +ORDER BY request_count |
| 46 | +DESC |
| 47 | +``` |
| 48 | +**Result of `approx_topk()` with `unnest()`** <br> |
| 49 | +This provides a flat output as shown below: |
| 50 | + |
| 51 | +```json |
| 52 | +{ "clientip": "192.168.1.100", "request_count": 2650 } |
| 53 | +{ "clientip": "10.0.0.5", "request_count": 2230 } |
| 54 | +{ "clientip": "203.0.113.50", "request_count": 2210 } |
| 55 | +... |
| 56 | +``` |
| 57 | + |
| 58 | +--- |
| 59 | + |
| 60 | +## `GROUP BY` Versus `approx_topk` |
| 61 | + |
| 62 | +### How `GROUP BY` Works |
| 63 | +The traditional way to find the top values in a field is by using a `GROUP BY` query combined with `ORDER BY` and `LIMIT`. <br> |
| 64 | + For example: |
| 65 | + |
| 66 | + ```sql |
| 67 | + |
| 68 | + SELECT clientip AS x_axis_1, COUNT(*) AS y_axis_1 |
| 69 | + FROM cdn_production |
| 70 | + GROUP BY x_axis_1 |
| 71 | + ORDER BY y_axis_1 DESC |
| 72 | + LIMIT 10 |
| 73 | + ``` |
| 74 | + This query counts how many times each unique `clientip` appears and returns the **top 10** based on that count. |
| 75 | + |
| 76 | +??? info "Why Traditional `GROUP BY` Breaks in Large Datasets" |
| 77 | + In large datasets with high-cardinality fields, the query is executed across multiple querier nodes. Each node uses multiple CPU cores to process the data. The data is split into partitions, and each core handles a subset of partitions. |
| 78 | + |
| 79 | + Consider the following scenario: |
| 80 | + |
| 81 | + - Dataset contains `3 million` unique client IPs. |
| 82 | + - Query runs using `60` querier nodes. |
| 83 | + - Each node uses `60` CPU cores, with each core processing one partition. |
| 84 | + |
| 85 | + This results in: |
| 86 | + |
| 87 | + `3 million` values × `60` nodes × `60` cores or partitions = `10.8 billion` data entries being processed in memory. |
| 88 | + |
| 89 | + This level of memory usage can overwhelm the system and cause failures. |
| 90 | + |
| 91 | + **Typical Failure Message** <br> |
| 92 | + ``` |
| 93 | + Resources exhausted: Failed to allocate additional 63232256 bytes for GroupedHashAggregateStream[20] with 0 bytes already allocated for this reservation - 51510301 bytes remain available for the total pool |
| 94 | + ``` |
| 95 | +  |
| 96 | + This is a common limitation of using traditional `GROUP BY` with high-cardinality fields in large environments. |
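
The arithmetic in the scenario above can be checked with a short Python sketch. It assumes the worst case, in which every partition sees every unique value; the variable names are illustrative only.

```python
# Worst-case number of group entries held in memory across the cluster,
# assuming every partition encounters every unique client IP.
unique_values = 3_000_000
querier_nodes = 60
cores_per_node = 60  # one partition per core

entries_in_memory = unique_values * querier_nodes * cores_per_node
print(f"{entries_in_memory:,}")  # 10,800,000,000
```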
| 97 | + |
| 98 | +### How `approx_topk` Works |
| 99 | +When you run a query using `approx_topk()`, each querier node processes a subset of the dataset and computes its local approximate top K values. |
| 100 | +Each node sends up to `max(K * 10, 1000)` values to the leader node rather than just **K** values. This provides buffer capacity to prevent missing globally frequent values that may not appear in the **local top K** lists of individual nodes. |
| 101 | + |
| 102 | +Despite this optimization, `approx_topk()` still returns approximate results because the function uses a probabilistic algorithm and the query execution is distributed across nodes. |
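
The candidate limit and the merge step described above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not OpenObserve's code: it uses exact per-node counts from `collections.Counter` in place of the approximate per-node counters, and the names `candidate_limit` and `merge_at_leader` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def candidate_limit(k: int) -> int:
    """Number of candidates each node forwards: max(K * 10, 1000)."""
    return max(k * 10, 1000)


def merge_at_leader(node_counts: Iterable[Counter], k: int) -> List[Tuple[str, int]]:
    """Merge each node's truncated candidate list and return the global top k."""
    limit = candidate_limit(k)
    merged: Counter = Counter()
    for local in node_counts:
        # Each node forwards only its `limit` most frequent values.
        merged.update(dict(local.most_common(limit)))
    return merged.most_common(k)


nodes = [
    Counter({"10.0.0.5": 4, "192.168.1.100": 3}),
    Counter({"192.168.1.100": 5, "203.0.113.50": 1}),
]
top = merge_at_leader(nodes, k=1)
```

In this toy input, `192.168.1.100` is not the most frequent value on the first node, yet it is the global leader after merging. Forwarding more than K candidates per node reduces the chance of losing such values, although a value that falls outside every node's forwarded list can still be missed.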
| 103 | + |
| 104 | +!!! Note |
| 105 | + |
| 106 | + This method improves performance and reduces memory usage, especially in production-scale environments. It is a trade-off between precision and efficiency. View the **performance comparison** shown in the following section. |
| 107 | + |
| 108 | +--- |
| 109 | + |
| 110 | +### Performance Comparison |
| 111 | + |
| 112 | +When querying high-cardinality fields like `clientip` in large datasets, performance becomes critical. This section compares the execution performance of a traditional `GROUP BY` query with a query that uses the `approx_topk()` function. |
| 113 | + |
| 114 | +**Use Case**<br> |
| 115 | +You want to identify the top 20 most frequent client IP addresses in the `demo1` stream based on request volume. |
| 116 | + |
| 117 | +**Query 1: Using `GROUP BY` and `LIMIT`**<br> |
| 118 | +```sql |
| 119 | +SELECT clientip as "x_axis_1", count(_timestamp) as "y_axis_1" |
| 120 | +FROM "demo1" |
| 121 | +GROUP BY x_axis_1 |
| 122 | +ORDER BY y_axis_1 DESC |
| 123 | +LIMIT 20 |
| 124 | +``` |
| 125 | + |
| 126 | +**Query 2: Using `approx_topk()`** |
| 127 | +```sql |
| 128 | +SELECT item.value as clientip, item.count as request_count |
| 129 | +FROM ( |
| 130 | + SELECT unnest(approx_topk(clientip, 20)) as item |
| 131 | + FROM "demo1" |
| 132 | +) |
| 133 | +ORDER BY request_count DESC |
| 134 | +``` |
| 135 | + |
| 136 | +**Results** |
| 137 | +<br> |
| 138 | + |
| 139 | +<br> |
| 140 | +Both queries were run against the same dataset using OpenObserve dashboards. Here are the observed query durations from the browser developer tools: |
| 141 | + |
| 142 | +- The `GROUP BY` query without `approx_topk` took **1.46 seconds** to complete. |
| 143 | +- The query using `approx_topk` completed in **692 milliseconds**. |
| 144 | + |
| 145 | +In this scenario, the `approx_topk()` query ran more than twice as fast as the `GROUP BY` query, reducing query time by **over 50 percent**. |
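
The relationship between the two measured durations can be verified with a short Python calculation. Only the two timings come from the measurements above; the variable names are illustrative.

```python
# Observed durations from the comparison above, in milliseconds.
group_by_ms = 1460
approx_topk_ms = 692

speedup = group_by_ms / approx_topk_ms
time_saved_pct = (group_by_ms - approx_topk_ms) / group_by_ms * 100

print(f"{speedup:.2f}x faster, {time_saved_pct:.1f}% less time")
```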
| 146 | + |
| 147 | +--- |
| 148 | + |
| 149 | +## Limitations |
| 150 | + |
| 151 | +The following are the known limitations of the `approx_topk()` function: |
| 152 | + |
| 153 | +- Results are approximate and not guaranteed to be exact, so the function is not recommended when exact accuracy is critical for analysis or reporting. |
| 154 | +- Accuracy depends on data distribution across partitions. |
| 155 | + |
| 156 | +--- |
| 157 | + |
| 158 | +## Frequently Asked Questions |
| 159 | +**Q.** Can I use a `WHERE` clause with `approx_topk()`? <br> |
| 160 | +**A.** Yes. Add a `WHERE` clause to the query that calls `approx_topk()`. The filter is applied first, so the top K calculation covers only the matching records. |
| 161 | + |
| 162 | +```sql |
| 163 | +SELECT item.value as clientip, item.count as request_count |
| 164 | +FROM ( |
| 165 | + SELECT unnest(approx_topk(clientip, 5)) as item |
| 166 | + FROM "demo1" |
| 167 | + WHERE status = 401 |
| 168 | +) |
| 169 | +ORDER BY request_count DESC |
| 170 | +``` |
| 171 | +<br> |
| 172 | + |