
Commit be41572

update approx_topk page and create /sql-functions folder (#91)
* update approx_topk page and restructure SQL References by creating individual function pages under /sql-functions folder
* address review comments on sql functions pages
1 parent 9b11778 commit be41572

15 files changed: +608 -1 lines changed

docs/.pages

Lines changed: 1 addition & 0 deletions
```diff
@@ -22,3 +22,4 @@ nav:
 - Telemetry: telemetry.md
 - zPlane: zplane.md
 - Work Group: work_group.md
+- SQL Functions: sql-functions
```

docs/images/approx-topk-distinct.png (251 KB)

docs/images/approx-topk.png (402 KB)

docs/sql-functions/.pages

Lines changed: 7 additions & 0 deletions
```yaml
nav:
- SQL Functions Overview: index.md
- Full-Text Search Functions: full-text-search.md
- Array Functions: array.md
- Aggregate Functions: aggregate.md
- Approximate Aggregate Functions: approximate-aggregate
```

docs/sql-functions/aggregate.md

Lines changed: 37 additions & 0 deletions
Aggregate functions compute a single result from a set of input values. For usage of standard SQL aggregate functions such as `COUNT`, `SUM`, `AVG`, `MIN`, and `MAX`, refer to the [PostgreSQL documentation](https://www.postgresql.org/docs/).

---

### `histogram`
**Description:** <br>
Use the `histogram()` function to divide your time-based log data into fixed intervals and apply aggregate functions such as `COUNT()` or `SUM()` to analyze time-series patterns. This helps visualize trends over time and supports meaningful comparisons.<br><br>
**Syntax:** <br>
```sql
histogram(timestamp_field, 'interval')
```

- `timestamp_field`: A valid timestamp field, such as `_timestamp`.
- `interval`: Optional. A fixed time interval in readable units, such as '30 seconds', '1 minute', '15 minutes', or '1 hour'.

**Histogram with an aggregate function** <br>
```sql
SELECT histogram(_timestamp, '30 seconds') AS key, COUNT(*) AS num
FROM "default"
GROUP BY key
ORDER BY key
```
**Expected Output**: <br>

This query divides the log data into 30-second intervals. Each row in the result shows:

- **`key`**: The start time of the 30-second bucket.
- **`num`**: The count of log records that fall within that time bucket.
<br>
![histogram](./images/sql-reference/histogram.png)
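Conceptually, each record's timestamp is truncated down to the start of its interval, and the records sharing a bucket are counted. The following Python sketch illustrates that bucketing on epoch-microsecond timestamps; the microsecond format and field layout are assumptions for illustration, not OpenObserve's implementation.

```python
from collections import Counter

def histogram_counts(timestamps_us, interval_seconds):
    """Illustrative sketch: truncate each epoch-microsecond timestamp to the
    start of its fixed interval, then count records per bucket."""
    interval_us = interval_seconds * 1_000_000
    buckets = Counter((ts // interval_us) * interval_us for ts in timestamps_us)
    # Sorted (bucket_start_us, count) pairs, analogous to key/num above.
    return sorted(buckets.items())

# Four records: three fall in the first 30-second bucket, one in the next.
ts = [1_700_000_012_000_000, 1_700_000_025_000_000,
      1_700_000_039_000_000, 1_700_000_041_000_000]
print(histogram_counts(ts, 30))  # → [(1700000010000000, 3), (1700000040000000, 1)]
```

Note that bucket boundaries are fixed multiples of the interval, which is what makes histogram results comparable across queries.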
!!! note

    - If you do not specify an interval, the backend automatically determines a suitable value.
    - To ensure consistent bucket sizes and avoid unexpected behavior, it is recommended to always define the interval explicitly.
Lines changed: 4 additions & 0 deletions

```yaml
nav:
- Overview: index.md
- approx_topk: approx-topk.md
```
Lines changed: 59 additions & 0 deletions
This page provides instructions on using the `approx_topk_distinct()` function.
If you only need to find the top K most frequently occurring values in a field, refer to the [approx_topk()](../approx-topk/) function.

## What is `approx_topk_distinct()`?
The `approx_topk_distinct()` function returns an approximate list of the top K values from one field (`field1`) that have the highest number of distinct values in another field (`field2`). It is designed to handle large-scale, high-cardinality datasets efficiently by combining two algorithms:

- **HyperLogLog**: Used to estimate the number of distinct values in `field2` per `field1` value.
- **Space-Saving**: Used to select the top K `field1` values with the highest estimated distinct counts.

Because both algorithms are probabilistic and the computation is distributed across multiple query nodes, the results are approximate.
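To make the semantics concrete, the following Python sketch shows the exact computation that `approx_topk_distinct()` approximates: group rows by `field1`, count distinct `field2` values per group, and keep the K groups with the highest counts. This is an illustrative reference only; the real function replaces the exact sets with HyperLogLog sketches and the exact global sort with Space-Saving, so its counts are estimates.

```python
from collections import defaultdict

def exact_topk_distinct(rows, k):
    """Reference computation that approx_topk_distinct approximates:
    per field1 value, count distinct field2 values, return the top k."""
    distinct = defaultdict(set)
    for field1, field2 in rows:
        distinct[field1].add(field2)   # HyperLogLog replaces this exact set
    counts = {value: len(s) for value, s in distinct.items()}
    # Space-Saving replaces this exact global sort
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical (clientip, user_agent) pairs for illustration.
rows = [("10.0.0.5", "ua1"), ("10.0.0.5", "ua2"), ("10.0.0.5", "ua3"),
        ("192.168.1.100", "ua1"), ("192.168.1.100", "ua1"),
        ("172.16.0.10", "ua2")]
print(exact_topk_distinct(rows, 2))  # → [('10.0.0.5', 3), ('192.168.1.100', 1)]
```

Note that repeated `(field1, field2)` pairs do not increase the count; only distinct `field2` values per `field1` matter.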
---

## Query Syntax

```sql
SELECT approx_topk_distinct(field1, field2, K) FROM "stream_name"
```
Here:

- `field1`: The field to group by and return top results for.
- `field2`: The field whose distinct values are counted per `field1` value.
- `K`: The number of top results to return.
- `stream_name`: The stream containing the data.

**Example**
```sql
SELECT approx_topk_distinct(clientip, user_agent, 5) FROM "demo1"
```
This query returns an approximate list of the top 5 `clientip` values that have the highest number of distinct `user_agent` values in the `demo1` stream.

**Note:** The result is returned as an array of objects, where each object includes a value of `field1` and its corresponding distinct count based on `field2`.

```json
{
  "item": [
    { "value": "192.168.1.100", "count": 1450 },
    { "value": "203.0.113.50", "count": 1170 },
    { "value": "10.0.0.5", "count": 1160 },
    { "value": "198.51.100.75", "count": 1040 },
    { "value": "172.16.0.10", "count": 1010 }
  ]
}
```

### Use `approx_topk_distinct` With `unnest`
To convert the nested array into individual rows for easier readability or further processing, use the `unnest()` function.

```sql
SELECT item.value as clientip, item.count as distinct_user_agent_count
FROM (
  SELECT unnest(approx_topk_distinct(clientip, user_agent, 5)) as item
  FROM "demo1"
)
ORDER BY distinct_user_agent_count DESC
```
**Result**
![approx_topk_distinct](../../images/approx-topk-distinct.png)
Lines changed: 172 additions & 0 deletions
This page provides instructions on using the `approx_topk()` function and explains its performance benefits compared to the traditional `GROUP BY` method.

## What is `approx_topk`?
The `approx_topk()` function returns an approximate list of the top K most frequently occurring values in a specified field. It uses the Space-Saving algorithm, a memory-efficient approach designed for high-cardinality data and distributed processing, providing significant [performance benefits](#performance-comparison).

> To find the top K values based on the number of distinct values in another field, use the [approx_topk_distinct() function](../approx-topk-distinct/).
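The core of the Space-Saving algorithm fits in a few lines: it tracks at most k counters, and when an untracked value arrives while all counters are in use, the smallest counter is evicted and the newcomer inherits its count, which is why counts can be overestimated. The following Python sketch is illustrative only and assumes a single in-memory stream; the actual implementation runs distributed across query nodes.

```python
def space_saving(stream, k):
    """Minimal Space-Saving sketch: keep at most k counters.
    Returned counts may overestimate, never underestimate."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict the smallest counter; the newcomer inherits its count.
            victim = min(counters, key=counters.get)
            min_count = counters.pop(victim)
            counters[item] = min_count + 1
    return counters

stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 5 + ["d"] * 3 + ["e"] * 2
est = space_saving(stream, k=3)
print(est)  # → {'a': 50, 'b': 30, 'e': 10}
```

Here `e` ends up with an estimated count of 10 even though it appeared only twice, because it inherited an evicted counter; frequent values like `a` keep accurate counts, which is the trade-off the algorithm makes.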
---

## Query Syntax
```sql
SELECT approx_topk(field_name, K) FROM "stream_name"
```
Here:

- `field_name`: The field for which top values should be retrieved.
- `K`: The number of top values to return.
- `stream_name`: The stream containing the data.

**Example**
```sql
SELECT approx_topk(clientip, 10) FROM "demo1"
```
This query returns an approximate list of the top 10 most frequently occurring values in the `clientip` field from the `demo1` stream.

**Result of `approx_topk`** <br>
The result is returned as an array of objects, where each object includes the value and its corresponding count. For example:

```json
{
  "item": [
    { "value": "192.168.1.100", "count": 2650 },
    { "value": "10.0.0.5", "count": 2230 },
    { "value": "203.0.113.50", "count": 2210 },
    { "value": "198.51.100.75", "count": 1979 },
    { "value": "172.16.0.10", "count": 1939 }
  ]
}
```

### Use `approx_topk` With `unnest`
To convert these nested results into individual rows, use the `unnest()` function.

```sql
SELECT item.value as clientip, item.count as request_count
FROM (
  SELECT unnest(approx_topk(clientip, 20)) as item
  FROM "demo1"
)
ORDER BY request_count DESC
```
**Result of `approx_topk()` with `unnest()`**
This provides a flat output as shown below:

```json
{ "value": "192.168.1.100", "count": 2650 }
{ "value": "10.0.0.5", "count": 2230 }
{ "value": "203.0.113.50", "count": 2210 }
...
```

---

## `GROUP BY` Versus `approx_topk`

### How `GROUP BY` Works
The traditional way to find the top values in a field is a `GROUP BY` query combined with `ORDER BY` and `LIMIT`. <br>
For example:

```sql
SELECT clientip AS x_axis_1, COUNT(*) AS y_axis_1
FROM cdn_production
GROUP BY x_axis_1
ORDER BY y_axis_1 DESC
LIMIT 10
```
This query counts how many times each unique `clientip` appears and returns the **top 10** based on that count.

??? info "Why Traditional `GROUP BY` Breaks in Large Datasets"
    In large datasets with high-cardinality fields, the query is executed across multiple querier nodes. Each node uses multiple CPU cores to process the data. The data is split into partitions, and each core handles a subset of partitions.

    Consider the following scenario:

    - The dataset contains `3 million` unique client IPs.
    - The query runs across `60` querier nodes.
    - Each node has `60` CPU cores, with each core processing one partition.

    This results in:

    `3 million` values × `60` nodes × `60` cores or partitions = `10.8 billion` data entries processed in memory.

    This level of memory usage can overwhelm the system and cause failures.

    **Typical Failure Message** <br>
    ```
    Resources exhausted: Failed to allocate additional 63232256 bytes for GroupedHashAggregateStream[20] with 0 bytes already allocated for this reservation - 51510301 bytes remain available for the total pool
    ```
    ![Typical Failure Message](../../images/approx-top-k-error-in-traditional-method.png)

    This is a common limitation of using traditional `GROUP BY` with high-cardinality fields in large environments.

### How `approx_topk` Works
When you run a query using `approx_topk()`, each query node processes a subset of the dataset and computes its local approximate top K values.
Each node sends up to `max(K * 10, 1000)` values to the leader node rather than just **K** values. This buffer capacity prevents missing globally frequent values that may not appear in the **local top K** lists of individual nodes.

Despite this optimization, `approx_topk()` still returns approximate results because the function uses a probabilistic algorithm and the query execution is distributed across nodes.
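The need for that buffer can be shown with a tiny merge sketch: a value may be globally frequent without leading any single node's local list. The node names and counts below are hypothetical, chosen only to illustrate the point.

```python
from collections import Counter

def merge_candidates(partials, k):
    """Sum the per-node candidate counts, then take the global top k."""
    total = Counter()
    for node_counts in partials:
        total.update(node_counts)
    return total.most_common(k)

# Hypothetical local candidate counts reported by two query nodes.
node1 = {"a": 40, "b": 35}
node2 = {"d": 41, "b": 36}

# If each node sent only its local top 1 ("a" and "d"), the merge would
# miss "b", whose global count (71) beats both. Sending a larger
# candidate list lets the leader recover it.
print(merge_candidates([node1, node2], 1))  # → [('b', 71)]
```

This is why each node reports `max(K * 10, 1000)` candidates instead of exactly K: the oversized candidate lists make it much less likely that a globally frequent value is dropped before the merge.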
!!! note

    This method improves performance and reduces memory usage, especially in production-scale environments. It is a trade-off between precision and efficiency. See the performance comparison in the following section.

---

### Performance Comparison

When querying high-cardinality fields such as `clientip` in large datasets, performance becomes critical. This section compares the execution performance of a traditional `GROUP BY` query with a query that uses the `approx_topk()` function.

**Use Case**<br>
You want to identify the top 20 most frequent client IP addresses in the `demo1` stream based on request volume.

**Query 1: Using `GROUP BY` and `LIMIT`**<br>
```sql
SELECT clientip as "x_axis_1", count(_timestamp) as "y_axis_1"
FROM "demo1"
GROUP BY x_axis_1
ORDER BY y_axis_1 DESC
LIMIT 20
```

**Query 2: Using `approx_topk()`**
```sql
SELECT item.value as clientip, item.count as request_count
FROM (
  SELECT unnest(approx_topk(clientip, 20)) as item
  FROM "demo1"
)
ORDER BY request_count DESC
```

**Results**
<br>
![Performance difference between `GROUP BY` and `approx_topk()`](../../images/approx-topk.png)
<br>
Both queries were run against the same dataset using OpenObserve dashboards. The observed query durations from the browser developer tools were:

- The `GROUP BY` query without `approx_topk` took **1.46 seconds** to complete.
- The query using `approx_topk` completed in **692 milliseconds**.

In this scenario, `approx_topk` executed more than twice as fast, a performance improvement of over **50 percent**.

---

## Limitations

The following are known limitations of the `approx_topk()` function:

- Results are approximate and not guaranteed to be exact. The function is not recommended when exact accuracy is critical for analysis or reporting.
- Accuracy depends on the data distribution across partitions.

---

## Frequently Asked Questions
**Q.** Can I use a `WHERE` clause with `approx_topk()`? <br>
**A.** Yes. You can apply a `WHERE` clause in the subquery to filter the dataset before `approx_topk()` is computed. This limits the scope of the top K calculation to only the matching records.

```sql
SELECT item.value as clientip, item.count as request_count
FROM (
  SELECT unnest(approx_topk(clientip, 5)) as item
  FROM "demo1"
  WHERE status = 401
)
ORDER BY request_count DESC
```
<br>
![WHERE clause with approx_topk](../../images/approx-topk-with-filter.png)
