update the approx_topk_distinct documentation (#92)

DebashisBorgohainO2 · web-flow · commit 79a7a7436a15 · 2025-07-14T14:28:01.000+05:30
* update approx_topk page and restructure SQL References by creating individual function pages under /sql-functions folder

* address review comments on sql functions pages

* update the approx_topk_distinct documentation
diff --git a/docs/sql-functions/aggregate.md b/docs/sql-functions/aggregate.md
@@ -1,3 +1,9 @@
+
+---
+title: histogram() Function in OpenObserve
+description: This page explains how to use the histogram() function in OpenObserve to group time-based log data into fixed intervals for trend analysis. It includes syntax options with or without interval specification, use with aggregate functions such as COUNT(), and guidance on interpreting the result. A detailed example shows how logs are grouped into 30-second time buckets, along with the output format. Users are advised to specify intervals explicitly to ensure consistent and predictable results. The page also includes a visual example to support understanding. 
+---
+
 Aggregate functions compute a single result from a set of input values. For usage of standard SQL aggregate functions such as `COUNT`, `SUM`, `AVG`, `MIN`, and `MAX`, refer to [PostgreSQL documentation](https://www.postgresql.org/docs/).
 
 ---
diff --git a/docs/sql-functions/approximate-aggregate/.pages b/docs/sql-functions/approximate-aggregate/.pages
@@ -2,3 +2,5 @@ nav:
 
 - Overview: index.md
 - approx_topk : approx-topk.md
+- approx_topk_distinct: approx-topk-distinct.md
+
diff --git a/docs/sql-functions/approximate-aggregate/approx-topk-distinct.md b/docs/sql-functions/approximate-aggregate/approx-topk-distinct.md
@@ -1,8 +1,13 @@
+---
+title: approx_topk_distinct() Function in OpenObserve
+description: This page explains how to use the approx_topk_distinct() function in OpenObserve to identify the top K values in one field based on the highest number of distinct values in another field. It introduces the combined use of HyperLogLog and Space-Saving algorithms to efficiently process large, high-cardinality datasets. The guide includes SQL syntax, a usage example, and demonstrates how to flatten the result using the unnest() function. It also provides a sample output to help users understand the structure and interpretation of the result. For top values based only on frequency, refer to the approx_topk() function.
+---
+
 This page provides instructions on using the `approx_topk_distinct()` function. 
 If you only need to find the top K most frequently occurring values in a field, refer to the [approx_topk()](../approx-topk/) function.
 
-## What is approx_topk_distinct()
-The approx_topk_distinct() function returns an approximate list of the top K values from one field (field1) that have the most number of distinct values in another field (field2). It is designed to handle large-scale, high-cardinality datasets efficiently by combining two algorithms:
+## What is approx_topk_distinct?
+The `approx_topk_distinct()` function returns an approximate list of the top K values from one field (`field1`) that have the largest number of distinct values in another field (`field2`). It is designed to handle large-scale, high-cardinality datasets efficiently by combining two algorithms:
 
 - **HyperLogLog**: Used to estimate the number of distinct values in field2 per field1.
 - **Space-Saving**: Used to select the top K field1 values with the highest estimated distinct counts.
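
To make the ranking concrete, the following is a minimal Python sketch of the exact result that `approx_topk_distinct()` approximates. It is not OpenObserve's implementation: it uses plain sets instead of HyperLogLog and a full sort instead of Space-Saving, so it is only practical on small inputs. The function name `topk_distinct_exact` and the sample `client_ip` values are hypothetical and used only for illustration.

```python
from collections import defaultdict


def topk_distinct_exact(rows, k):
    """Exact reference: top-k field1 values ranked by distinct field2 count.

    Illustrative only. An approximate implementation would replace each set
    with a HyperLogLog sketch and the full sort with a Space-Saving summary.
    """
    distinct = defaultdict(set)
    for field1, field2 in rows:
        distinct[field1].add(field2)  # a set stores each field2 value once

    counts = ((value, len(seen)) for value, seen in distinct.items())
    # Highest distinct count first; ties broken by value for determinism.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))[:k]


# Hypothetical sample rows: (client_ip, user_agent) pairs.
rows = [
    ("10.0.0.1", "curl"),
    ("10.0.0.1", "firefox"),
    ("10.0.0.1", "curl"),  # repeated pair, counted once
    ("10.0.0.2", "chrome"),
    ("10.0.0.3", "chrome"),
    ("10.0.0.3", "safari"),
    ("10.0.0.3", "edge"),
]

print(topk_distinct_exact(rows, 2))
# [('10.0.0.3', 3), ('10.0.0.1', 2)]
```

The exact version keeps every distinct `field2` value in memory for every `field1` value, which is the cost the approximate function is meant to avoid on high-cardinality data.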
@@ -56,4 +61,22 @@ FROM (
 ORDER BY distinct_user_agent_count DESC
 ```
 **Result**
-![approx_topk_distinct](../../images/approx-topk-distinct.png)
+<br>
+This query uses `approx_topk_distinct()` with `unnest()` to return a flat result, where each row contains a value from `field1` and its approximate distinct count from `field2`: <br>
+![approx_topk_distinct](../../images/approx-topk-distinct.png)
+
+## Performance Considerations
+The `approx_topk_distinct()` function is designed for high-cardinality fields and large datasets. It uses the same distributed and memory-efficient architecture as `approx_topk()`.
+
+For details on how this approach compares to traditional GROUP BY queries in terms of performance and memory usage, see the [approx_topk() guide](../approx-topk/).
+
+---
+
+## Limitations
+The following are the known limitations of the `approx_topk_distinct()` function:
+
+- Results are approximate and are not guaranteed to be exact. The function is not recommended when exact accuracy is critical for analysis or reporting.
+- Accuracy depends on how the data is distributed across partitions.
+
+
diff --git a/docs/sql-functions/approximate-aggregate/approx-topk.md b/docs/sql-functions/approximate-aggregate/approx-topk.md
@@ -1,3 +1,9 @@
+
+---
+title: approx_topk() Function in OpenObserve
+description: This page explains how to use the approx_topk() function in OpenObserve to identify the most frequent values in high-cardinality fields. It provides the SQL syntax, a usage example, result structure, and comparison with the traditional GROUP BY approach. The guide includes a detailed performance comparison and highlights memory efficiency in distributed query processing. It also demonstrates how to use approx_topk() with unnest() for flat output and explains scenarios where this function offers a practical advantage. Limitations and frequently asked questions are included to help users understand when to use this approximate method.
+---
+
 This page provides instructions on using the `approx_topk()` function and explains its performance benefits compared to the traditional `GROUP BY` method.
 
 ## What is `approx_topk`?
diff --git a/docs/sql-functions/approximate-aggregate/index.md b/docs/sql-functions/approximate-aggregate/index.md
@@ -2,4 +2,5 @@ OpenObserve provides the following approximate aggregate functions designed for
 
 Learn more: 
 
-- [approx_topk](../approximate-aggregate/approx-topk/)
+- [approx_topk](../approximate-aggregate/approx-topk/)
+- [approx_topk_distinct](../approximate-aggregate/approx-topk-distinct/)
diff --git a/docs/sql-functions/array.md b/docs/sql-functions/array.md
@@ -1,3 +1,8 @@
+---
+title: Array Functions in OpenObserve
+description: This page lists all supported array functions in OpenObserve, along with their syntax, descriptions, and usage examples. These functions operate on fields that contain stringified JSON arrays, enabling users to sort, count, extract subsets, join, and combine array elements. Functions such as arrsort, arrjoin, arrindex, arrzip, spath, and cast_to_arr help process and transform array data effectively. 
+---
+
 This page lists the array functions supported in OpenObserve, along with their usage formats, descriptions, and examples.
 
 The array functions operate on fields that contain arrays. In OpenObserve, array fields are typically stored as stringified JSON arrays.
diff --git a/docs/sql-functions/full-text-search.md b/docs/sql-functions/full-text-search.md
@@ -1,3 +1,8 @@
+---
+title: Full-Text Search Functions in OpenObserve
+description: This page describes the full-text search functions supported in OpenObserve, including their syntax, behavior, and examples. Functions such as str_match, str_match_ignore_case, match_all, re_match, and re_not_match allow users to filter logs based on exact string matches, case-insensitive searches, keyword searches across multiple fields, and pattern-based filtering using regular expressions. The guide also explains the role of inverted indexing and how to enable it for enhanced search coverage. Sample queries and output visuals are provided to help users apply these functions effectively in log analysis.
+---
+
 The full-text search functions allow you to filter records based on keyword or pattern matches within one or more fields. <br>This page lists the full-text search functions supported in OpenObserve, along with their usage formats, descriptions, and examples.
 
 ---