[Docs] Dataset management updated #2733

@@ -2,17 +2,49 @@
subtitle: Guides you through the process of creating and managing datasets
---

Datasets can be used to track the test cases you would like to evaluate your LLM on. Each dataset item is a dictionary
of arbitrary key-value pairs. When getting started, we recommend an `input` field and an optional `expected_output`
field. Datasets can be created from:

- Python SDK: You can use the Python SDK to create a dataset and add items to it.
- Traces table: You can add existing logged traces (from a production application for example) to a dataset.
- The Opik UI: You can manually create a dataset and add items to it.
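
As an illustration of the item shape described above, here is a minimal sketch in plain Python. It does not call the Opik SDK; the sample items, the `metadata` key, and the `missing_input` helper are hypothetical and exist only to show that each item is a dictionary with a recommended `input` key and an optional `expected_output` key.

```python
from typing import Any

# Hypothetical example items. Only `input` and `expected_output` come from the
# recommendation above; any other key (such as `metadata`) is illustrative.
items: list[dict[str, Any]] = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "Summarize this support ticket.", "metadata": {"source": "production"}},
]


def missing_input(candidates: list[dict[str, Any]]) -> list[int]:
    """Return the indexes of items that lack the recommended `input` key."""
    return [i for i, item in enumerate(candidates) if "input" not in item]
```

A check like this can be run before adding items to a dataset, whichever of the three creation paths is used.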

Once a dataset has been created, you can run Experiments on it. Each Experiment will evaluate an LLM application based
on the test cases in the dataset using an evaluation metric and report the results back to the dataset.

## Create a dataset via the UI

The simplest and fastest way to create a dataset is directly in the Opik UI.
This is ideal for quickly bootstrapping datasets from CSV files without needing to write any code.

Steps:
1. Navigate to Evaluation > Datasets in the Opik UI.
2. Click Create new dataset.
3. In the pop-up modal:
   * Provide a name and an optional description.
   * Optionally, upload a CSV file with your data.
4. Click Create dataset.

<Frame>
<img src="/img/evaluation/create_dataset.png" />
</Frame>

CSV Format Requirements:
* Your CSV must contain exactly two columns:
  * `input`
  * `output`
* Maximum of 1,000 rows per upload.
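
The requirements above can be checked locally before uploading. The following is a minimal sketch using only the Python standard library, not part of Opik. The `check_upload_csv` helper is hypothetical, and it assumes that the file has a header row, that column order does not matter, and that the header row does not count toward the 1,000-row limit.

```python
import csv
import io

REQUIRED_COLUMNS = ["input", "output"]  # column names from the requirements above
MAX_ROWS = 1000  # per-upload row limit from the requirements above


def check_upload_csv(text: str) -> list[str]:
    """Return a list of problems found; an empty list means no problems were detected."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    # Assumption: column order is irrelevant, so compare sorted names.
    if sorted(name.strip() for name in header) != sorted(REQUIRED_COLUMNS):
        problems.append(f"expected columns {REQUIRED_COLUMNS}, found {header}")
    # Assumption: blank lines and the header row are not counted as data rows.
    row_count = sum(1 for row in reader if row)
    if row_count > MAX_ROWS:
        problems.append(f"{row_count} rows exceeds the limit of {MAX_ROWS}")
    return problems
```

For example, `check_upload_csv(open("my_dataset.csv", newline="").read())` returns an empty list when the file (here a hypothetical `my_dataset.csv`) has the two expected columns and no more than 1,000 data rows.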

<Tip>
Creating a dataset through the UI has some limitations:
* Only two columns are allowed.
* Each upload is limited to 1,000 rows.
* Nested JSON structures in the CSV are not supported.

For datasets requiring rich metadata, complex schemas, or programmatic control, use the SDK instead (see the next section).
</Tip>

## Creating a dataset using the SDK

You can create a dataset and log items to it using the `get_or_create_dataset` method: