
[FEATURE] Support Custom Code Evaluators for Prompt Experiment Results #141

@xiaohan-0114

Description


📋 Please ensure that:

  • I have searched existing issues to avoid duplicates
  • I have provided a clear problem statement and solution
  • I understand this is a feature request and not a bug report
  • I am willing to help implement this feature if needed
  • I have submitted this feature request in English (otherwise it will not be processed)

🎯 Problem Statement

Currently, evaluation relies mainly on LLM evaluators. However, in many real-world scenarios, users need to define custom evaluation logic that compares model outputs against expected results in a more flexible way.

We propose supporting a "Code Evaluator" feature, where users can submit Python code (or code in other supported languages) that executes custom logic based on:

  • Evaluation input data (e.g., ground truth, expected labels)
  • Model return result (e.g., JSON output)
  • Optional model parameters / metadata

The custom code should be able to parse the model’s output, apply domain-specific evaluation rules, and generate a final evaluation result (e.g., 0/1, a score, or even a custom label).
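For concreteness, a minimal sketch of what such a user-submitted evaluator could look like, assuming the platform calls a single entry-point function per evaluation case (the function name `evaluate` and the field names `expected` / `answer` are illustrative, not an existing API):

```python
import json
from typing import Any

# Hypothetical contract: the platform calls a user-defined `evaluate` function
# once per evaluation case and stores whatever structured result it returns.
def evaluate(input_data: dict[str, Any],
             model_output: str,
             metadata: dict[str, Any] | None = None) -> dict[str, Any]:
    expected = input_data.get("expected")      # ground truth from the test case
    try:
        parsed = json.loads(model_output)      # raw model return, assumed to be JSON here
    except json.JSONDecodeError:
        return {"score": 0, "label": "unparseable_output"}
    matched = parsed.get("answer") == expected
    return {"score": 1 if matched else 0,
            "label": "match" if matched else "mismatch"}
```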

💡 Proposed Solution

  • Allow users to define evaluators by writing code (starting with Python support).
  • Provide an execution environment that runs the evaluator code for each evaluation case (see the runner sketch below).
  • Inputs passed into the evaluator:
      • input_data (from the test case)
      • model_output (raw model return)
      • metadata (model parameters, prompt, etc.)
  • The evaluator returns a structured result (e.g., 0/1, a numeric score, or JSON).
  • Store evaluator results alongside built-in evaluation results in the experiment summary.
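A rough sketch of how the execution side could drive such an evaluator, one call per case, collecting results so they can be stored next to the built-in ones (the `run_code_evaluator` name and the `cases` record layout are assumptions for illustration):

```python
from typing import Any, Callable

def run_code_evaluator(evaluate: Callable[..., dict[str, Any]],
                       cases: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Run the user-supplied evaluator once per evaluation case.
    results = []
    for case in cases:
        try:
            result = evaluate(
                input_data=case["input_data"],      # from the test case
                model_output=case["model_output"],  # raw model return
                metadata=case.get("metadata"),      # model parameters, prompt, etc.
            )
        except Exception as exc:  # user code may raise; record the failure instead of aborting the run
            result = {"score": None, "error": str(exc)}
        results.append({"case_id": case.get("id"), **result})
    return results
```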

📋 Use Cases

Classification Evaluation

  • The user designs a prompt to classify text into categories.
  • Test data includes the expected category labels.
  • The model returns a JSON response, e.g. { "class": "positive" }.
  • The code evaluator parses the JSON, compares the "class" field to the expected label, and outputs 1 (match) or 0 (mismatch), as sketched below.
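A possible evaluator for this use case (the field names `expected_class` and `class` are illustrative):

```python
import json

def evaluate(input_data: dict, model_output: str, metadata: dict | None = None) -> int:
    # Compare the model's predicted class against the expected label from the test data.
    expected = input_data["expected_class"]        # e.g. "positive"
    try:
        predicted = json.loads(model_output).get("class")
    except json.JSONDecodeError:
        return 0                                   # malformed output counts as a mismatch
    return 1 if predicted == expected else 0

# Example: evaluate({"expected_class": "positive"}, '{"class": "positive"}') -> 1
```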

Custom Scoring Logic

  • Instead of a binary match, the evaluator computes a similarity score (e.g., cosine similarity between embeddings, or string edit distance).
  • It outputs a float score (0.0–1.0); a sketch follows below.
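A minimal sketch of such a scoring evaluator, using the standard library's difflib string-similarity ratio as a stand-in for embedding cosine similarity or edit distance (`expected_text` is an illustrative field name):

```python
from difflib import SequenceMatcher

def evaluate(input_data: dict, model_output: str, metadata: dict | None = None) -> float:
    # Return a 0.0-1.0 similarity score instead of a binary match.
    expected = input_data["expected_text"]
    return SequenceMatcher(None, expected.strip(), model_output.strip()).ratio()

# Example: evaluate({"expected_text": "cats are great"}, "Cats are great!") -> ~0.9
```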

Structured Output Validation

  • For tasks requiring structured JSON responses, the evaluator can check the presence and validity of required fields, apply validation rules, and flag malformed outputs (see the sketch below).
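A sketch of a validation evaluator, assuming an illustrative required-field schema:

```python
import json

# Illustrative schema: required field names mapped to their expected types.
REQUIRED_FIELDS = {"title": str, "tags": list, "confidence": (int, float)}

def evaluate(input_data: dict, model_output: str, metadata: dict | None = None) -> dict:
    # Flag malformed outputs: must be valid JSON with all required fields of the right type.
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return {"score": 0, "label": "invalid_json"}

    missing = [f for f in REQUIRED_FIELDS if f not in parsed]
    wrong_type = [f for f, t in REQUIRED_FIELDS.items()
                  if f in parsed and not isinstance(parsed[f], t)]
    if missing or wrong_type:
        return {"score": 0, "label": "malformed",
                "missing": missing, "wrong_type": wrong_type}
    return {"score": 1, "label": "valid"}
```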

⚡ Priority

High - Would significantly improve my experience

🔧 Component

Evaluation

🔄 Alternatives Considered

No response

🎨 Mockups/Designs

No response

⚙️ Technical Details

No response

✅ Acceptance Criteria

No response

📝 Additional Context

No response

Metadata

Assignees: no one assigned
Labels: enhancement (New feature or request)
