Description
📋 Please ensure that:
- I have searched existing issues to avoid duplicates
- I have provided a clear problem statement and solution
- I understand this is a feature request and not a bug report
- I am willing to help implement this feature if needed
- I have submitted this feature request in English (otherwise it will not be processed)
🎯 Problem Statement
Currently, evaluation relies mainly on LLM evaluators. However, in many real-world scenarios, users need to define custom evaluation logic that compares model outputs against expected results in a more flexible way.
We propose supporting a "Code Evaluator" feature, where users can submit a piece of code in Python (or another supported language) that executes custom logic based on:
- Evaluation input data (e.g., ground truth, expected labels).
- Model return result (e.g., JSON output).
- Optional model parameters / metadata.
The custom code should be able to parse the model’s output, apply domain-specific evaluation rules, and generate a final evaluation result (e.g., 0/1, a score, or even a custom label).
💡 Proposed Solution
- Allow users to define evaluators by writing code (starting with Python support).
- Provide an execution environment that runs the evaluator code for each evaluation case.
- Pass the following inputs into the evaluator:
  - `input_data` (from the test case)
  - `model_output` (raw model return)
  - `metadata` (model parameters, prompt, etc.)
- Have the evaluator return a structured result (e.g., 0/1, a numeric score, or JSON).
- Store evaluator results alongside built-in evaluation results in the experiment summary.
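
A minimal sketch of what the evaluator contract could look like; the function name `evaluate`, the argument names, and the result shape are illustrative assumptions, not an existing API:

```python
# Hypothetical evaluator contract -- the function name, argument names,
# and result shape are assumptions for illustration, not an existing API.
def evaluate(input_data: dict, model_output: str, metadata: dict | None = None) -> dict:
    """Called once per evaluation case; returns a structured result."""
    # Custom domain logic would go here; this placeholder returns a fixed score.
    return {"score": 1.0, "label": "ok"}
```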
📋 Use Cases
**Classification Evaluation**
- User designs a prompt to classify text into categories.
- Test data includes the expected category labels.
- Model returns a JSON response, e.g. `{ "class": "positive" }`.
- Code evaluator parses the JSON, compares the `"class"` field to the expected label, and outputs 1 (match) or 0 (mismatch), as sketched below.
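
A sketch of how such an evaluator might look under the contract assumed above; the field names `expected_class` and `class` are illustrative:

```python
import json

# Sketch for the classification use case: parse the model's JSON output and
# compare its "class" field to the expected label from the test data.
# Field names ("expected_class", "class") are assumptions.
def evaluate(input_data: dict, model_output: str, metadata: dict | None = None) -> dict:
    expected = input_data.get("expected_class")
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return {"score": 0, "reason": "model output is not valid JSON"}
    predicted = parsed.get("class")
    return {
        "score": 1 if predicted == expected else 0,
        "reason": f"predicted={predicted!r}, expected={expected!r}",
    }
```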
**Custom Scoring Logic**
- Instead of a binary match, the evaluator computes a similarity score (e.g., cosine similarity between embeddings, or string edit distance); see the sketch below.
- Outputs a float score (0.0–1.0).
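
A sketch of a similarity-based evaluator, using the standard-library `difflib.SequenceMatcher` as a stand-in for embedding cosine similarity; the field name `expected_text` is an assumption:

```python
from difflib import SequenceMatcher

# Sketch of a similarity-based score instead of a binary match;
# SequenceMatcher.ratio() stands in for embedding cosine similarity.
def evaluate(input_data: dict, model_output: str, metadata: dict | None = None) -> dict:
    expected = input_data.get("expected_text", "")
    score = SequenceMatcher(None, expected, model_output).ratio()  # 0.0-1.0
    return {"score": round(score, 3)}
```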
**Structured Output Validation**
- For tasks requiring structured JSON responses, the evaluator can check the presence/validity of required fields, apply validation rules, and flag malformed outputs, as sketched below.
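
A sketch of a structured-output validator under the same assumed contract; the required-field schema is purely illustrative:

```python
import json

# Sketch of structured-output validation: check that required fields are
# present and of the expected type. The schema below is purely illustrative.
REQUIRED_FIELDS = {"class": str, "confidence": float}

def evaluate(input_data: dict, model_output: str, metadata: dict | None = None) -> dict:
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return {"score": 0, "reason": "malformed JSON"}
    missing = [name for name in REQUIRED_FIELDS if name not in parsed]
    wrong_type = [name for name, expected_type in REQUIRED_FIELDS.items()
                  if name in parsed and not isinstance(parsed[name], expected_type)]
    if missing or wrong_type:
        return {"score": 0, "reason": f"missing={missing}, wrong type={wrong_type}"}
    return {"score": 1, "reason": "all required fields present and valid"}
```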
⚡ Priority
High - Would significantly improve my experience
🔧 Component
Evaluation
🔄 Alternatives Considered
No response
🎨 Mockups/Designs
No response
⚙️ Technical Details
No response
✅ Acceptance Criteria
No response
📝 Additional Context
No response