
[FEATURE] Support Custom Code Evaluators for Prompt Experiment Results #141

@xiaohan-0114

Description


📋 Please ensure that:

  • I have searched existing issues to avoid duplicates
  • I have provided a clear problem statement and solution
  • I understand this is a feature request and not a bug report
  • I am willing to help implement this feature if needed
  • I have submitted this feature request in English (otherwise it will not be processed)

🎯 Problem Statement

Currently, evaluation relies mainly on LLM evaluators. However, in many real-world scenarios, users need to define custom evaluation logic that compares model outputs against expected results in a more flexible way.

We propose supporting a "Code Evaluator" feature, where users can submit Python code (or code in other supported languages) that executes custom logic based on:

  • Evaluation input data (e.g., ground truth, expected labels)
  • Model return result (e.g., JSON output)
  • Optional model parameters / metadata

The custom code should be able to parse the model’s output, apply domain-specific evaluation rules, and generate a final evaluation result (e.g., 0/1, a score, or even a custom label).
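For concreteness, a minimal sketch of what such a user-submitted evaluator could look like, assuming the platform calls a single entry-point function per evaluation case (the function name `evaluate` and the field names `expected` / `answer` are illustrative, not an existing API):

```python
import json
from typing import Any

# Hypothetical contract: the platform calls a user-defined `evaluate` function
# once per evaluation case and stores whatever structured result it returns.
def evaluate(input_data: dict[str, Any],
             model_output: str,
             metadata: dict[str, Any] | None = None) -> dict[str, Any]:
    expected = input_data.get("expected")      # ground truth from the test case
    try:
        parsed = json.loads(model_output)      # raw model return, assumed to be JSON here
    except json.JSONDecodeError:
        return {"score": 0, "label": "unparseable_output"}
    matched = parsed.get("answer") == expected
    return {"score": 1 if matched else 0,
            "label": "match" if matched else "mismatch"}
```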

💡 Proposed Solution

  • Allow users to define evaluators by writing code (starting with Python support).
  • Provide an execution environment that runs the evaluator code for each evaluation case (see the runner sketch below).
  • Inputs passed into the evaluator:
      • input_data (from the test case)
      • model_output (raw model return)
      • metadata (model parameters, prompt, etc.)
  • The evaluator returns a structured result (e.g., 0/1, a numeric score, or JSON).
  • Store evaluator results alongside built-in evaluation results in the experiment summary.
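A rough sketch of how the execution side could drive such an evaluator, one call per case, collecting results so they can be stored next to the built-in ones (the `run_code_evaluator` name and the `cases` record layout are assumptions for illustration):

```python
from typing import Any, Callable

def run_code_evaluator(evaluate: Callable[..., dict[str, Any]],
                       cases: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Run the user-supplied evaluator once per evaluation case.
    results = []
    for case in cases:
        try:
            result = evaluate(
                input_data=case["input_data"],      # from the test case
                model_output=case["model_output"],  # raw model return
                metadata=case.get("metadata"),      # model parameters, prompt, etc.
            )
        except Exception as exc:  # user code may raise; record the failure instead of aborting the run
            result = {"score": None, "error": str(exc)}
        results.append({"case_id": case.get("id"), **result})
    return results
```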

📋 Use Cases

Classification Evaluation

  • The user designs a prompt to classify text into categories.
  • Test data includes the expected category labels.
  • The model returns a JSON response, e.g. { "class": "positive" }.
  • The code evaluator parses the JSON, compares the "class" field to the expected label, and outputs 1 (match) or 0 (mismatch), as sketched below.
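A possible evaluator for this use case (the field names `expected_class` and `class` are illustrative):

```python
import json

def evaluate(input_data: dict, model_output: str, metadata: dict | None = None) -> int:
    # Compare the model's predicted class against the expected label from the test data.
    expected = input_data["expected_class"]        # e.g. "positive"
    try:
        predicted = json.loads(model_output).get("class")
    except json.JSONDecodeError:
        return 0                                   # malformed output counts as a mismatch
    return 1 if predicted == expected else 0

# Example: evaluate({"expected_class": "positive"}, '{"class": "positive"}') -> 1
```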

Custom Scoring Logic

  • Instead of a binary match, the evaluator computes a similarity score (e.g., cosine similarity between embeddings, or string edit distance).
  • It outputs a float score (0.0–1.0); a sketch follows below.
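A minimal sketch of such a scoring evaluator, using the standard library's difflib string-similarity ratio as a stand-in for embedding cosine similarity or edit distance (`expected_text` is an illustrative field name):

```python
from difflib import SequenceMatcher

def evaluate(input_data: dict, model_output: str, metadata: dict | None = None) -> float:
    # Return a 0.0-1.0 similarity score instead of a binary match.
    expected = input_data["expected_text"]
    return SequenceMatcher(None, expected.strip(), model_output.strip()).ratio()

# Example: evaluate({"expected_text": "cats are great"}, "Cats are great!") -> ~0.9
```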

Structured Output Validation

  • For tasks requiring structured JSON responses, the evaluator can check the presence and validity of required fields, apply validation rules, and flag malformed outputs (see the sketch below).
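A sketch of a validation evaluator, assuming an illustrative required-field schema:

```python
import json

# Illustrative schema: required field names mapped to their expected types.
REQUIRED_FIELDS = {"title": str, "tags": list, "confidence": (int, float)}

def evaluate(input_data: dict, model_output: str, metadata: dict | None = None) -> dict:
    # Flag malformed outputs: must be valid JSON with all required fields of the right type.
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return {"score": 0, "label": "invalid_json"}

    missing = [f for f in REQUIRED_FIELDS if f not in parsed]
    wrong_type = [f for f, t in REQUIRED_FIELDS.items()
                  if f in parsed and not isinstance(parsed[f], t)]
    if missing or wrong_type:
        return {"score": 0, "label": "malformed",
                "missing": missing, "wrong_type": wrong_type}
    return {"score": 1, "label": "valid"}
```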

⚡ Priority

High - Would significantly improve my experience

🔧 Component

Evaluation

🔄 Alternatives Considered

No response

🎨 Mockups/Designs

No response

⚙️ Technical Details

No response

✅ Acceptance Criteria

No response

📝 Additional Context

No response

Metadata

Assignees: no one assigned
Labels: enhancement (New feature or request)
