A curated list of tools, methods & platforms for evaluating AI quality in real applications.
A curated list of tools, frameworks, benchmarks, and observability platforms for evaluating LLMs, RAG pipelines, and autonomous agents, with a focus on minimizing hallucinations and measuring practical performance in real production environments.
- Anthropic Model Evals
- Anthropic's evaluation suite for safety, capabilities, and alignment testing of language models.
- ColossalEval
- Unified pipeline for classic metrics plus GPT-assisted scoring across public datasets.
- DeepEval
- Python unit-test style metrics for hallucination, relevance, toxicity, and bias (usage sketch below).
- Hugging Face lighteval
- Toolkit powering HF leaderboards with 1k+ tasks and pluggable metrics.
- Inspect AI
- UK AI Safety Institute framework for scripted eval plans, tool calls, and model-graded rubrics.
- MLflow Evaluators
- Eval API that logs LLM scores next to classic experiment tracking runs.
- OpenAI Evals
- Reference harness plus registry spanning reasoning, extraction, and safety evals.
- OpenCompass
- Research harness with CascadeEvaluator, CompassRank syncing, and LLM-as-judge utilities.
- Prompt Flow
- Flow builder with built-in evaluation DAGs, dataset runners, and CI hooks.
- Promptfoo
- Local-first CLI and dashboard for evaluating prompts, RAG flows, and agents with cost tracking and regression detection.
- Ragas
- Evaluation library that grades answers, context, and grounding with pluggable scorers (usage sketch below).
- TruLens
- Feedback function framework for chains and agents with customizable judge models.
- W&B Weave Evaluations
- Managed evaluation orchestrator with dataset versioning and dashboards.
- ZenML
- Pipeline framework that bakes evaluation steps and guardrail metrics into LLM workflows.
- Braintrust
- Hosted evaluation workspace with CI-style regression tests, agent sandboxes, and token cost tracking.
- LangSmith
- Hosted tracing plus datasets, batched evals, and regression gating for LangChain apps.
- W&B Prompt Registry
- Prompt evaluation templates with reproducible scoring and reviews.
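Most of the open-source frameworks above expose metrics as plain Python objects that slot into a test runner. A minimal DeepEval sketch, assuming a recent `deepeval` release and an LLM judge configured via `OPENAI_API_KEY`; the metric choice and threshold are illustrative:

```python
# pip install deepeval  -- judge model assumed to be configured, e.g. via OPENAI_API_KEY
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_return_policy_answer():
    test_case = LLMTestCase(
        input="What is your return policy?",
        actual_output="You can return any item within 30 days for a full refund.",
        retrieval_context=["Items may be returned within 30 days of purchase."],
    )
    # Fails the pytest-style test if judged relevancy drops below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```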
- EvalScope RAG
- Guides and templates that extend Ragas-style metrics with domain rubrics.
- LlamaIndex Evaluation
- Modules for replaying queries, scoring retrievers, and comparing query engines.
- Open RAG Eval
- Vectara harness with pluggable datasets for comparing retrievers and prompts.
- RAGEval
- Framework that auto-generates corpora, questions, and RAG rubrics for completeness.
- R-Eval
- Toolkit for robust RAG scoring aligned with the Evaluation of RAG survey taxonomy.
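Several of the RAG harnesses above extend Ragas-style metrics, so the Ragas call pattern is worth seeing once. A minimal sketch using the pre-0.2 `evaluate()` API; newer releases move to `EvaluationDataset`/`SingleTurnSample`, so treat the exact names as version-dependent:

```python
# pip install ragas datasets  -- judge model assumed to be configured, e.g. via OPENAI_API_KEY
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = Dataset.from_dict({
    "question": ["What is your return policy?"],
    "answer": ["You can return any item within 30 days for a full refund."],
    "contexts": [["Items may be returned within 30 days of purchase."]],
})

# Scores grounding (faithfulness) and relevance of each answer against its contexts.
print(evaluate(data, metrics=[faithfulness, answer_relevancy]))
```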
- BEIR
- Benchmark suite covering dense, sparse, and hybrid retrieval tasks.
- ColBERT
- Late-interaction dense retriever with evaluation scripts for IR datasets.
- MTEB
- Embeddings benchmark measuring retrieval, reranking, and similarity quality.
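A minimal MTEB sketch for scoring one embedding model on a single retrieval task; it assumes the `mteb` and `sentence-transformers` packages and uses the long-standing `MTEB(tasks=...)` entry point, which newer releases are replacing with `mteb.get_tasks`:

```python
# pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Runs one small BEIR-derived retrieval task; results are also written as JSON.
evaluation = MTEB(tasks=["SciFact"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```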
- Awesome-RAG-Evaluation
- Curated catalog of RAG evaluation metrics, datasets, and leaderboards.
- Comparing LLMs on Real-World Retrieval
- Empirical analysis of how language models perform on practical retrieval tasks.
- RAG Evaluation Survey
- Comprehensive paper covering metrics, judgments, and open problems for RAG.
- RAGTruth
- Human-annotated dataset for measuring hallucinations and faithfulness in RAG answers.
- AlpacaEval
- Automated instruction-following evaluator with length-controlled LLM judge scoring.
- ChainForge
- Visual IDE for comparing prompts, sampling models, and scoring batches with rubrics.
- Guardrails AI
- Declarative validation framework that enforces schemas, correction chains, and judgments.
- Lakera Guard
- Hosted prompt security platform with red-team datasets for jailbreak and injection testing.
- PromptBench
- Benchmark suite for adversarial prompt stress tests across diverse tasks.
- Red Teaming Handbook
- Microsoft playbook for adversarial prompt testing and mitigation patterns.
- Deepchecks Evaluation Playbook
- Survey of evaluation metrics, failure modes, and platform comparisons.
- HELM
- Holistic Evaluation of Language Models methodology emphasizing multi-criteria scoring.
- Instruction-Following Evaluation (IFEval)
- Constraint-verification prompts for automatically checking instruction compliance (toy checker below).
- OpenAI Cookbook Evals
- Practical notebooks showing how to build custom evals.
- Safety Evaluation Guides
- Cloud vendor recipes for testing quality, safety, and risk.
- Who Validates the Validators?
- EvalGen workflow aligning LLM judges with human rubrics via mixed-initiative criteria design.
- ZenML Evaluation Playbook
- Playbook for embedding eval gates into pipelines and deployments.
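The IFEval entry above relies on machine-checkable constraints rather than a judge model, and the idea is easy to illustrate without any framework. A toy sketch; the constraints and checker below are simplified stand-ins, not IFEval's actual verification code:

```python
# Toy illustration of constraint-verification scoring in the spirit of IFEval.
def within_word_limit(response: str, max_words: int) -> bool:
    return len(response.split()) <= max_words

def contains_keyword(response: str, keyword: str) -> bool:
    return keyword.lower() in response.lower()

prompt = "In at most 20 words, explain retrieval-augmented generation. Mention the word 'index'."
response = "RAG retrieves passages from an index and feeds them to the model so answers stay grounded."

checks = [within_word_limit(response, 20), contains_keyword(response, "index")]
print(f"instruction-following score: {sum(checks) / len(checks):.2f}")
```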
- Agenta
- End-to-end LLM developer platform for prompt engineering, evaluation, and deployment.
- Arize Phoenix
- OpenTelemetry-native observability and evaluation toolkit for RAG, LLMs, and agents (setup sketch below).
- DocETL
- ETL system for complex document processing with LLMs and built-in quality checks.
- Giskard
- Testing framework for ML models with vulnerability scanning and LLM-specific detectors.
- Helicone
- Open-source LLM observability platform with cost tracking, caching, and evaluation tools.
- Langfuse
- Open-source LLM engineering platform providing tracing, eval dashboards, and prompt analytics.
- Lilac
- Data curation tool for exploring and enriching datasets with semantic search and clustering.
- LiteLLM
- Unified API for 100+ LLM providers with cost tracking, fallbacks, and load balancing (usage sketch below).
- Lunary
- Production toolkit for LLM apps with tracing, prompt management, and evaluation pipelines.
- Mirascope
- Python toolkit for building LLM applications with structured outputs and evaluation utilities.
- OpenLIT
- Telemetry instrumentation for LLM apps with built-in quality metrics and guardrail hooks.
- OpenLLMetry
- OpenTelemetry instrumentation for LLM traces that feed any backend or custom eval logic.
- Opik
- Self-hostable evaluation and observability hub with datasets, scoring jobs, and interactive traces.
- Rhesis
- Collaborative testing platform with automated test generation and multi-turn conversation simulation for LLM and agentic applications.
- traceAI
- Open-source multi-modal tracing and diagnostics framework for LLM, RAG, and agent workflows built on OpenTelemetry.
- UpTrain
- OSS/hosted evaluation suite with 20+ checks, root-cause analysis tooling, and LlamaIndex integrations.
- VoltAgent
- TypeScript agent framework paired with VoltOps for trace inspection and regression testing.
- Zeno
- Data-centric evaluation UI for slicing failures, comparing prompts, and debugging retrieval quality.
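For the tracing-oriented tools above, a local trace viewer is usually a few lines of setup. An Arize Phoenix sketch, assuming the `arize-phoenix` packages plus an OpenInference instrumentor for your client library; the exact wiring has shifted across Phoenix versions, so treat this as a sketch rather than the canonical setup:

```python
# pip install arize-phoenix arize-phoenix-otel openinference-instrumentation-openai openai
import phoenix as px
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

px.launch_app()                # local Phoenix UI, typically at http://localhost:6006
tracer_provider = register()   # route OpenTelemetry spans to the local Phoenix collector
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, OpenAI calls made in this process appear as traces in the Phoenix UI.
```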
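And a LiteLLM sketch showing the provider-agnostic call shape that makes it handy as an eval backend; the model names are placeholders and the matching provider API keys are assumed to be set in the environment:

```python
# pip install litellm
from litellm import completion

# The OpenAI-style call shape stays the same; swapping the model string switches providers.
for model in ["gpt-4o-mini", "claude-3-5-haiku-20241022"]:
    response = completion(
        model=model,
        messages=[{"role": "user", "content": "Summarize BEIR in one sentence."}],
    )
    print(model, "->", response.choices[0].message.content)
```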
- ChatIntel
- Conversation analytics platform for evaluating chatbot quality, sentiment, and user satisfaction.
- Confident AI
- DeepEval-backed platform for scheduled eval suites, guardrails, and production monitors.
- Datadog LLM Observability
- Datadog module capturing LLM traces, metrics, and safety signals.
- Deepchecks LLM Evaluation
- Managed eval suites with dataset versioning, dashboards, and alerting.
- Eppo
- Experimentation platform with AI-specific evaluation metrics and statistical rigor for LLM A/B testing.
- Future AGI
- Multi-modal evaluation, simulation, and optimization platform for reliable AI systems across software and hardware.
- Galileo
- Evaluation and data-curation studio with labeling, slicing, and issue triage.
- HoneyHive
- Evaluation and observability platform with prompt versioning, A/B testing, and fine-tuning workflows.
- Humanloop
- Production prompt management with human-in-the-loop evals and annotation queues.
- Maxim AI
- Evaluation and observability platform focusing on agent simulations and monitoring.
- Orq.ai
- LLM operations platform with prompt management, evaluation workflows, and deployment pipelines.
- PostHog LLM Analytics
- Product analytics toolkit extended to track custom LLM events and metrics.
- PromptLayer
- Prompt engineering platform with version control, evaluation tracking, and team collaboration.
- Amazon Bedrock Evaluations
- Managed service for scoring foundation models and RAG pipelines.
- Amazon Bedrock Guardrails
- Safety layer that evaluates prompts and responses for policy compliance.
- Azure AI Foundry Evaluations
- Evaluation flows and risk reports wired into Prompt Flow projects.
- Vertex AI Generative AI Evaluation
- Adaptive rubric-based evaluation for Google and third-party models.
- AGIEval
- Human-centric standardized exams spanning entrance tests, legal, and math scenarios.
- BIG-bench
- Collaborative benchmark probing reasoning, commonsense, and long-tail tasks.
- CommonGen-Eval
- GPT-4-judged CommonGen-lite suite for constrained commonsense text generation.
- DyVal
- Dynamic reasoning benchmark that varies difficulty and graph structure to stress models.
- LM Evaluation Harness
- Standard harness for scoring autoregressive models on dozens of tasks (usage sketch below).
- LLM-Uncertainty-Bench
- Adds uncertainty-aware scoring across question answering, reading comprehension, inference, dialog, and summarization tasks.
- LLMBar
- Meta-eval testing whether LLM judges can spot instruction-following failures.
- LV-Eval
- Long-context suite with five length tiers up to 256K tokens and distraction controls.
- MMLU
- Massive multitask language understanding benchmark for academic and professional subjects.
- MMLU-Pro
- Harder 10-choice extension focused on reasoning-rich, low-leakage questions.
- PertEval
- Knowledge-invariant perturbations that detect and debias accuracy inflation in multiple-choice evaluations.
- FinEval
- Chinese financial QA and reasoning benchmark across regulation, accounting, and markets.
- LAiW
- Legal benchmark covering retrieval, foundation inference, and complex case applications in Chinese law.
- HumanEval
- Unit-test-based benchmark for code synthesis and docstring reasoning (scoring sketch below).
- MATH
- Competition-level math benchmark targeting multi-step symbolic reasoning.
- MBPP
- Mostly Basic Programming Problems benchmark for small coding tasks.
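Many of the static benchmarks above (MMLU, MMLU-Pro, HumanEval, MATH, and MBPP among them) ship as tasks in LM Evaluation Harness. A minimal Python-API sketch, assuming a recent `lm-eval` release and a small Hugging Face model; task names and the `simple_evaluate` signature vary between versions:

```python
# pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["mmlu_abstract_algebra"],              # any registered task name works here
    num_fewshot=0,
)
print(results["results"])
```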
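HumanEval scoring itself is a two-step flow: write model completions to a JSONL file, then run the benchmark's functional-correctness checker over it. A sketch built on the `human-eval` package's documented helpers; `generate_completion` is a hypothetical stand-in for your own model call:

```python
# pip install human-eval  -- the checker executes model-written code, so follow its sandboxing guidance
from human_eval.data import read_problems, write_jsonl

def generate_completion(prompt: str) -> str:
    # Hypothetical stand-in: call your model here and return the completed function body.
    return "    return 1\n"

problems = read_problems()
samples = [
    {"task_id": task_id, "completion": generate_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)
# Score afterwards with the package's CLI: evaluate_functional_correctness samples.jsonl
```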
- AgentBench
- Evaluates LLMs acting as agents across simulated domains like games and coding.
- GAIA
- Tool-use benchmark requiring grounded reasoning with live web access and planning.
- MetaTool Tasks
- Tool-calling benchmark and eval harness for agents built around LLaMA models.
- SuperCLUE-Agent
- Chinese agent eval covering tool use, planning, long/short-term memory, and APIs.
- AdvBench
- Adversarial prompt benchmark for jailbreak and misuse resistance measurement.
- BBQ
- Bias-sensitive QA sets measuring stereotype reliance and ambiguous cases.
- ToxiGen
- Toxic language generation and classification benchmark for robustness checks.
- TruthfulQA
- Measures factuality and hallucination propensity via adversarially written questions.
- CompassRank
- OpenCompass leaderboard comparing frontier and research models across multi-domain suites.
- LLM Agents Benchmark Collections
- Aggregated leaderboard comparing multi-agent safety and reliability suites.
- Open LLM Leaderboard
- Hugging Face benchmark board with IFEval, MMLU-Pro, GPQA, and more.
- OpenAI Evals Registry
- Community suites and scores covering accuracy, safety, and instruction following.
- Scale SEAL Leaderboard
- Expert-rated leaderboard covering reasoning, coding, and safety via SEAL evaluations.
- AI Evals for Engineers & PMs
- Cohort course from Hamel Husain & Shreya Shankar with lifetime access to the course reader, a Discord community, an AI Eval Assistant, and live office hours.
- AlignEval
- Eugene Yan's guide on building LLM judges by following methodical alignment processes.
- Applied LLMs
- Practical lessons from a year of building with LLMs, emphasizing evaluation as a core practice.
- Data Flywheels for LLM Applications
- Iterative data improvement processes for building better LLM systems.
- Error Analysis & Prioritizing Next Steps
- Andrew Ng walkthrough showing how to slice traces and focus eval work via classic ML techniques.
- Error Analysis Before Tests
- Office hours notes on why error analysis should precede writing automated tests.
- Eval Tools Comparison
- Detailed comparison of evaluation tools including Braintrust, LangSmith, and Promptfoo.
- Evals for AI Engineers
- O'Reilly book by Shreya Shankar & Hamel Husain on systematic error analysis, evaluation pipelines, and LLM-as-a-judge.
- Evaluating RAG Systems
- Practical guidance on RAG evaluation covering retrieval quality and generation assessment.
- Field Guide to Rapidly Improving AI Products
- Comprehensive guide on error analysis, data viewers, and systematic improvement from 30+ implementations.
- Inspect AI Deep Dive
- Technical deep dive into Inspect AI framework with hands-on examples.
- LLM Evals FAQ
- Comprehensive FAQ with 45+ articles covering evaluation questions from practitioners.
- LLM Evaluators Survey
- Survey of LLM-as-judge use cases and approaches with practical implementation patterns.
- LLM-as-a-Judge Guide
- In-depth guide on using LLMs as judges for automated evaluation with calibration tips (minimal judge sketch below).
- Mastering LLMs Open Course
- Free 40+ hour course covering evals, RAG, and fine-tuning taught by 25+ industry practitioners.
- Modern IR Evals For RAG
- Why traditional IR evals are insufficient for RAG, covering BEIR and modern approaches.
- Multi-Turn Chat Evals
- Strategies for evaluating multi-turn conversational AI systems.
- Open Source LLM Tools Comparison
- PostHog comparison of open-source LLM observability and evaluation tools.
- Scoping LLM Evals
- Case study on managing evaluation complexity through proper scoping and topic distribution.
- Why AI evals are the hottest new skill
- Lenny's interview covering error analysis, axial coding, eval prompts, and PRD alignment.
- Your AI Product Needs Evals
- Foundational article on why every AI product needs systematic evaluation.
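Several of the guides above center on LLM-as-a-judge. A minimal sketch of the pattern using the OpenAI Python client; the rubric, model name, and score scale are illustrative, and any real judge should be calibrated against human labels as those guides stress:

```python
# pip install openai  -- OPENAI_API_KEY assumed in the environment
import json
from openai import OpenAI

client = OpenAI()

JUDGE_RUBRIC = (
    "Rate the ANSWER for faithfulness to the CONTEXT on a 1-5 scale. "
    'Return JSON: {"score": <int>, "reason": "<one sentence>"}.'
)

def judge(context: str, answer: str, model: str = "gpt-4o-mini") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

print(judge("Items may be returned within 30 days.", "Returns are accepted for 90 days."))
```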
- Arize Phoenix AI Chatbot
- Next.js chatbot with Phoenix tracing, dataset replays, and evaluation jobs.
- Azure LLM Evaluation Samples
- Prompt Flow and Azure AI Foundry projects demonstrating hosted evals.
- Deepchecks QA over CSV
- Example agent wired to Deepchecks scoring plus tracing dashboards.
- OpenAI Evals Demo Evals
- Templates for extending OpenAI Evals with custom datasets.
- Promptfoo Examples
- Ready-made prompt regression suites for RAG, summarization, and agents.
- ZenML Projects
- End-to-end pipelines showing how to weave evaluation steps into LLMOps stacks.
- Awesome ChainForge
- Ecosystem list centered on ChainForge experiments and extensions.
- Awesome-LLM-Eval
- Cross-lingual (Chinese) compendium of eval tooling, papers, datasets, and leaderboards.
- Awesome LLMOps
- Curated tooling for training, deployment, and monitoring of LLM apps.
- Awesome Machine Learning
- Language-specific ML resources that often host evaluation building blocks.
- Awesome RAG
- Broad coverage of retrieval-augmented generation techniques and tools.
- Awesome Self-Hosted
- Massive catalog of self-hostable software, including observability stacks.
- GenAI Notes
- Continuously updated notes and resources on GenAI systems, evaluation, and operations.
Released under the CC0 1.0 Universal license.
Contributions are welcome—please read CONTRIBUTING.md for scope, entry rules, and the pull-request checklist before submitting updates.