A curated list of tools, methods & platforms for evaluating AI quality in real applications.
A curated list of tools, frameworks, benchmarks, and observability platforms for evaluating LLMs, RAG pipelines, and autonomous agents, with a focus on minimizing hallucinations and measuring practical performance in real production environments.
- Anthropic Model Evals
- Anthropic's evaluation suite for safety, capabilities, and alignment testing of language models.
- ColossalEval
- Unified pipeline for classic metrics plus GPT-assisted scoring across public datasets.
- DeepEval
- Python unit-test style metrics for hallucination, relevance, toxicity, and bias (usage sketch below).
- Hugging Face lighteval
- Toolkit powering HF leaderboards with 1k+ tasks and pluggable metrics.
- Inspect AI
- UK AI Safety Institute framework for scripted eval plans, tool calls, and model-graded rubrics.
- MLflow Evaluators
- Eval API that logs LLM scores next to classic experiment tracking runs.
- OpenAI Evals
- Reference harness plus registry spanning reasoning, extraction, and safety evals.
- OpenCompass
- Research harness with CascadeEvaluator, CompassRank syncing, and LLM-as-judge utilities.
- Prompt Flow
- Flow builder with built-in evaluation DAGs, dataset runners, and CI hooks.
- Promptfoo
- Local-first CLI and dashboard for evaluating prompts, RAG flows, and agents with cost tracking and regression detection.
- Ragas
- Evaluation library that grades answers, context, and grounding with pluggable scorers (usage sketch below).
- TruLens
- Feedback function framework for chains and agents with customizable judge models.
- W&B Weave Evaluations
- Managed evaluation orchestrator with dataset versioning and dashboards.
- ZenML
- Pipeline framework that bakes evaluation steps and guardrail metrics into LLM workflows.
- Braintrust
- Hosted evaluation workspace with CI-style regression tests, agent sandboxes, and token cost tracking.
- LangSmith
- Hosted tracing plus datasets, batched evals, and regression gating for LangChain apps.
- W&B Prompt Registry
- Prompt evaluation templates with reproducible scoring and reviews.
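Most of the open-source frameworks above expose metrics as plain Python objects that slot into a test runner. A minimal DeepEval sketch, assuming a recent `deepeval` release and an LLM judge configured via `OPENAI_API_KEY`; the metric choice and threshold are illustrative:

```python
# pip install deepeval  -- judge model assumed to be configured, e.g. via OPENAI_API_KEY
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_return_policy_answer():
    test_case = LLMTestCase(
        input="What is your return policy?",
        actual_output="You can return any item within 30 days for a full refund.",
        retrieval_context=["Items may be returned within 30 days of purchase."],
    )
    # Fails the pytest-style test if judged relevancy drops below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```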
- EvalScope RAG
- Guides and templates that extend Ragas-style metrics with domain rubrics.
- LlamaIndex Evaluation
- Modules for replaying queries, scoring retrievers, and comparing query engines.
- Open RAG Eval
- Vectara harness with pluggable datasets for comparing retrievers and prompts.
- RAGEval
- Framework that auto-generates corpora, questions, and RAG rubrics for completeness.
- R-Eval
- Toolkit for robust RAG scoring aligned with the Evaluation of RAG survey taxonomy.
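Several of the RAG harnesses above extend Ragas-style metrics, so the Ragas call pattern is worth seeing once. A minimal sketch using the pre-0.2 `evaluate()` API; newer releases move to `EvaluationDataset`/`SingleTurnSample`, so treat the exact names as version-dependent:

```python
# pip install ragas datasets  -- judge model assumed to be configured, e.g. via OPENAI_API_KEY
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = Dataset.from_dict({
    "question": ["What is your return policy?"],
    "answer": ["You can return any item within 30 days for a full refund."],
    "contexts": [["Items may be returned within 30 days of purchase."]],
})

# Scores grounding (faithfulness) and relevance of each answer against its contexts.
print(evaluate(data, metrics=[faithfulness, answer_relevancy]))
```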
- BEIR
- Benchmark suite covering dense, sparse, and hybrid retrieval tasks.
- ColBERT
- Late-interaction dense retriever with evaluation scripts for IR datasets.
- MTEB
- Embeddings benchmark measuring retrieval, reranking, and similarity quality.
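A minimal MTEB sketch for scoring one embedding model on a single retrieval task; it assumes the `mteb` and `sentence-transformers` packages and uses the long-standing `MTEB(tasks=...)` entry point, which newer releases are replacing with `mteb.get_tasks`:

```python
# pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Runs one small BEIR-derived retrieval task; results are also written as JSON.
evaluation = MTEB(tasks=["SciFact"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```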
- Awesome-RAG-Evaluation
- Curated catalog of RAG evaluation metrics, datasets, and leaderboards.
- Comparing LLMs on Real-World Retrieval
- Empirical analysis of how language models perform on practical retrieval tasks.
- RAG Evaluation Survey
- Comprehensive paper covering metrics, judgments, and open problems for RAG.
- RAGTruth
- Human-annotated dataset for measuring hallucinations and faithfulness in RAG answers.
- AlpacaEval
- Automated instruction-following evaluator with length-controlled LLM judge scoring.
- ChainForge
- Visual IDE for comparing prompts, sampling models, and scoring batches with rubrics.
- Guardrails AI
- Declarative validation framework that enforces schemas, correction chains, and judgments.
- Lakera Guard
- Hosted prompt security platform with red-team datasets for jailbreak and injection testing.
- PromptBench
- Benchmark suite for adversarial prompt stress tests across diverse tasks.
- Red Teaming Handbook
- Microsoft playbook for adversarial prompt testing and mitigation patterns.
- Deepchecks Evaluation Playbook
- Survey of evaluation metrics, failure modes, and platform comparisons.
- HELM
- Holistic Evaluation of Language Models methodology emphasizing multi-criteria scoring.
- Instruction-Following Evaluation (IFEval)
- Constraint-verification prompts for automatically checking instruction compliance (toy checker below).
- OpenAI Cookbook Evals
- Practical notebooks showing how to build custom evals.
- Safety Evaluation Guides
- Cloud vendor recipes for testing quality, safety, and risk.
- Who Validates the Validators?
- EvalGen workflow aligning LLM judges with human rubrics via mixed-initiative criteria design.
- ZenML Evaluation Playbook
- Playbook for embedding eval gates into pipelines and deployments.
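The IFEval entry above relies on machine-checkable constraints rather than a judge model, and the idea is easy to illustrate without any framework. A toy sketch; the constraints and checker below are simplified stand-ins, not IFEval's actual verification code:

```python
# Toy illustration of constraint-verification scoring in the spirit of IFEval.
def within_word_limit(response: str, max_words: int) -> bool:
    return len(response.split()) <= max_words

def contains_keyword(response: str, keyword: str) -> bool:
    return keyword.lower() in response.lower()

prompt = "In at most 20 words, explain retrieval-augmented generation. Mention the word 'index'."
response = "RAG retrieves passages from an index and feeds them to the model so answers stay grounded."

checks = [within_word_limit(response, 20), contains_keyword(response, "index")]
print(f"instruction-following score: {sum(checks) / len(checks):.2f}")
```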
- Agenta
- End-to-end LLM developer platform for prompt engineering, evaluation, and deployment.
- Arize Phoenix
- OpenTelemetry-native observability and evaluation toolkit for RAG, LLMs, and agents (setup sketch below).
- DocETL
- ETL system for complex document processing with LLMs and built-in quality checks.
- Giskard
- Testing framework for ML models with vulnerability scanning and LLM-specific detectors.
- Helicone
- Open-source LLM observability platform with cost tracking, caching, and evaluation tools.
- Langfuse
- Open-source LLM engineering platform providing tracing, eval dashboards, and prompt analytics.
- Lilac
- Data curation tool for exploring and enriching datasets with semantic search and clustering.
- LiteLLM
- Unified API for 100+ LLM providers with cost tracking, fallbacks, and load balancing (usage sketch below).
- Lunary
- Production toolkit for LLM apps with tracing, prompt management, and evaluation pipelines.
- Mirascope
- Python toolkit for building LLM applications with structured outputs and evaluation utilities.
- OpenLIT
- Telemetry instrumentation for LLM apps with built-in quality metrics and guardrail hooks.
- OpenLLMetry
- OpenTelemetry instrumentation for LLM traces that feed any backend or custom eval logic.
- Opik
- Self-hostable evaluation and observability hub with datasets, scoring jobs, and interactive traces.
- Rhesis
- Collaborative testing platform with automated test generation and multi-turn conversation simulation for LLM and agentic applications.
- traceAI
- Open-source multi-modal tracing and diagnostics framework for LLM, RAG, and agent workflows built on OpenTelemetry.
- UpTrain
- OSS/hosted evaluation suite with 20+ checks, root-cause analysis tooling, and LlamaIndex integrations.
- VoltAgent
- TypeScript agent framework paired with VoltOps for trace inspection and regression testing.
- Zeno
- Data-centric evaluation UI for slicing failures, comparing prompts, and debugging retrieval quality.
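For the tracing-oriented tools above, a local trace viewer is usually a few lines of setup. An Arize Phoenix sketch, assuming the `arize-phoenix` packages plus an OpenInference instrumentor for your client library; the exact wiring has shifted across Phoenix versions, so treat this as a sketch rather than the canonical setup:

```python
# pip install arize-phoenix arize-phoenix-otel openinference-instrumentation-openai openai
import phoenix as px
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

px.launch_app()                # local Phoenix UI, typically at http://localhost:6006
tracer_provider = register()   # route OpenTelemetry spans to the local Phoenix collector
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, OpenAI calls made in this process appear as traces in the Phoenix UI.
```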
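And a LiteLLM sketch showing the provider-agnostic call shape that makes it handy as an eval backend; the model names are placeholders and the matching provider API keys are assumed to be set in the environment:

```python
# pip install litellm
from litellm import completion

# The OpenAI-style call shape stays the same; swapping the model string switches providers.
for model in ["gpt-4o-mini", "claude-3-5-haiku-20241022"]:
    response = completion(
        model=model,
        messages=[{"role": "user", "content": "Summarize BEIR in one sentence."}],
    )
    print(model, "->", response.choices[0].message.content)
```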
- ChatIntel
- Conversation analytics platform for evaluating chatbot quality, sentiment, and user satisfaction.
- Confident AI
- DeepEval-backed platform for scheduled eval suites, guardrails, and production monitors.
- Datadog LLM Observability
- Datadog module capturing LLM traces, metrics, and safety signals.
- Deepchecks LLM Evaluation
- Managed eval suites with dataset versioning, dashboards, and alerting.
- Eppo
- Experimentation platform with AI-specific evaluation metrics and statistical rigor for LLM A/B testing.
- Future AGI
- Multi-modal evaluation, simulation, and optimization platform for reliable AI systems across software and hardware.
- Galileo
- Evaluation and data-curation studio with labeling, slicing, and issue triage.
- HoneyHive
- Evaluation and observability platform with prompt versioning, A/B testing, and fine-tuning workflows.
- Humanloop
- Production prompt management with human-in-the-loop evals and annotation queues.
- Maxim AI
- Evaluation and observability platform focusing on agent simulations and monitoring.
- Orq.ai
- LLM operations platform with prompt management, evaluation workflows, and deployment pipelines.
- PostHog LLM Analytics
- Product analytics toolkit extended to track custom LLM events and metrics.
- PromptLayer
- Prompt engineering platform with version control, evaluation tracking, and team collaboration.
- Amazon Bedrock Evaluations
- Managed service for scoring foundation models and RAG pipelines.
- Amazon Bedrock Guardrails
- Safety layer that evaluates prompts and responses for policy compliance.
- Azure AI Foundry Evaluations
- Evaluation flows and risk reports wired into Prompt Flow projects.
- Vertex AI Generative AI Evaluation
- Adaptive rubric-based evaluation for Google and third-party models.
- AGIEval
- Human-centric standardized exams spanning entrance tests, legal, and math scenarios.
- BIG-bench
- Collaborative benchmark probing reasoning, commonsense, and long-tail tasks.
- CommonGen-Eval
- GPT-4-judged CommonGen-lite suite for constrained commonsense text generation.
- DyVal
- Dynamic reasoning benchmark that varies difficulty and graph structure to stress models.
- LM Evaluation Harness
- Standard harness for scoring autoregressive models on dozens of tasks (usage sketch below).
- LLM-Uncertainty-Bench
- Adds uncertainty-aware scoring across question answering, reading comprehension, inference, dialog, and summarization tasks.
- LLMBar
- Meta-eval testing whether LLM judges can spot instruction-following failures.
- LV-Eval
- Long-context suite with five length tiers up to 256K tokens and distraction controls.
- MMLU
- Massive multitask language understanding benchmark for academic and professional subjects.
- MMLU-Pro
- Harder 10-choice extension focused on reasoning-rich, low-leakage questions.
- PertEval
- Knowledge-invariant perturbations that detect and debias accuracy inflation in multiple-choice evaluations.
- FinEval
- Chinese financial QA and reasoning benchmark across regulation, accounting, and markets.
- LAiW
- Legal benchmark covering retrieval, foundation inference, and complex case applications in Chinese law.
- HumanEval
- Unit-test-based benchmark for code synthesis and docstring reasoning (scoring sketch below).
- MATH
- Competition-level math benchmark targeting multi-step symbolic reasoning.
- MBPP
- Mostly Basic Programming Problems benchmark for small coding tasks.
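Many of the static benchmarks above (MMLU, MMLU-Pro, HumanEval, MATH, and MBPP among them) ship as tasks in LM Evaluation Harness. A minimal Python-API sketch, assuming a recent `lm-eval` release and a small Hugging Face model; task names and the `simple_evaluate` signature vary between versions:

```python
# pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["mmlu_abstract_algebra"],              # any registered task name works here
    num_fewshot=0,
)
print(results["results"])
```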
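HumanEval scoring itself is a two-step flow: write model completions to a JSONL file, then run the benchmark's functional-correctness checker over it. A sketch built on the `human-eval` package's documented helpers; `generate_completion` is a hypothetical stand-in for your own model call:

```python
# pip install human-eval  -- the checker executes model-written code, so follow its sandboxing guidance
from human_eval.data import read_problems, write_jsonl

def generate_completion(prompt: str) -> str:
    # Hypothetical stand-in: call your model here and return the completed function body.
    return "    return 1\n"

problems = read_problems()
samples = [
    {"task_id": task_id, "completion": generate_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)
# Score afterwards with the package's CLI: evaluate_functional_correctness samples.jsonl
```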
- AgentBench
- Evaluates LLMs acting as agents across simulated domains like games and coding.
- GAIA
- Tool-use benchmark requiring grounded reasoning with live web access and planning.
- MetaTool Tasks
- Tool-calling benchmark and eval harness for agents built around LLaMA models.
- SuperCLUE-Agent
- Chinese agent eval covering tool use, planning, long/short-term memory, and APIs.
- AdvBench
- Adversarial prompt benchmark for jailbreak and misuse resistance measurement.
- BBQ
- Bias-sensitive QA sets measuring stereotype reliance and ambiguous cases.
- ToxiGen
- Toxic language generation and classification benchmark for robustness checks.
- TruthfulQA
- Measures factuality and hallucination propensity via adversarially written questions.
- CompassRank
- OpenCompass leaderboard comparing frontier and research models across multi-domain suites.
- LLM Agents Benchmark Collections
- Aggregated leaderboard comparing multi-agent safety and reliability suites.
- Open LLM Leaderboard
- Hugging Face benchmark board with IFEval, MMLU-Pro, GPQA, and more.
- OpenAI Evals Registry
- Community suites and scores covering accuracy, safety, and instruction following.
- Scale SEAL Leaderboard
- Expert-rated leaderboard covering reasoning, coding, and safety via SEAL evaluations.
- AI Evals for Engineers & PMs
- Cohort course from Hamel Husain & Shreya Shankar with lifetime access to the course reader, a Discord community, an AI Eval Assistant, and live office hours.
- AlignEval
- Eugene Yan's guide on building LLM judges by following methodical alignment processes.
- Applied LLMs
- Practical lessons from a year of building with LLMs, emphasizing evaluation as a core practice.
- Data Flywheels for LLM Applications
- Iterative data improvement processes for building better LLM systems.
- Error Analysis & Prioritizing Next Steps
- Andrew Ng walkthrough showing how to slice traces and focus eval work via classic ML techniques.
- Error Analysis Before Tests
- Office hours notes on why error analysis should precede writing automated tests.
- Eval Tools Comparison
- Detailed comparison of evaluation tools including Braintrust, LangSmith, and Promptfoo.
- Evals for AI Engineers
- O'Reilly book by Shreya Shankar & Hamel Husain on systematic error analysis, evaluation pipelines, and LLM-as-a-judge.
- Evaluating RAG Systems
- Practical guidance on RAG evaluation covering retrieval quality and generation assessment.
- Field Guide to Rapidly Improving AI Products
- Comprehensive guide on error analysis, data viewers, and systematic improvement from 30+ implementations.
- Inspect AI Deep Dive
- Technical deep dive into Inspect AI framework with hands-on examples.
- LLM Evals FAQ
- Comprehensive FAQ with 45+ articles covering evaluation questions from practitioners.
- LLM Evaluators Survey
- Survey of LLM-as-judge use cases and approaches with practical implementation patterns.
- LLM-as-a-Judge Guide
- In-depth guide on using LLMs as judges for automated evaluation with calibration tips (minimal judge sketch below).
- Mastering LLMs Open Course
- Free 40+ hour course covering evals, RAG, and fine-tuning taught by 25+ industry practitioners.
- Modern IR Evals For RAG
- Why traditional IR evals are insufficient for RAG, covering BEIR and modern approaches.
- Multi-Turn Chat Evals
- Strategies for evaluating multi-turn conversational AI systems.
- Open Source LLM Tools Comparison
- PostHog comparison of open-source LLM observability and evaluation tools.
- Scoping LLM Evals
- Case study on managing evaluation complexity through proper scoping and topic distribution.
- Why AI evals are the hottest new skill
- Lenny's interview covering error analysis, axial coding, eval prompts, and PRD alignment.
- Your AI Product Needs Evals
- Foundational article on why every AI product needs systematic evaluation.
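Several of the guides above center on LLM-as-a-judge. A minimal sketch of the pattern using the OpenAI Python client; the rubric, model name, and score scale are illustrative, and any real judge should be calibrated against human labels as those guides stress:

```python
# pip install openai  -- OPENAI_API_KEY assumed in the environment
import json
from openai import OpenAI

client = OpenAI()

JUDGE_RUBRIC = (
    "Rate the ANSWER for faithfulness to the CONTEXT on a 1-5 scale. "
    'Return JSON: {"score": <int>, "reason": "<one sentence>"}.'
)

def judge(context: str, answer: str, model: str = "gpt-4o-mini") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

print(judge("Items may be returned within 30 days.", "Returns are accepted for 90 days."))
```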
- Arize Phoenix AI Chatbot
- Next.js chatbot with Phoenix tracing, dataset replays, and evaluation jobs.
- Azure LLM Evaluation Samples
- Prompt Flow and Azure AI Foundry projects demonstrating hosted evals.
- Deepchecks QA over CSV
- Example agent wired to Deepchecks scoring plus tracing dashboards.
- OpenAI Evals Demo Evals
- Templates for extending OpenAI Evals with custom datasets.
- Promptfoo Examples
- Ready-made prompt regression suites for RAG, summarization, and agents.
- ZenML Projects
- End-to-end pipelines showing how to weave evaluation steps into LLMOps stacks.
- Awesome ChainForge
- Ecosystem list centered on ChainForge experiments and extensions.
- Awesome-LLM-Eval
- Cross-lingual (Chinese) compendium of eval tooling, papers, datasets, and leaderboards.
- Awesome LLMOps
- Curated tooling for training, deployment, and monitoring of LLM apps.
- Awesome Machine Learning
- Language-specific ML resources that often host evaluation building blocks.
- Awesome RAG
- Broad coverage of retrieval-augmented generation techniques and tools.
- Awesome Self-Hosted
- Massive catalog of self-hostable software, including observability stacks.
- GenAI Notes
- Continuously updated notes and resources on GenAI systems, evaluation, and operations.
Released under the CC0 1.0 Universal license.
Contributions are welcome—please read CONTRIBUTING.md for scope, entry rules, and the pull-request checklist before submitting updates.