Awesome AI Eval

A curated list of tools, methods & platforms for evaluating AI quality in real applications.

A curated list of tools, frameworks, benchmarks, and observability platforms for evaluating LLMs, RAG pipelines, and autonomous agents, with the goal of minimizing hallucinations and measuring practical performance in real production environments.

Contents

  • Tools
  • Platforms
  • Benchmarks
  • Resources
  • Licensing
  • Contributing

Tools

Evaluators and Test Harnesses

Core Frameworks

  • Anthropic Model Evals - Anthropic's evaluation suite for safety, capabilities, and alignment testing of language models.
  • ColossalEval - Unified pipeline for classic metrics plus GPT-assisted scoring across public datasets.
  • DeepEval - Python unit-test style metrics for hallucination, relevance, toxicity, and bias (see the pytest-style sketch after this list).
  • Hugging Face lighteval - Toolkit powering HF leaderboards with 1k+ tasks and pluggable metrics.
  • Inspect AI - UK AI Safety Institute framework for scripted eval plans, tool calls, and model-graded rubrics.
  • MLflow Evaluators - Eval API that logs LLM scores next to classic experiment tracking runs.
  • OpenAI Evals - Reference harness plus registry spanning reasoning, extraction, and safety evals.
  • OpenCompass - Research harness with CascadeEvaluator, CompassRank syncing, and LLM-as-judge utilities.
  • Prompt Flow - Flow builder with built-in evaluation DAGs, dataset runners, and CI hooks.
  • Promptfoo - Local-first CLI and dashboard for evaluating prompts, RAG flows, and agents with cost tracking and regression detection.
  • Ragas - Evaluation library that grades answers, context, and grounding with pluggable scorers.
  • TruLens - Feedback function framework for chains and agents with customizable judge models.
  • W&B Weave Evaluations - Managed evaluation orchestrator with dataset versioning and dashboards.
  • ZenML - Pipeline framework that bakes evaluation steps and guardrail metrics into LLM workflows.
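
Most of these frameworks converge on the same workflow: wrap a model call in a test case, score the output with a metric or judge, and fail the run when a threshold is missed. A minimal, framework-agnostic sketch of that pattern in pytest, where `call_model` is a placeholder for your own client and a simple keyword check stands in for a real metric or judge model:

```python
# Minimal pytest-style eval sketch. `call_model` is a placeholder for your
# own LLM client; the keyword check stands in for a real metric or judge.
import pytest


def call_model(prompt: str) -> str:
    """Placeholder: swap in your actual model call."""
    raise NotImplementedError


CASES = [
    # (input prompt, substrings the answer is expected to mention)
    ("What is the capital of France?", ["Paris"]),
    ("Name the largest planet in the solar system.", ["Jupiter"]),
]


@pytest.mark.parametrize("prompt,required", CASES)
def test_answer_mentions_required_facts(prompt, required):
    answer = call_model(prompt)
    missing = [term for term in required if term.lower() not in answer.lower()]
    assert not missing, f"Answer is missing expected facts: {missing}"
```

Frameworks such as DeepEval, Promptfoo, and Ragas keep this assert-style workflow but replace the keyword check with richer scorers (LLM judges, embedding similarity, toxicity classifiers).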

Application and Agent Harnesses

  • Braintrust - Hosted evaluation workspace with CI-style regression tests, agent sandboxes, and token cost tracking.
  • LangSmith - Hosted tracing plus datasets, batched evals, and regression gating for LangChain apps.
  • W&B Prompt Registry - Prompt evaluation templates with reproducible scoring and reviews.

RAG and Retrieval

RAG Frameworks

  • EvalScope RAG - Guides and templates that extend Ragas-style metrics with domain rubrics.
  • LlamaIndex Evaluation - Modules for replaying queries, scoring retrievers, and comparing query engines.
  • Open RAG Eval - Vectara harness with pluggable datasets for comparing retrievers and prompts.
  • RAGEval - Framework that auto-generates corpora, questions, and RAG rubrics for completeness.
  • R-Eval - Toolkit for robust RAG scoring aligned with the Evaluation of RAG survey taxonomy.

Retrieval Benchmarks

  • BEIR - Benchmark suite covering dense, sparse, and hybrid retrieval tasks; see the recall@k/MRR sketch after this list.
  • ColBERT - Late-interaction dense retriever with evaluation scripts for IR datasets.
  • MTEB - Embeddings benchmark measuring retrieval, reranking, and similarity quality.
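
Most retrieval leaderboards reduce to a handful of rank-based metrics. A self-contained sketch of recall@k and mean reciprocal rank (MRR) over ranked document IDs, assuming nothing beyond the standard library:

```python
# Rank-based retrieval metrics over ranked lists of document IDs.


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mean_reciprocal_rank(queries: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant document per query (0 if none found)."""
    total = 0.0
    for ranked_ids, relevant_ids in queries:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(queries) if queries else 0.0


print(recall_at_k(["d3", "d1", "d9"], {"d1", "d7"}, k=2))  # 0.5
```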

RAG Datasets and Surveys

Prompt Evaluation & Safety

  • AlpacaEval - Automated instruction-following evaluator with length-controlled LLM judge scoring (a position-swapped pairwise-judge sketch follows this list).
  • ChainForge - Visual IDE for comparing prompts, sampling models, and scoring batches with rubrics.
  • Guardrails AI - Declarative validation framework that enforces schemas, correction chains, and judgments.
  • Lakera Guard - Hosted prompt security platform with red-team datasets for jailbreak and injection testing.
  • PromptBench - Benchmark suite for adversarial prompt stress tests across diverse tasks.
  • Red Teaming Handbook - Microsoft playbook for adversarial prompt testing and mitigation patterns.
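
A recurring detail in judge-based evaluators such as AlpacaEval is controlling for position bias: each pair is judged twice with the answer order swapped, and only consistent verdicts count. A sketch of that pattern, where `judge` is a placeholder for a real judge-model call that returns "A" or "B":

```python
# Position-swapped pairwise comparison, a common LLM-as-judge pattern.
# `judge` is a placeholder: it should ask a judge model which answer is better.


def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Placeholder: return "A" or "B" from a judge-model comparison prompt."""
    raise NotImplementedError


def pairwise_winner(question: str, candidate: str, baseline: str) -> str:
    """Judge twice with the order swapped; only count consistent verdicts."""
    first = judge(question, candidate, baseline)   # candidate shown as A
    second = judge(question, baseline, candidate)  # candidate shown as B
    if first == "A" and second == "B":
        return "candidate"
    if first == "B" and second == "A":
        return "baseline"
    return "tie"  # the verdict flipped with position, so treat it as inconclusive
```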

Datasets and Methodology


Platforms

Open Source Platforms

  • Agenta - End-to-end LLM developer platform for prompt engineering, evaluation, and deployment.
  • Arize Phoenix - OpenTelemetry-native observability and evaluation toolkit for RAG, LLMs, and agents.
  • DocETL - ETL system for complex document processing with LLMs and built-in quality checks.
  • Giskard - Testing framework for ML models with vulnerability scanning and LLM-specific detectors.
  • Helicone - Open-source LLM observability platform with cost tracking, caching, and evaluation tools.
  • Langfuse - Open-source LLM engineering platform providing tracing, eval dashboards, and prompt analytics.
  • Lilac - Data curation tool for exploring and enriching datasets with semantic search and clustering.
  • LiteLLM - Unified API for 100+ LLM providers with cost tracking, fallbacks, and load balancing.
  • Lunary - Production toolkit for LLM apps with tracing, prompt management, and evaluation pipelines.
  • Mirascope - Python toolkit for building LLM applications with structured outputs and evaluation utilities.
  • OpenLIT - Telemetry instrumentation for LLM apps with built-in quality metrics and guardrail hooks.
  • OpenLLMetry - OpenTelemetry instrumentation for LLM traces that feed any backend or custom eval logic (see the tracing sketch after this list).
  • Opik - Self-hostable evaluation and observability hub with datasets, scoring jobs, and interactive traces.
  • Rhesis - Collaborative testing platform with automated test generation and multi-turn conversation simulation for LLM and agentic applications.
  • traceAI - Open-source multi-modal tracing and diagnostics framework for LLM, RAG, and agent workflows built on OpenTelemetry.
  • UpTrain - OSS/hosted evaluation suite with 20+ checks, RCA tooling, and LlamaIndex integrations.
  • VoltAgent - TypeScript agent framework paired with VoltOps for trace inspection and regression testing.
  • Zeno - Data-centric evaluation UI for slicing failures, comparing prompts, and debugging retrieval quality.
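
Several of these platforms (Arize Phoenix, OpenLIT, OpenLLMetry, traceAI) are built on OpenTelemetry, so the raw material for later evaluation is simply spans with attributes. A minimal sketch using the standard opentelemetry-sdk console exporter; the `llm.*` attribute names are illustrative rather than a fixed semantic convention:

```python
# Minimal OpenTelemetry setup that prints LLM-call spans to the console.
# Requires: pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-eval-demo")


def traced_completion(prompt: str) -> str:
    # Wrap the model call in a span; evaluators can read these attributes later.
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("llm.prompt", prompt)  # illustrative attribute name
        answer = "stubbed answer"                 # replace with a real model call
        span.set_attribute("llm.completion", answer)
        return answer


traced_completion("What does recall@k measure?")
```

Swapping the console exporter for an OTLP exporter is the usual way these traces reach a hosted or self-hosted backend.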

Hosted Platforms

  • ChatIntel - Conversation analytics platform for evaluating chatbot quality, sentiment, and user satisfaction.
  • Confident AI - DeepEval-backed platform for scheduled eval suites, guardrails, and production monitors.
  • Datadog LLM Observability - Datadog module capturing LLM traces, metrics, and safety signals.
  • Deepchecks LLM Evaluation - Managed eval suites with dataset versioning, dashboards, and alerting.
  • Eppo - Experimentation platform with AI-specific evaluation metrics and statistical rigor for LLM A/B testing.
  • Future AGI - Multi-modal evaluation, simulation, and optimization platform for reliable AI systems across software and hardware.
  • Galileo - Evaluation and data-curation studio with labeling, slicing, and issue triage.
  • HoneyHive - Evaluation and observability platform with prompt versioning, A/B testing, and fine-tuning workflows.
  • Humanloop - Production prompt management with human-in-the-loop evals and annotation queues.
  • Maxim AI - Evaluation and observability platform focusing on agent simulations and monitoring.
  • Orq.ai - LLM operations platform with prompt management, evaluation workflows, and deployment pipelines.
  • PostHog LLM Analytics - Product analytics toolkit extended to track custom LLM events and metrics.
  • PromptLayer - Prompt engineering platform with version control, evaluation tracking, and team collaboration.

Cloud Platforms


Benchmarks

General

  • AGIEval - Human-centric standardized exams spanning entrance tests, legal, and math scenarios.
  • BIG-bench - Collaborative benchmark probing reasoning, commonsense, and long-tail tasks.
  • CommonGen-Eval - GPT-4 judged CommonGen-lite suite for constrained commonsense text generation.
  • DyVal - Dynamic reasoning benchmark that varies difficulty and graph structure to stress models.
  • LM Evaluation Harness - Standard harness for scoring autoregressive models on dozens of tasks (a multiple-choice scoring sketch follows this list).
  • LLM-Uncertainty-Bench - Adds uncertainty-aware scoring across question answering, reading comprehension, inference, dialogue, and summarization.
  • LLMBar - Meta-eval testing whether LLM judges can spot instruction-following failures.
  • LV-Eval - Long-context suite with five length tiers up to 256K tokens and distraction controls.
  • MMLU - Massive multitask language understanding benchmark for academic and professional subjects.
  • MMLU-Pro - Harder 10-choice extension focused on reasoning-rich, low-leakage questions.
  • PertEval - Knowledge-invariant perturbations to debias multiple-choice accuracy inflation.
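
Harnesses such as LM Evaluation Harness commonly score multiple-choice benchmarks (MMLU and its relatives) by comparing the model's log-likelihood of each answer option, often with length normalization. A sketch of that selection step, where `loglikelihood` is a hypothetical stand-in for a harness's model interface:

```python
# Multiple-choice scoring by per-option log-likelihood, the usual pattern
# behind MMLU-style accuracy. `loglikelihood` is a hypothetical stand-in for
# a harness's model interface returning log P(continuation | context).


def loglikelihood(context: str, continuation: str) -> float:
    """Placeholder: sum of token log-probs of `continuation` given `context`."""
    raise NotImplementedError


def pick_choice(question: str, choices: list[str], length_normalize: bool = True) -> int:
    """Return the index of the highest-scoring answer option."""
    scores = []
    for choice in choices:
        score = loglikelihood(question, " " + choice)
        if length_normalize:
            score /= max(len(choice), 1)  # crude per-character normalization
        scores.append(score)
    return max(range(len(choices)), key=scores.__getitem__)
```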

Domain

  • FinEval - Chinese financial QA and reasoning benchmark across regulation, accounting, and markets.
  • HumanEval - Unit-test-based benchmark for code synthesis and docstring reasoning (see the pass@k sketch after this list).
  • LAiW - Legal benchmark covering retrieval, foundation inference, and complex case applications in Chinese law.
  • MATH - Competition-level math benchmark targeting multi-step symbolic reasoning.
  • MBPP - Mostly Basic Programming Problems benchmark for small coding tasks.
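
Code benchmarks such as HumanEval and MBPP report pass@k. The unbiased estimator from the HumanEval paper, for n samples per problem of which c pass the unit tests, is pass@k = 1 - C(n-c, k) / C(n, k); a direct implementation:

```python
# Unbiased pass@k estimator from the HumanEval paper:
# pass@k = 1 - C(n - c, k) / C(n, k), for n samples with c passing.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every k-sized draw must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=20, c=3, k=1))   # 0.15
print(pass_at_k(n=20, c=3, k=10))  # ~0.89
```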

Agent

  • AgentBench - Evaluates LLMs acting as agents across simulated domains like games and coding.
  • GAIA - Tool-use benchmark requiring grounded reasoning with live web access and planning.
  • MetaTool Tasks - Tool-calling benchmark and eval harness for agents built around LLaMA models (a tool-call scoring sketch follows this list).
  • SuperCLUE-Agent - Chinese agent eval covering tool use, planning, long/short-term memory, and APIs.
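
Tool-use benchmarks ultimately need a scorer that checks whether the agent picked the right tool with the right arguments. A minimal exact-match sketch over parsed tool calls; the dictionary format is illustrative rather than any specific benchmark's schema:

```python
# Minimal tool-call scorer: exact match on tool name plus the expected arguments.
# The dict format here is illustrative, not a specific benchmark's schema.


def score_tool_call(predicted: dict, expected: dict) -> float:
    """1.0 if the tool name and every expected argument match, else 0.0."""
    if predicted.get("tool") != expected["tool"]:
        return 0.0
    pred_args = predicted.get("arguments", {})
    ok = all(pred_args.get(key) == value for key, value in expected["arguments"].items())
    return 1.0 if ok else 0.0


expected = {"tool": "get_weather", "arguments": {"city": "Tokyo"}}
predicted = {"tool": "get_weather", "arguments": {"city": "Tokyo", "units": "metric"}}
print(score_tool_call(predicted, expected))  # 1.0: extra arguments are tolerated here
```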

Safety

  • AdvBench - Adversarial prompt benchmark for jailbreak and misuse resistance measurement.
  • BBQ - Bias-sensitive QA sets measuring stereotype reliance and ambiguous cases.
  • ToxiGen - Toxic language generation and classification benchmark for robustness checks.
  • TruthfulQA - Measures factuality and hallucination propensity via adversarially written questions.

Leaderboards


Resources

Guides & Training

Examples

Related Collections

  • Awesome ChainForge - Ecosystem list centered on ChainForge experiments and extensions.
  • Awesome-LLM-Eval - Cross-lingual (Chinese) compendium of eval tooling, papers, datasets, and leaderboards.
  • Awesome LLMOps - Curated tooling for training, deployment, and monitoring of LLM apps.
  • Awesome Machine Learning - Language-specific ML resources that often host evaluation building blocks.
  • Awesome RAG - Broad coverage of retrieval-augmented generation techniques and tools.
  • Awesome Self-Hosted - Massive catalog of self-hostable software, including observability stacks.
  • GenAI Notes - Continuously updated notes and resources on GenAI systems, evaluation, and operations.

Licensing

Released under the CC0 1.0 Universal license.


Contributing

Contributions are welcome—please read CONTRIBUTING.md for scope, entry rules, and the pull-request checklist before submitting updates.

✌️