☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications.
A comprehensive, implementation-focused guide to evaluating Large Language Models, RAG systems, and Agentic AI in production environments.
Comprehensive AI Evaluation Framework with advanced techniques including Temperature-Controlled Verdict Aggregation via Generalized Power Mean. Support for multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.
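The core of that aggregation step is the generalized power mean, M_p(x) = ((1/n) Σ x_i^p)^(1/p). A minimal Python sketch is below; how the temperature parameter maps onto the exponent p is an assumption made for illustration, not necessarily the framework's actual rule.

```python
import math

def power_mean(scores, p):
    """Generalized power mean M_p of positive verdict scores in (0, 1].

    p -> -inf approaches min (strictest), p = 1 is the arithmetic mean,
    p -> +inf approaches max (most lenient); p = 0 is the geometric mean.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if p == 0:
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1 / p)

def aggregate_verdicts(scores, temperature=1.0):
    # Assumed mapping: the temperature is used directly as the exponent p, so a
    # low temperature penalizes weak verdicts and a high one forgives them.
    return power_mean(scores, p=temperature)

judge_scores = [0.9, 0.7, 0.4]                            # per-claim verdicts from an LLM judge
print(aggregate_verdicts(judge_scores, temperature=0.5))  # stricter than the arithmetic mean
print(aggregate_verdicts(judge_scores, temperature=2.0))  # pulled toward the best verdict
```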
[NeurIPS 2025] AGI-Elo: How Far Are We From Mastering A Task?
prompt-evaluator is an open-source toolkit for evaluating, testing, and comparing LLM prompts. It provides a GUI-driven workflow for running prompt tests, tracking token usage, visualizing results, and checking reliability across providers such as OpenAI, Anthropic (Claude), and Google (Gemini).
Multi-dimensional evaluation of AI responses using semantic alignment, conversational flow, and engagement metrics.
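As an illustration of how such a multi-dimensional score can be assembled, the sketch below computes semantic alignment from sentence embeddings and blends it with flow and engagement scores; the weights and the placeholder scorers are assumptions, not the project's actual metrics.

```python
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

_model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_alignment(response: str, reference: str) -> float:
    """Cosine similarity between response and reference embeddings, in [-1, 1]."""
    emb = _model.encode([response, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

def combined_score(response: str, reference: str,
                   flow: float, engagement: float,
                   weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted blend of the three dimensions; flow and engagement are assumed
    to come from separate scorers, and the weights are purely illustrative."""
    w_align, w_flow, w_engage = weights
    return (w_align * semantic_alignment(response, reference)
            + w_flow * flow + w_engage * engagement)
```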
VerifyAI is a simple UI application for testing GenAI outputs.
Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.
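To show the general LLM-as-a-judge pattern a YAML-first framework like this builds on, here is a hedged Python sketch; the YAML fields, judge prompt, and model name are invented for illustration and are not Pondera's actual schema or API.

```python
import yaml                # pip install pyyaml
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical test case; the field names are illustrative, not Pondera's schema.
CASE_YAML = """
question: What is the capital of France?
expected: Paris
candidate_answer: The capital of France is Paris.
"""

def judge(case: dict) -> float:
    """Ask a judge model to grade the candidate answer on a 0-1 scale."""
    prompt = (
        f"Question: {case['question']}\n"
        f"Expected answer: {case['expected']}\n"
        f"Candidate answer: {case['candidate_answer']}\n"
        "Reply with only a number between 0 and 1 grading correctness."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return float(reply.choices[0].message.content.strip())

print(judge(yaml.safe_load(CASE_YAML)))
```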
A clinical-trial application for benchmarking AI responses in multi-turn mental health conversations. It guides users in understanding AI interaction patterns and working through personal mental health issues with therapeutic AI assistance.
Official public release of MirrorLoop Core (v1.3 – April 2025)
Web app & CLI for benchmarking LLMs via OpenRouter. Test multiple models simultaneously with custom benchmarks, live progress tracking, and detailed results analysis.
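The multi-model part of that workflow can be reproduced directly against OpenRouter's OpenAI-compatible endpoint; the sketch below fans one prompt out to several models in parallel. The model slugs, environment variable, and single-prompt "benchmark" are illustrative assumptions, not the app's own format.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI  # pip install openai

# OpenRouter exposes an OpenAI-compatible API; point the client at its base URL.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed env var name
)

MODELS = ["openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet", "google/gemini-flash-1.5"]
PROMPT = "In one sentence, explain what retrieval-augmented generation is."

def run(model: str) -> tuple[str, str]:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return model, reply.choices[0].message.content

# Fan the same prompt out to every model in parallel, then print the answers.
with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
    for model, answer in pool.map(run, MODELS):
        print(f"--- {model} ---\n{answer}\n")
```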