Evals, Benchmarks, and Metrics
Definition
Evaluation frameworks and benchmarks quantify model quality, safety, and performance across tasks and modalities.
Why It Matters
- Enables model selection and regression detection
- Informs cost–quality tradeoffs and capacity planning
- Essential for safety monitoring and compliance
2025 State of the Art
- Text: MMLU, MT-Bench, HumanEval, BIG-bench, HELM-style leaderboards
- Image: ImageNet zero-shot, VQAv2, DocVQA; diffusion quality via FID, CLIPScore, and human evaluation
- Video: VBench, T2VBench; temporal consistency and instruction following
- Audio/Speech/Music: ASR word error rate (WER) on LibriSpeech, mean opinion score (MOS) for TTS, MusicCaps/subjective panels
- Multimodal: MMMU, MathVista, ChartQA; human evaluation at scale via annotation pipelines
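As a concrete example of the metrics above, ASR word error rate is the word-level Levenshtein edit distance (substitutions + deletions + insertions) divided by the reference length. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# "sat"->"sit" is a substitution, the second "the" is a deletion: 2 errors / 6 words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```

Production toolkits normalize text (casing, punctuation) before scoring, which materially changes reported WER; comparisons are only valid under one normalization scheme.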
Key Players
- LMSys, OpenAI Evals, Hugging Face Open LLM Leaderboard
- Google/DeepMind, Anthropic, Meta; academic labs
Challenges
- Prompt leakage and overfitting to public benchmarks
- Weak correlation between automatic metrics and human judgments of quality
- Reproducibility across toolchains and decoding settings
Reference Architectures
- CI eval pipeline with curated suites, randomization, and seeded decoding
- Human-in-the-loop annotation and rubric-based scoring
- Telemetry for live quality (win rates, complaint rate)
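The CI-pipeline pattern above hinges on reproducibility: every run should pin its decoding parameters and seed, and stamp results with a config hash so regressions can be traced to exact settings. A minimal sketch, where `model_fn` is a hypothetical stand-in for your model call:

```python
import hashlib
import json
import random

def run_eval(model_fn, suite, decoding_params, seed=0):
    """Run a prompt suite under fixed decoding params and a seed; stamp the
    results with a hash of the config so outputs are traceable to settings."""
    config = {"decoding": decoding_params, "seed": seed}
    config_hash = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:12]
    rng = random.Random(seed)
    order = list(range(len(suite)))
    rng.shuffle(order)  # randomize presentation order, reproducibly
    results = [
        {"prompt_id": i, "output": model_fn(suite[i], rng, **decoding_params)}
        for i in order
    ]
    return {"config_hash": config_hash, "results": results}

# Usage with a stub model: identical seeds and params give identical runs.
stub = lambda prompt, rng, temperature: f"{prompt}|t={temperature}|{rng.random():.3f}"
a = run_eval(stub, ["q1", "q2", "q3"], {"temperature": 0.7}, seed=42)
b = run_eval(stub, ["q1", "q2", "q3"], {"temperature": 0.7}, seed=42)
assert a == b
```

Note that seeding the harness does not by itself make a remote model API deterministic; the config hash still lets you detect when two "identical" runs were in fact configured differently.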
Opportunities
- Task-specific evals (domain correctness and safety)
- Multi-turn and tool-use evals beyond single-turn chat
- Multimodal ground-truth creation and automated raters
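Arena-style leaderboards and automated pairwise raters typically aggregate head-to-head battles into ratings. A minimal sketch of the standard Elo update (assuming the conventional 400-point scale and K=32; Chatbot Arena itself uses more robust statistical aggregation):

```python
def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """One Elo update from a pairwise battle.
    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (outcome - expected_a)
    r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two equal-rated models: a win moves the winner up by k/2 and the loser down by k/2.
a, b = elo_update(1000.0, 1000.0, 1.0)
```

Total rating is conserved across an update, so a model's rating only rises by consistently beating opponents, not by playing more games.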
Design Checklist & Acceptance Criteria
- Define win-rate targets vs. baselines (A/B)
- Separate dev/test to avoid overfitting; rotate hidden sets
- Track decoding params and seed for reproducibility
- Add safety evals (jailbreak, sensitive categories)
- Report costs/latency alongside quality
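For the win-rate target in the checklist, report an interval rather than a point estimate, and gate promotion on the interval's lower bound clearing the target. A minimal sketch using a Wilson score interval (stdlib only; z=1.96 for ~95% coverage):

```python
import math

def win_rate_ci(wins: int, total: int, z: float = 1.96):
    """Observed win rate plus a Wilson score confidence interval."""
    if total == 0:
        return 0.0, (0.0, 1.0)
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return p, (center - half, center + half)

rate, (lo, hi) = win_rate_ci(60, 100)
# Gate on the lower bound, e.g. promote only if lo > 0.5 vs. the baseline.
```

With 60 wins out of 100 battles the lower bound sits just above 0.50, so a 60% observed win rate at this sample size is barely distinguishable from parity; small eval sets routinely fail such gates even when the candidate is genuinely better.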
References
- Title: MT-Bench and Chatbot Arena URL: https://lmsys.org/blog/2023-05-03-arena/ Publisher/Vendor: LMSys Accessed: 2025-08-14 Version_or_release: provider_reported
- Title: OpenAI Evals URL: https://github.com/openai/evals Publisher/Vendor: OpenAI Accessed: 2025-08-14 Version_or_release: provider_reported
- Title: Hugging Face Open LLM Leaderboard URL: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard Publisher/Vendor: Hugging Face Accessed: 2025-08-14 Version_or_release: provider_reported