Public Preview

Evals, Benchmarks, and Metrics

Definition

Evaluation frameworks and benchmarks quantify model quality, safety, and performance across tasks and modalities.

Why It Matters

  • Enables model selection and regression detection
  • Informs cost–quality tradeoffs and capacity planning
  • Essential for safety monitoring and compliance

2025 State of the Art

  • Text: MMLU, MT-Bench, HumanEval, BIG-bench, and HELM-style leaderboards
  • Image: ImageNet zero-shot, VQAv2, DocVQA; diffusion quality via FID, CLIPScore, and human evaluation
  • Video: VBench, T2VBench; temporal consistency and instruction following
  • Audio/Speech/Music: ASR word error rate (WER) on LibriSpeech, MOS for TTS, MusicCaps and subjective listening panels
  • Multimodal: MMMU, MathVista, ChartQA; human evaluation at scale with structured annotation
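Several of the audio metrics above reduce to edit distance: WER is the word-level Levenshtein distance between hypothesis and reference, divided by the reference length. A minimal sketch in plain Python, not tied to any particular ASR toolkit:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

Production pipelines typically also normalize case and punctuation before scoring; that step is omitted here for brevity.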

Key Players

  • LMSys (Chatbot Arena), OpenAI (Evals), Hugging Face (Open LLM Leaderboard)
  • Google/DeepMind, Anthropic, Meta; academic labs

Challenges

  • Prompt leakage and overfitting to public benchmarks
  • Weak correlation between automatic metrics and human quality judgments
  • Reproducibility gaps across toolchains and decoding settings
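The metric–human correlation problem can be quantified directly by computing a rank correlation between automatic scores and human ratings over the same outputs. A minimal Spearman sketch (simple ranks, no tie averaging; `metric_scores` and `human_ratings` are hypothetical inputs):

```python
from statistics import mean

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Assumes no ties; tied values would need averaged ranks."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for pos, i in enumerate(order):
            r[i] = pos + 1.0
        return r

    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical usage: one automatic score and one human rating per output.
metric_scores = [0.71, 0.42, 0.88, 0.55]
human_ratings = [4.0, 2.0, 5.0, 3.0]
rho = spearman(metric_scores, human_ratings)
```

A low rho over a representative sample is a concrete signal that the automatic metric should not gate releases on its own.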

Reference Architectures

  • CI eval pipeline with curated suites, randomization, and seeded decoding
  • Human-in-the-loop annotation and rubric-based scoring
  • Telemetry for live quality (win rates, complaint rate)
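A CI eval pipeline with seeded decoding and suite randomization can be sketched as follows. The `stub_model`, the prompts, and the report fields are illustrative assumptions, not any particular harness's API:

```python
import hashlib
import random

def run_suite(model_fn, suite, seed=1234, temperature=0.0):
    """Run an eval suite with seeded shuffling and logged decoding
    parameters so CI runs are reproducible."""
    rng = random.Random(seed)
    items = list(suite)
    rng.shuffle(items)  # randomized order, but deterministic given the seed
    passes = 0
    for prompt, expected in items:
        output = model_fn(prompt, temperature=temperature, seed=seed)
        passes += output.strip() == expected.strip()
    return {
        "seed": seed,
        "temperature": temperature,
        # hash the suite so silent drift in the eval set itself is detectable
        "suite_hash": hashlib.sha256(repr(sorted(suite)).encode()).hexdigest()[:12],
        "pass_rate": passes / len(items),
    }

# Stub "model" for illustration; a real harness would call an inference API.
_answers = {"2+2": "4", "capital of France": "Paris"}

def stub_model(prompt, temperature, seed):
    return _answers.get(prompt, "")

suite = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
report = run_suite(stub_model, suite)
```

Logging the seed, decoding parameters, and a suite hash alongside the pass rate is what makes a regression reproducible rather than anecdotal.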

Opportunities

  • Task-specific evals (domain correctness and safety)
  • Multi-turn and tool-use evals beyond single-turn chat
  • Multimodal ground-truth creation and automated raters

Design Checklist & Acceptance Criteria

  • Define win-rate targets vs. baselines (A/B)
  • Separate dev/test to avoid overfitting; rotate hidden sets
  • Track decoding parameters and seeds for reproducibility
  • Add safety evals (jailbreak resistance, sensitive-content categories)
  • Report costs/latency alongside quality
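A win-rate target from the checklist is only meaningful with an uncertainty estimate; a percentile-bootstrap confidence interval is one common choice. A sketch under assumed defaults (outcome coding, 2,000 resamples):

```python
import random

def win_rate_ci(outcomes, n_boot=2000, seed=0, alpha=0.05):
    """Point estimate and percentile-bootstrap (1 - alpha) CI for a pairwise
    win rate. Coding: 1.0 = candidate wins, 0.0 = baseline wins, 0.5 = tie."""
    rng = random.Random(seed)  # seeded so CI reruns are reproducible
    n = len(outcomes)
    point = sum(outcomes) / n
    boots = sorted(
        sum(outcomes[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = boots[int(alpha / 2 * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi
```

If the interval's lower bound clears the target (e.g. 0.5 for parity with the baseline), the A/B result supports shipping; a point estimate alone does not.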

References

  • MT-Bench and Chatbot Arena. LMSys. https://lmsys.org/blog/2023-05-03-arena/ (accessed 2025-08-14, provider-reported)
  • OpenAI Evals. OpenAI. https://github.com/openai/evals (accessed 2025-08-14, provider-reported)
  • Hugging Face Open LLM Leaderboard. Hugging Face. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (accessed 2025-08-14, provider-reported)