Public Preview

Evals, Benchmarks, and Metrics

Definition

Evaluation frameworks and benchmarks quantify model quality, safety, and performance across tasks and modalities.

Why It Matters

  • Enables model selection and regression detection
  • Informs cost–quality tradeoffs and capacity planning
  • Essential for safety monitoring and compliance

2025 State of the Art

  • Text: MMLU, MT-Bench, HumanEval, BIG-bench, and HELM-style leaderboards
  • Image: ImageNet zero-shot, VQAv2, DocVQA; diffusion quality via FID, CLIPScore, and human evaluation
  • Video: VBench, T2VBench; temporal consistency and instruction following
  • Audio/Speech/Music: ASR word error rate (WER) on LibriSpeech, MOS for TTS, MusicCaps and subjective listening panels
  • Multimodal: MMMU, MathVista, ChartQA; human evaluation at scale with structured annotation
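Several of the audio metrics above reduce to edit distance: WER is the word-level Levenshtein distance between hypothesis and reference, divided by the reference length. A minimal sketch in plain Python, not tied to any particular ASR toolkit:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

Production pipelines typically also normalize case and punctuation before scoring; that step is omitted here for brevity.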

Key Players

  • LMSys (Chatbot Arena), OpenAI (Evals), Hugging Face (Open LLM Leaderboard)
  • Google/DeepMind, Anthropic, Meta; academic labs

Challenges

  • Prompt leakage and overfitting to public benchmarks
  • Weak correlation between automatic metrics and human quality judgments
  • Reproducibility gaps across toolchains and decoding settings
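The metric–human correlation problem can be quantified directly by computing a rank correlation between automatic scores and human ratings over the same outputs. A minimal Spearman sketch (simple ranks, no tie averaging; `metric_scores` and `human_ratings` are hypothetical inputs):

```python
from statistics import mean

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Assumes no ties; tied values would need averaged ranks."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for pos, i in enumerate(order):
            r[i] = pos + 1.0
        return r

    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical usage: one automatic score and one human rating per output.
metric_scores = [0.71, 0.42, 0.88, 0.55]
human_ratings = [4.0, 2.0, 5.0, 3.0]
rho = spearman(metric_scores, human_ratings)
```

A low rho over a representative sample is a concrete signal that the automatic metric should not gate releases on its own.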

Reference Architectures

  • CI eval pipeline with curated suites, randomization, and seeded decoding
  • Human-in-the-loop annotation and rubric-based scoring
  • Telemetry for live quality (win rates, complaint rate)
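A CI eval pipeline with seeded decoding and suite randomization can be sketched as follows. The `stub_model`, the prompts, and the report fields are illustrative assumptions, not any particular harness's API:

```python
import hashlib
import random

def run_suite(model_fn, suite, seed=1234, temperature=0.0):
    """Run an eval suite with seeded shuffling and logged decoding
    parameters so CI runs are reproducible."""
    rng = random.Random(seed)
    items = list(suite)
    rng.shuffle(items)  # randomized order, but deterministic given the seed
    passes = 0
    for prompt, expected in items:
        output = model_fn(prompt, temperature=temperature, seed=seed)
        passes += output.strip() == expected.strip()
    return {
        "seed": seed,
        "temperature": temperature,
        # hash the suite so silent drift in the eval set itself is detectable
        "suite_hash": hashlib.sha256(repr(sorted(suite)).encode()).hexdigest()[:12],
        "pass_rate": passes / len(items),
    }

# Stub "model" for illustration; a real harness would call an inference API.
_answers = {"2+2": "4", "capital of France": "Paris"}

def stub_model(prompt, temperature, seed):
    return _answers.get(prompt, "")

suite = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
report = run_suite(stub_model, suite)
```

Logging the seed, decoding parameters, and a suite hash alongside the pass rate is what makes a regression reproducible rather than anecdotal.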

Opportunities

  • Task-specific evals (domain correctness and safety)
  • Multi-turn and tool-use evals beyond single-turn chat
  • Multimodal ground-truth creation and automated raters

Design Checklist & Acceptance Criteria

  • Define win-rate targets vs. baselines (A/B)
  • Separate dev/test to avoid overfitting; rotate hidden sets
  • Track decoding parameters and seeds for reproducibility
  • Add safety evals (jailbreak resistance, sensitive-content categories)
  • Report costs/latency alongside quality
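A win-rate target from the checklist is only meaningful with an uncertainty estimate; a percentile-bootstrap confidence interval is one common choice. A sketch under assumed defaults (outcome coding, 2,000 resamples):

```python
import random

def win_rate_ci(outcomes, n_boot=2000, seed=0, alpha=0.05):
    """Point estimate and percentile-bootstrap (1 - alpha) CI for a pairwise
    win rate. Coding: 1.0 = candidate wins, 0.0 = baseline wins, 0.5 = tie."""
    rng = random.Random(seed)  # seeded so CI reruns are reproducible
    n = len(outcomes)
    point = sum(outcomes) / n
    boots = sorted(
        sum(outcomes[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = boots[int(alpha / 2 * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi
```

If the interval's lower bound clears the target (e.g. 0.5 for parity with the baseline), the A/B result supports shipping; a point estimate alone does not.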

References

  • MT-Bench and Chatbot Arena. LMSys. https://lmsys.org/blog/2023-05-03-arena/ (accessed 2025-08-14, provider-reported)
  • OpenAI Evals. OpenAI. https://github.com/openai/evals (accessed 2025-08-14, provider-reported)
  • Hugging Face Open LLM Leaderboard. Hugging Face. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (accessed 2025-08-14, provider-reported)