Benchmark
Standardized evaluation
A standardized exam used to compare models on a single metric. So you can say "model X scored Y on task Z."
To call an AI model "good," you need a way to measure it. A benchmark is a standardized exam: every model answers the same questions and is scored by the same rules. That makes models comparable.
Common benchmarks:
- MMLU (Massive Multitask Language Understanding): multiple-choice questions across 57 subjects (math, law, medicine, etc.)
- HumanEval / MBPP: code; complete a Python function
- GSM8K / MATH: math word problems
- HellaSwag: commonsense sentence completion
- BBH (BIG-Bench Hard): 23 hard reasoning tasks
- GPQA: PhD-level science questions
- SWE-bench: real bug fixes from GitHub issues
- Chatbot Arena (LMSys): human pairwise voting (most trusted, hardest to game)
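Mechanically, most static benchmarks boil down to the same loop: show the model a question, extract its answer, compare against the answer key, and report accuracy. Below is a minimal sketch of that loop in the MMLU multiple-choice style; `ask_model` is a hypothetical stand-in for whatever model API you actually call, and the sample item is made up.

```python
# Minimal multiple-choice scoring loop (MMLU-style).
# `ask_model` is a hypothetical placeholder for your model API.

QUESTIONS = [
    {
        "question": "Which planet is closest to the Sun?",
        "choices": {"A": "Venus", "B": "Mercury", "C": "Mars", "D": "Earth"},
        "answer": "B",
    },
    # a real benchmark has thousands of items per subject
]

def ask_model(prompt: str) -> str:
    """Placeholder: call the model under test and return its raw reply."""
    return "B"

def accuracy(questions) -> float:
    correct = 0
    for q in questions:
        options = "\n".join(f"{k}. {v}" for k, v in q["choices"].items())
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        reply = ask_model(prompt).strip().upper()
        # Treat the first A-D letter in the reply as the chosen option.
        picked = next((ch for ch in reply if ch in q["choices"]), None)
        correct += int(picked == q["answer"])
    return correct / len(questions)

print(f"accuracy: {accuracy(QUESTIONS):.1%}")
```

Real harnesses add few-shot prompting, stricter answer extraction, and per-subject breakdowns, but the core is this accuracy loop.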
New models race up these scoreboards. "GPT-5 hits 92 on MMLU" is the standard shorthand for "this is how smart it is."
It's like university entrance exams. On the SAT or GRE, everyone solves the same problems and the scores can be ranked; a standardized exam is the only objective way to compare candidates. Benchmarks are AI's SAT.
When DeepSeek V3 launched (late 2024), its benchmarks:
- MMLU: 88.5 (GPT-4o: 88.7, basically tied)
- GSM8K: 89.3 (math)
- HumanEval: 82.6 (code)
- AlignBench: highest (Chinese alignment)
But in Chatbot Arena's human voting it ranked lower, because scoring well on benchmarks is not the same as producing answers users actually like. In everyday use, Claude and GPT often feel more natural.
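Chatbot Arena's rankings come from pairwise human votes rather than fixed answer keys. One common way to turn such votes into a leaderboard is an Elo-style rating update; this is only an illustration of the idea, not the Arena's exact statistical method (which uses more careful model fitting).

```python
# Turn pairwise "A beat B" votes into ratings with a basic Elo update.
# Illustrative only; vote data and model names are made up.

from collections import defaultdict

K = 32  # step size of each rating update

def expected_win(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings, winner: str, loser: str) -> None:
    e = expected_win(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e)
    ratings[loser] -= K * (1 - e)

ratings = defaultdict(lambda: 1000.0)
votes = [("model_a", "model_b"), ("model_c", "model_a"), ("model_c", "model_b")]
for winner, loser in votes:
    record_vote(ratings, winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```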
Benchmarks are a starting filter; choosing a model for production still requires your own eval set.

When benchmarks help:
- Pre-filtering when selecting a model
- Understanding a new model's capabilities (MMLU 50 → 90 is a huge gap)
- Building a domain-specific bench for your company's tasks (see the sketch below)
- Comparing fine-tuned vs base models
When benchmarks mislead:
- Picking a model on benchmarks alone (production behaves differently)
- Measuring user satisfaction with a benchmark (satisfaction is subjective)
- Trusting old benchmarks (the model may have seen them: contamination)
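Building "your own eval set" does not have to be elaborate: a few dozen real prompts from your domain plus a pass/fail check per prompt already beats choosing from a public leaderboard alone. A small sketch, where the prompts and the keyword-based checks are hypothetical placeholders for whatever grading rule fits your task:

```python
# Tiny domain-specific eval: real prompts from your product, each with a
# simple pass/fail check. All prompts and checks here are made up.

EVAL_SET = [
    {"prompt": "Summarize this refund policy in two sentences: ...",
     "must_include": ["refund", "14 days"]},
    {"prompt": "Draft a polite reply to a late-delivery complaint: ...",
     "must_include": ["apolog"]},  # matches "apologize" / "apology"
]

def ask_model(prompt: str) -> str:
    """Placeholder: call the model you are evaluating."""
    return "We apologize for the delay and will ship a replacement."

def pass_rate(eval_set) -> float:
    passed = 0
    for item in eval_set:
        answer = ask_model(item["prompt"]).lower()
        if all(kw in answer for kw in item["must_include"]):
            passed += 1
    return passed / len(eval_set)

print(f"pass rate: {pass_rate(EVAL_SET):.0%}")
```

Swap the keyword checks for exact match, regexes, unit tests, or an LLM judge depending on the task; the point is that the items come from your real workload.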
Benchmark contamination (test questions in training data)
MMLU was published in 2020, and GPT-4 was trained on web data, so test items were likely in the training set and scores get inflated. New benchmarks such as GPQA emerge, and the gold standard keeps shifting.
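A rough way to probe contamination is n-gram overlap: check whether long word sequences from the test items appear verbatim in the training corpus. A toy sketch of the idea (real checks run over terabytes of text and use smarter matching; the corpus and question below are invented):

```python
# Toy contamination check: flag a test item if any of its 8-word
# sequences appears verbatim in the training corpus.

def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(test_item: str, corpus_ngrams: set, n: int = 8) -> bool:
    return bool(ngrams(test_item, n) & corpus_ngrams)

training_corpus = "the quick brown fox jumps over the lazy dog near the river bank"
test_question = "Which animal jumps over the lazy dog near the river bank today?"

corpus_ngrams = ngrams(training_corpus)
print(looks_contaminated(test_question, corpus_ngrams))  # True: overlap found
```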
Single-metric focus
A model can score 92 on MMLU while users reject its answers in production. MMLU doesn't measure whether answers are helpful, harmless, or honest, so evaluation needs to be multi-dimensional.
Synthetic vs real-task gap
Multiple-choice exams don't reflect real-world work. SWE-bench (real GitHub issues) and Chatbot Arena (human pairwise votes) are more trustworthy, but harder to run and score.
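For contrast with a multiple-choice score, a SWE-bench-style check is pass/fail against a real repository: apply the model's proposed patch, then run the project's own test suite. A rough sketch of that flow using git and pytest; the repository path and patch file name are hypothetical.

```python
# Rough SWE-bench-style check: apply a model-generated patch to a
# checked-out repo and see whether the test suite passes.
# Paths and file names are hypothetical.

import subprocess

def apply_and_test(repo_dir: str, patch_file: str) -> bool:
    # Try to apply the model's proposed fix.
    applied = subprocess.run(
        ["git", "apply", patch_file],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly
    # Run the test suite; exit code 0 means every test passed.
    tests = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return tests.returncode == 0

print(apply_and_test("path/to/checked-out-repo", "model_fix.patch"))
```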