Benchmark
Standardized evaluation
A standardized exam used to compare models on a single metric. So you can say "model X scored Y on task Z."
To call an AI model "good," you need a way to measure it. A benchmark is a standardized exam: every model answers the same questions and is scored by the same rules. That makes models comparable.
Common benchmarks:
- MMLU (Massive Multitask Language Understanding): multiple-choice questions across 57 subjects (math, law, medicine, etc.)
- HumanEval / MBPP: code; complete a Python function
- GSM8K / MATH: math word problems
- HellaSwag: commonsense sentence completion
- BBH (BIG-Bench Hard): 23 hard reasoning tasks
- GPQA: PhD-level science questions
- SWE-bench: real bug fixes from GitHub issues
- Chatbot Arena (LMSys): human pairwise voting (most trusted, hardest to game)
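Mechanically, most static benchmarks boil down to the same loop: show the model a question, extract its answer, compare against the answer key, and report accuracy. Below is a minimal sketch of that loop in the MMLU multiple-choice style; `ask_model` is a hypothetical stand-in for whatever model API you actually call, and the sample item is made up.

```python
# Minimal multiple-choice scoring loop (MMLU-style).
# `ask_model` is a hypothetical placeholder for your model API.

QUESTIONS = [
    {
        "question": "Which planet is closest to the Sun?",
        "choices": {"A": "Venus", "B": "Mercury", "C": "Mars", "D": "Earth"},
        "answer": "B",
    },
    # a real benchmark has thousands of items per subject
]

def ask_model(prompt: str) -> str:
    """Placeholder: call the model under test and return its raw reply."""
    return "B"

def accuracy(questions) -> float:
    correct = 0
    for q in questions:
        options = "\n".join(f"{k}. {v}" for k, v in q["choices"].items())
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        reply = ask_model(prompt).strip().upper()
        # Treat the first A-D letter in the reply as the chosen option.
        picked = next((ch for ch in reply if ch in q["choices"]), None)
        correct += int(picked == q["answer"])
    return correct / len(questions)

print(f"accuracy: {accuracy(QUESTIONS):.1%}")
```

Real harnesses add few-shot prompting, stricter answer extraction, and per-subject breakdowns, but the core is this accuracy loop.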
New models race up these scoreboards. "GPT-5 hits 92 on MMLU" is the standard shorthand for "this is how smart it is."
It's like university entrance exams. On the SAT or GRE, everyone solves the same problems and the scores can be ranked; a standardized exam is the only objective way to compare candidates. Benchmarks are AI's SAT.
When DeepSeek V3 launched (late 2024), its benchmarks:
- MMLU: 88.5 (GPT-4o: 88.7, basically tied)
- GSM8K: 89.3 (math)
- HumanEval: 82.6 (code)
- AlignBench: highest (Chinese alignment)
But in Chatbot Arena's human voting it ranked lower, because scoring well on benchmarks is not the same as producing answers users actually like. In everyday use, Claude and GPT often feel more natural.
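Chatbot Arena's rankings come from pairwise human votes rather than fixed answer keys. One common way to turn such votes into a leaderboard is an Elo-style rating update; this is only an illustration of the idea, not the Arena's exact statistical method (which uses more careful model fitting).

```python
# Turn pairwise "A beat B" votes into ratings with a basic Elo update.
# Illustrative only; vote data and model names are made up.

from collections import defaultdict

K = 32  # step size of each rating update

def expected_win(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings, winner: str, loser: str) -> None:
    e = expected_win(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e)
    ratings[loser] -= K * (1 - e)

ratings = defaultdict(lambda: 1000.0)
votes = [("model_a", "model_b"), ("model_c", "model_a"), ("model_c", "model_b")]
for winner, loser in votes:
    record_vote(ratings, winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```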
Benchmarks are a starting filter; choosing a model for production still requires your own eval set.

When benchmarks help:
- Pre-filtering when selecting a model
- Understanding a new model's capabilities (MMLU 50 → 90 is a huge gap)
- Building a domain-specific bench for your company's tasks (see the sketch below)
- Comparing fine-tuned vs base models
When benchmarks mislead:
- Picking a model on benchmarks alone (production behaves differently)
- Measuring user satisfaction with a benchmark (satisfaction is subjective)
- Trusting old benchmarks (the model may have seen them: contamination)
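Building "your own eval set" does not have to be elaborate: a few dozen real prompts from your domain plus a pass/fail check per prompt already beats choosing from a public leaderboard alone. A small sketch, where the prompts and the keyword-based checks are hypothetical placeholders for whatever grading rule fits your task:

```python
# Tiny domain-specific eval: real prompts from your product, each with a
# simple pass/fail check. All prompts and checks here are made up.

EVAL_SET = [
    {"prompt": "Summarize this refund policy in two sentences: ...",
     "must_include": ["refund", "14 days"]},
    {"prompt": "Draft a polite reply to a late-delivery complaint: ...",
     "must_include": ["apolog"]},  # matches "apologize" / "apology"
]

def ask_model(prompt: str) -> str:
    """Placeholder: call the model you are evaluating."""
    return "We apologize for the delay and will ship a replacement."

def pass_rate(eval_set) -> float:
    passed = 0
    for item in eval_set:
        answer = ask_model(item["prompt"]).lower()
        if all(kw in answer for kw in item["must_include"]):
            passed += 1
    return passed / len(eval_set)

print(f"pass rate: {pass_rate(EVAL_SET):.0%}")
```

Swap the keyword checks for exact match, regexes, unit tests, or an LLM judge depending on the task; the point is that the items come from your real workload.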
Benchmark contamination (test questions in training data)
MMLU was published in 2020, and GPT-4 was trained on web data, so test items were likely in the training set and scores get inflated. New benchmarks such as GPQA emerge, and the gold standard keeps shifting.
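A rough way to probe contamination is n-gram overlap: check whether long word sequences from the test items appear verbatim in the training corpus. A toy sketch of the idea (real checks run over terabytes of text and use smarter matching; the corpus and question below are invented):

```python
# Toy contamination check: flag a test item if any of its 8-word
# sequences appears verbatim in the training corpus.

def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(test_item: str, corpus_ngrams: set, n: int = 8) -> bool:
    return bool(ngrams(test_item, n) & corpus_ngrams)

training_corpus = "the quick brown fox jumps over the lazy dog near the river bank"
test_question = "Which animal jumps over the lazy dog near the river bank today?"

corpus_ngrams = ngrams(training_corpus)
print(looks_contaminated(test_question, corpus_ngrams))  # True: overlap found
```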
Single-metric focus
A model can score 92 on MMLU while users reject its answers in production. MMLU doesn't measure whether answers are helpful, harmless, or honest, so evaluation needs to be multi-dimensional.
Synthetic vs real-task gap
Multiple-choice exams don't reflect real-world work. SWE-bench (real GitHub issues) and Chatbot Arena (human pairwise votes) are more trustworthy, but harder to run and score.
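For contrast with a multiple-choice score, a SWE-bench-style check is pass/fail against a real repository: apply the model's proposed patch, then run the project's own test suite. A rough sketch of that flow using git and pytest; the repository path and patch file name are hypothetical.

```python
# Rough SWE-bench-style check: apply a model-generated patch to a
# checked-out repo and see whether the test suite passes.
# Paths and file names are hypothetical.

import subprocess

def apply_and_test(repo_dir: str, patch_file: str) -> bool:
    # Try to apply the model's proposed fix.
    applied = subprocess.run(
        ["git", "apply", patch_file],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly
    # Run the test suite; exit code 0 means every test passed.
    tests = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return tests.returncode == 0

print(apply_and_test("path/to/checked-out-repo", "model_fix.patch"))
```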