Knowledge Distillation
Teacher → Student transfer
Train a smaller 'student' model using outputs from a large 'teacher' — student keeps most of teacher's capability with far fewer params.
Knowledge distillation, proposed by Hinton et al. in 2015, is a model-compression method. The idea: teach a small student not just the teacher's final labels, but its probability distributions (soft labels).
In classic supervised training the label is "dog" (a hard label). In distillation it's "87% dog, 8% wolf, 3% fox, 2% other" (a soft label). The soft distribution teaches the student not only the correct answer but also the teacher's nuance: how the classes relate to one another.
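In code, the soft-label idea usually becomes a temperature-scaled KL-divergence term between teacher and student outputs, mixed with the ordinary cross-entropy on the hard labels. A minimal PyTorch sketch; the temperature T, the weight alpha, and the toy tensors are illustrative choices, not values from the text:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
    """Hinton-style KD loss: soft-label KL term plus hard-label cross-entropy.

    T softens both distributions so small class probabilities ("8% wolf")
    still carry signal; alpha balances mimicking the teacher against
    fitting the ground-truth labels. Both are illustrative defaults.
    """
    # Soft targets from the teacher, log-probabilities from the student.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)

    # KL divergence between the softened distributions.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    kd_term = F.kl_div(log_soft_student, soft_targets, reduction="batchmean") * (T ** 2)

    # Ordinary cross-entropy against the hard labels ("dog").
    ce_term = F.cross_entropy(student_logits, hard_labels)

    return alpha * kd_term + (1 - alpha) * ce_term

# Toy usage: a batch of 4 examples over 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```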
Result: much of a 175B teacher's knowledge can be squeezed into a 7B student. The accuracy drop is typically modest (often in the 5-10% range) in exchange for an order-of-magnitude gain in speed and cost. DistilBERT is the canonical distillation product, and the Phi series leans heavily on teacher-generated data, a closely related recipe.
Like a senior doctor passing experience on to a new resident. Telling the resident only whether each diagnosis was right or wrong isn't enough; saying "I think it's 70% X, 20% Y, 10% Z" is far more instructive. That nuance builds the resident's intuition quickly.
OpenAI's GPT-4o-mini is widely believed to be a distilled GPT-4o (not officially confirmed). It reaches roughly 85-90% of GPT-4o's benchmark scores at a small fraction of the price and latency, which makes it a common choice for high-volume production systems.
DeepSeek's R1-Distill-Llama-70B: a Llama 70B model fine-tuned on reasoning outputs generated by DeepSeek R1. It clearly beats the base Llama 70B on math and code while being much smaller and cheaper to run than R1 itself, and it was released with open weights.
Anthropic's Claude Haiku is likely also a distillation of Sonnet/Opus.
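With large language models, the R1-Distill recipe above is sequence-level distillation: the teacher's generated text, not its logits, becomes ordinary fine-tuning data for the student. A rough sketch with Hugging Face transformers; the model names, prompts, and hyperparameters are placeholders, and a real run would add a proper dataset, batching, prompt masking, and a trainer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "teacher-model"   # placeholder, e.g. an open-weight reasoning model
student_name = "student-model"   # placeholder, e.g. a smaller base model

# 1) Collect teacher outputs as plain text. No logit access is needed:
#    the "soft" knowledge is carried by the generated reasoning traces.
teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)
prompts = ["Solve: 12 * 17 = ?", "Write a function that reverses a string."]
distill_texts = []
for p in prompts:
    enc = teacher_tok(p, return_tensors="pt")
    out = teacher.generate(**enc, max_new_tokens=512, do_sample=False)
    distill_texts.append(teacher_tok.decode(out[0], skip_special_tokens=True))

# 2) Fine-tune the student on the teacher's outputs with ordinary
#    next-token cross-entropy (standard SFT on teacher-generated data).
student_tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
for text in distill_texts:
    batch = student_tok(text, return_tensors="pt", truncation=True, max_length=1024)
    # A real run would mask the prompt tokens in the labels.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```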
When to use it
- Production where the big model is too expensive: serve a smaller distilled version instead.
- Latency-critical applications: the smaller model responds much faster.
- Edge deployment: run the distilled model on a phone or in the browser.
- Releasing an open model: distill from a larger closed one (mind the legal constraints below).
When it falls short
- When you can't access the teacher (a closed API behind a paywall): collecting enough outputs for distillation is hard and expensive.
- Diverse, long-tail tasks: the student excels in the narrow domain it was distilled on but may be weak in general.
- When hard reasoning is critical (olympiad-level math, etc.): the student's ceiling stays below the teacher's.
Copyright and ToS violations
OpenAI's ToS forbids using GPT outputs to train competing models, so collecting ChatGPT responses for distillation carries legal risk. Open-weight models (Llama, Mistral) are safer choices for a teacher.
Soft label quality
Wherever the teacher is wrong, the student learns the same mistake. The teacher's quality is the student's ceiling: a bad teacher produces a bad student.
Distribution mismatch
Distill from the teacher on domain X and then deploy the student on domain Y, and performance drops. Keep the distillation data as close as possible to the production data.