Knowledge Distillation
Teacher → Student transfer
Train a smaller 'student' model using outputs from a large 'teacher' — student keeps most of teacher's capability with far fewer params.
Knowledge distillation, proposed by Hinton et al. in 2015, is a model-compression method. The idea: teach a small student not just the teacher's final labels, but its probability distributions (soft labels).
In classic supervised training the label is "dog" (a hard label). In distillation it's "87% dog, 8% wolf, 3% fox, 2% other" (a soft label). The soft distribution teaches the student not only the correct answer but also the teacher's nuance: how the classes relate to one another.
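In code, the soft-label idea usually becomes a temperature-scaled KL-divergence term between teacher and student outputs, mixed with the ordinary cross-entropy on the hard labels. A minimal PyTorch sketch; the temperature T, the weight alpha, and the toy tensors are illustrative choices, not values from the text:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
    """Hinton-style KD loss: soft-label KL term plus hard-label cross-entropy.

    T softens both distributions so small class probabilities ("8% wolf")
    still carry signal; alpha balances mimicking the teacher against
    fitting the ground-truth labels. Both are illustrative defaults.
    """
    # Soft targets from the teacher, log-probabilities from the student.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)

    # KL divergence between the softened distributions.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    kd_term = F.kl_div(log_soft_student, soft_targets, reduction="batchmean") * (T ** 2)

    # Ordinary cross-entropy against the hard labels ("dog").
    ce_term = F.cross_entropy(student_logits, hard_labels)

    return alpha * kd_term + (1 - alpha) * ce_term

# Toy usage: a batch of 4 examples over 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```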
Result: much of a 175B teacher's knowledge can be squeezed into a 7B student. The accuracy drop is typically modest (often in the 5-10% range) in exchange for an order-of-magnitude gain in speed and cost. DistilBERT is the canonical distillation product, and the Phi series leans heavily on teacher-generated data, a closely related recipe.
Like a senior doctor passing experience on to a new resident. Telling the resident only whether each diagnosis was right or wrong isn't enough; saying "I think it's 70% X, 20% Y, 10% Z" is far more instructive. That nuance builds the resident's intuition quickly.
OpenAI's GPT-4o-mini is widely believed to be a distilled GPT-4o (not officially confirmed). It reaches roughly 85-90% of GPT-4o's benchmark scores at a small fraction of the price and latency, which makes it a common choice for high-volume production systems.
DeepSeek's R1-Distill-Llama-70B: a Llama 70B model fine-tuned on reasoning outputs generated by DeepSeek R1. It clearly beats the base Llama 70B on math and code while being much smaller and cheaper to run than R1 itself, and it was released with open weights.
Anthropic's Claude Haiku is likely also a distillation of Sonnet/Opus.
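With large language models, the R1-Distill recipe above is sequence-level distillation: the teacher's generated text, not its logits, becomes ordinary fine-tuning data for the student. A rough sketch with Hugging Face transformers; the model names, prompts, and hyperparameters are placeholders, and a real run would add a proper dataset, batching, prompt masking, and a trainer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "teacher-model"   # placeholder, e.g. an open-weight reasoning model
student_name = "student-model"   # placeholder, e.g. a smaller base model

# 1) Collect teacher outputs as plain text. No logit access is needed:
#    the "soft" knowledge is carried by the generated reasoning traces.
teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)
prompts = ["Solve: 12 * 17 = ?", "Write a function that reverses a string."]
distill_texts = []
for p in prompts:
    enc = teacher_tok(p, return_tensors="pt")
    out = teacher.generate(**enc, max_new_tokens=512, do_sample=False)
    distill_texts.append(teacher_tok.decode(out[0], skip_special_tokens=True))

# 2) Fine-tune the student on the teacher's outputs with ordinary
#    next-token cross-entropy (standard SFT on teacher-generated data).
student_tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
for text in distill_texts:
    batch = student_tok(text, return_tensors="pt", truncation=True, max_length=1024)
    # A real run would mask the prompt tokens in the labels.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```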
When to use it
- Production where the big model is too expensive: serve a smaller distilled version instead.
- Latency-critical applications: the smaller model responds much faster.
- Edge deployment: run the distilled model on a phone or in the browser.
- Releasing an open model: distill from a larger closed one (mind the legal constraints below).
When it falls short
- When you can't access the teacher (a closed API behind a paywall): collecting enough outputs for distillation is hard and expensive.
- Diverse, long-tail tasks: the student excels in the narrow domain it was distilled on but may be weak in general.
- When hard reasoning is critical (olympiad-level math, etc.): the student's ceiling stays below the teacher's.
Copyright and ToS violations
OpenAI's ToS forbids using GPT outputs to train competing models, so collecting ChatGPT responses for distillation carries legal risk. Open-weight models (Llama, Mistral) are safer choices for a teacher.
Soft label quality
Wherever the teacher is wrong, the student learns the same mistake. The teacher's quality is the student's ceiling: a bad teacher produces a bad student.
Distribution mismatch
Distill from the teacher on domain X and then deploy the student on domain Y, and performance drops. Keep the distillation data as close as possible to the production data.