Mixture of Experts
MoE
Not one monolithic model but many small 'expert' sub-models. Each token activates only a few of them, making the model both large in capacity and fast to run.
In a classic Transformer every token flows through every parameter. MoE (Mixture of Experts) works differently: the model is made of many small "expert" sub-networks, and a gating network decides per token which experts to activate.
The result: huge-capacity models that compute lightly. Mixtral 8x7B has 8 experts and 47B total parameters, but each token activates just 2 of them, so only ~13B parameters do work per token. 47B-class quality at roughly 13B speed.
DeepSeek V3 (671B total, 37B active), GPT-4's presumed architecture, Llama 4 — most modern frontier models are MoE.
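To make the gating concrete, here is a minimal sketch of a top-2 MoE feed-forward layer in PyTorch. The class name MoELayer, the dimensions, and the expert design are illustrative assumptions for a toy setup, not Mixtral's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-2 MoE feed-forward block: a gate picks 2 of n_experts per token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)      # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                # x: (n_tokens, d_model)
        logits = self.gate(x)                            # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # best 2 experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for k in range(self.top_k):                      # only the chosen experts run
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens whose k-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```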
Think of a hospital. Not every doctor examines every patient; triage (the gating network) calls in only the relevant specialists. The hospital may have 50 doctors in total, but a given patient needs only 2. Big capacity, efficient use.
You're running Mixtral 8x7B. The user asks: "write a fizz buzz program." As tokens enter:
- "fizz" → gating: activate code expert + math expert.
- "buzz" → gating: same pair fits.
- " " (space) → gating: grammar expert + format expert.
Each token routes to its own "expert pair." Output quality matches a 47B model, compute matches a 13B model. DeepSeek V3 pushed this to 256 experts: 671B total, 37B active.
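Continuing the sketch above, you could peek at the routing decisions like this. The tokens and printed indices are made up; a trained gate would produce meaningful, stable patterns.

```python
layer = MoELayer(d_model=512, n_experts=8, top_k=2)
tokens = torch.randn(3, 512)                    # stand-ins for "fizz", "buzz", " "
_, chosen = layer.gate(tokens).topk(2, dim=-1)  # which 2 experts each token would use
print(chosen)                                   # e.g. tensor([[4, 1], [4, 1], [6, 2]])
```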
- When you need a very large model but latency matters (smart tradeoff)
- Multi-task specialization — one expert for math, one for code, one for translation
- Cutting inference cost — fewer active params = cheaper compute
- If a small model does the job — MoE infra is complex
- Single-task workloads — you don't benefit from expert diversity
- Tight VRAM — total params still must load; only compute is sparse
One expert handles everything (expert collapse)
During training the gating network may funnel everything to 1-2 experts; the rest atrophy. A load-balancing loss is mandatory.
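One common remedy is the auxiliary load-balancing loss from the Switch Transformer line of work. The sketch below assumes top-1 routing statistics and is only one of several formulations.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(gate_logits, top1_idx, n_experts):
    """Penalty that is smallest when tokens spread evenly across experts."""
    probs = F.softmax(gate_logits, dim=-1)                            # (n_tokens, n_experts)
    frac_routed = F.one_hot(top1_idx, n_experts).float().mean(dim=0)  # f_i: share of tokens sent to expert i
    mean_prob = probs.mean(dim=0)                                     # P_i: average gate probability for expert i
    return n_experts * torch.sum(frac_routed * mean_prob)             # add to the main loss with a small coefficient
```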
VRAM ≠ active params
Mixtral 8x7B needs ~90GB of VRAM in 16-bit precision, not 13GB: all 47B parameters must be loaded. The active-parameter count only reduces compute per token.
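A rough back-of-envelope check (16-bit weights only, ignoring KV cache and activations) shows why:

```python
total_params  = 47e9    # every expert must sit in memory
active_params = 13e9    # but only these are multiplied per token
bytes_per_param = 2     # fp16 / bf16
print(f"weights alone: ~{total_params * bytes_per_param / 1e9:.0f} GB")      # ~94 GB
print(f"per-token compute scales with ~{active_params / 1e9:.0f}B params")   # ~13B
```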
Distillation and fine-tuning are hard
Distilling an MoE into a dense model is messier than fine-tuning a single dense base, and the tooling ecosystem is still maturing.