AI Dictionary
Advanced · ~2 min read · #moe #architecture #sparse-models

Mixture of Experts (MoE)

Not one model, but many small "expert" sub-models; each query activates only a few, making models both big and fast.

[Diagram: MIXTURE OF EXPERTS · ACTIVATE ONLY WHAT'S NEEDED. A query ("math question") enters the gating network; of experts E1–E7, only E2 and E4 activate. 47B params exist, but only ~13B compute per token.]
Definition

In a classic Transformer every token flows through every parameter. MoE (Mixture of Experts) works differently: the model is made of many small "expert" sub-networks, and a gating network decides per token which experts to activate.

The result: huge-capacity models that compute lightly. Mixtral 8x7B has 8 experts and 47B total parameters, but each token activates just 2 → ~13B params compute per token. 47B-quality at 13B-speed.
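
To make the mechanics concrete, here is a minimal PyTorch sketch of a top-2 MoE feed-forward layer. The class name, layer sizes, and loop-based dispatch are illustrative assumptions, not Mixtral's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Feed-forward MoE layer: a gating network picks top-k experts per token."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)   # the gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                            # x: (n_tokens, d_model)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):    # only routed tokens reach each expert
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel():
                out[rows] += weights[rows, slots, None] * expert(x[rows])
        return out

layer = MoELayer()
print(layer(torch.randn(10, 64)).shape)              # torch.Size([10, 64])
```

Real systems batch tokens per expert with fused kernels instead of a Python loop, but the routing idea is the same.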

DeepSeek V3 (671B total, 37B active), GPT-4's presumed architecture, Llama 4 — most modern frontier models are MoE.

Analogy

A hospital. Not every doctor examines every patient; triage (the gating network) calls in only the relevant specialists. There may be 50 doctors on staff, but a given patient needs only 2. Big capacity, efficient use.

Real-world example

You're running Mixtral 8x7B. A user asks "write a fizz buzz program." As tokens enter:
  • "fizz" → gating: activate the code expert + math expert.
  • "buzz" → gating: the same pair fits.
  • " " (space) → gating: grammar expert + format expert.

Each token routes to its own "expert pair." Output quality matches 47B, compute matches 13B. DeepSeek V3 pushed this to 256 experts — total 671B, active 37B.
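
For intuition, here is a toy peek at the routing decision itself; the gate below is untrained and the token vectors are random stand-ins, so the chosen expert pairs are arbitrary:

```python
import torch

gate = torch.nn.Linear(64, 8)             # toy gating network over 8 experts
tokens = torch.randn(3, 64)               # random stand-ins for "fizz", "buzz", " "
_, chosen = gate(tokens).topk(2, dim=-1)  # top-2 expert indices per token
print(chosen)                             # e.g. tensor([[2, 4], [2, 4], [1, 6]])
```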

When to use
  • When you need a very large model but latency matters (smart tradeoff)
  • Multi-task specialization — one expert for math, one for code, one for translation
  • Cutting inference cost — fewer active params = cheaper compute
When not to use
  • If a small model does the job — MoE infra is complex
  • Single-task workloads — you don't benefit from expert diversity
  • Tight VRAM — total params still must load; only compute is sparse
Common pitfalls

One expert handles everything (expert collapse)

During training the gating network may funnel everything to 1-2 experts; the rest atrophy. A load-balancing loss is mandatory.
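
One common form is the Switch-Transformer-style auxiliary loss; this sketch assumes top-1 assignment and a hypothetical function name:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(gate_logits: torch.Tensor) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss over router logits (tokens, n_experts)."""
    n_experts = gate_logits.shape[-1]
    probs = F.softmax(gate_logits, dim=-1)
    # f: fraction of tokens assigned (top-1) to each expert
    f = F.one_hot(probs.argmax(dim=-1), n_experts).float().mean(dim=0)
    # p: mean router probability mass per expert
    p = probs.mean(dim=0)
    # minimized when both are uniform, i.e. every expert gets 1/n_experts of the traffic
    return n_experts * torch.sum(f * p)

aux = load_balancing_loss(torch.randn(128, 8))  # add (scaled) to the main training loss
```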

VRAM ≠ active params

Mixtral 8x7B needs ~90GB of VRAM, not 13GB: all 47B parameters must load into memory. The active count only reduces compute per token, not memory.
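
Back-of-envelope arithmetic, assuming fp16 (2 bytes per parameter):

```python
total_params, active_params = 47e9, 13e9
bytes_per_param = 2  # fp16
print(f"weights to load: {total_params * bytes_per_param / 1e9:.0f} GB")   # ~94 GB
print(f"per-token compute scales with {active_params / 1e9:.0f}B params")  # ~13B
```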

Distillation and fine-tuning are hard

Distilling an MoE into a dense model is messier than fine-tuning a single dense base, and the tooling ecosystem is still maturing.