Quantization
Model compression
Storing model weights at lower precision (FP16 → INT8 → INT4) to shrink size, memory, and cost — with a small accuracy hit.
A trained model's weights are usually stored as FP32 (32-bit floating-point) numbers. Quantization represents those numbers with fewer bits: FP16, INT8, or INT4. Dropping each weight from 32 bits to 4 bits means 8× less memory.
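As a minimal sketch of the idea, here is symmetric per-tensor INT8 quantization in NumPy (the function names are illustrative, not a real library API): each float is mapped to an integer in [-127, 127] via one shared scale, then mapped back approximately.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: map floats onto the grid [-127, 127]."""
    scale = np.abs(w).max() / 127.0                    # one scale for the tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate floats from the INT8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)           # stand-in layer weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)                             # 0.25: 8 bits vs 32 bits
print(float(np.abs(w - w_hat).max()))                  # rounding error <= scale / 2
```

The rounding error per weight is bounded by half a quantization step (`scale / 2`), which is why accuracy degrades gradually rather than collapsing.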
Method types:
- PTQ (Post-Training Quantization): shrink weights after training. Fast, but with a slight accuracy loss.
- QAT (Quantization-Aware Training): teach the model during training to tolerate low precision. Better accuracy, more expensive.
- GPTQ, AWQ, GGUF: practical algorithms and formats used in production.
Result: a 70B Llama that needs 140 GB at FP16 shrinks to roughly 35-40 GB at 4-bit, small enough for high-end consumer hardware, and a 7B can run on a phone. Typical accuracy loss is ~1-2% at INT8 and ~3-5% at INT4, usually acceptable.
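The memory arithmetic is easy to check. A small helper (illustrative name, weights only, ignoring KV cache and activations):

```python
def model_memory_gb(n_params: float, bits: int) -> float:
    """Weight memory in GB: parameters x bits per weight / 8 bits per byte."""
    return n_params * bits / 8 / 1e9

# A 70B-parameter model at different precisions
for bits in (32, 16, 8, 4):
    print(f"70B @ {bits:>2}-bit: {model_memory_gb(70e9, bits):.0f} GB")
# 70B @ 32-bit: 280 GB
# 70B @ 16-bit: 140 GB
# 70B @  8-bit: 70 GB
# 70B @  4-bit: 35 GB
```

Real quantized files run slightly larger than this lower bound because formats like Q4_K_M keep per-block scales and leave some layers at higher precision.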
Like dropping a photo from 24-bit to 8-bit color. 256 colors instead of 16M. The picture is still recognizable, with slight detail loss. Quantization works the same way: weight precision drops but the model's "picture" mostly survives.
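The color analogy can be made literal. This sketch (the `reduce_depth` helper is hypothetical) snaps each 8-bit color channel onto a handful of levels, exactly the way quantization snaps weights onto a coarse grid:

```python
import numpy as np

def reduce_depth(img, bits):
    """Keep only `bits` of precision per channel, snapping to bucket centers."""
    step = 256 // (1 << bits)               # bucket width, e.g. 32 for 3 bits
    return (img // step) * step + step // 2 # snap each value to its bucket center

img = np.random.randint(0, 256, (2, 2, 3), dtype=np.uint8)  # toy image
img3 = reduce_depth(img, 3)                 # 8 levels/channel -> 512 colors total

# Per-pixel error is at most half a bucket (16 here): recognizable, not identical
print(int(np.abs(img.astype(int) - img3.astype(int)).max()))
```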
You want to run Llama-3-70B on your laptop. FP16 needs 140 GB VRAM — impossible. Solution:
1. Convert the model to GGUF with llama.cpp and quantize to Q4_K_M (4-bit): file size 140 GB → 40 GB.
2. An M2 Max MacBook with 64 GB of unified memory can run it.
3. Inference speed: ~12 tokens/sec, comfortable for interactive use.
4. Accuracy loss: 3-4%, unnoticeable for most tasks.
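The steps above map onto the llama.cpp toolchain roughly like this; exact script and binary names vary by llama.cpp version (recent checkouts ship `convert_hf_to_gguf.py` and `llama-quantize`), and the model path is a placeholder:

```shell
# 1. Convert the Hugging Face checkpoint to a full-precision GGUF file
python convert_hf_to_gguf.py ./Llama-3-70B --outfile llama3-70b-f16.gguf

# 2. Quantize to 4-bit Q4_K_M: roughly 140 GB -> 40 GB on disk
./llama-quantize llama3-70b-f16.gguf llama3-70b-Q4_K_M.gguf Q4_K_M

# 3. Run local inference against the quantized file
./llama-cli -m llama3-70b-Q4_K_M.gguf -p "Hello"
```

In practice most people skip steps 1-2 and download a ready-made GGUF from Hugging Face.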
Before 2023, running a 70B at home was impossible; today quantization makes it standard practice. Every popular model on Hugging Face has GGUF/AWQ variants ready.
Use when:
- Running big models on limited VRAM/GPU budget
- Edge deployment (phone, embedded, browser)
- Cutting inference cost (cloud GPU hours are pricey)
- Fitting more concurrent users on the same GPU
Avoid when:
- High-stakes domains (medical diagnosis, financial analysis) — small accuracy hit is costly
- For training — quantization is for inference, not training
- Very small models (~1-3B) — overhead not worth the savings
- Edge cases (long context, multilingual) — quantized models can drop noticeably
Hard to measure accuracy loss
General benchmarks (MMLU, HumanEval) show small drops, but your specific use case may degrade more. Always test on your own eval set.
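A minimal harness for that comparison might look like the sketch below; the `run_fp16` / `run_int4` stubs are hypothetical stand-ins for your actual inference calls against the full-precision and quantized variants.

```python
def accuracy(preds, labels):
    """Fraction of predictions matching the reference labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical stubs: replace with real calls to both model variants.
def run_fp16(prompts):  return ["A", "B", "A", "C"]
def run_int4(prompts):  return ["A", "B", "C", "C"]

prompts = ["q1", "q2", "q3", "q4"]   # your own eval set, not MMLU
labels  = ["A", "B", "A", "C"]

drop = accuracy(run_fp16(prompts), labels) - accuracy(run_int4(prompts), labels)
print(f"accuracy drop on our eval set: {drop:.1%}")
```

The point is the side-by-side run on your data: a 1% drop on MMLU tells you little about a 10% drop on your domain-specific prompts.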
Don't quantize all layers the same
Mixed precision helps: keep sensitive layers (e.g., attention output projections) at FP16 and quantize the rest to INT4. AWQ and GPTQ make such choices automatically. Naive "INT4 everywhere" yields poor quality.
Format/tool incompatibility
GGUF (llama.cpp), GPTQ (Hugging Face), AWQ, EXL2 — different frameworks use different formats. Verify which format your serving stack supports before committing.