AI Dictionary
Advanced · ~2 min read · #quantization #compression #inference

Quantization

Model compression

Storing model weights at lower precision (FP16 → INT8 → INT4) to shrink size, memory, and cost — with a small accuracy hit.

[Infographic: Reduce precision, shrink size. A 70B-parameter model: FP32 280 GB (original) → FP16 140 GB → INT8 70 GB → INT4 35 GB (8× smaller). INT4 quantization puts 70B models on a laptop, with a small accuracy hit.]
Definition

A trained model's weights are usually stored as FP32 (32-bit floating point). Quantization represents those numbers with fewer bits: FP16, INT8, or INT4. Each weight drops from 32 bits to as few as 4, an 8× memory reduction.
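To make the arithmetic concrete, here is a minimal symmetric INT8 round trip in NumPy (a sketch of the core idea, not a production kernel):

```python
import numpy as np

w = np.random.randn(4, 4).astype(np.float32)  # FP32 weights, 32 bits each

# Symmetric quantization: map [-max|w|, +max|w|] onto the signed INT8 range
scale = np.abs(w).max() / 127.0
q = np.round(w / scale).clip(-127, 127).astype(np.int8)  # stored: 8 bits each

w_hat = q.astype(np.float32) * scale  # dequantize back for compute
print("max abs error:", np.abs(w - w_hat).max())  # small, but nonzero
```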

Method types:
  • PTQ (Post-Training Quantization): shrink the weights after training. Fast, with a slight accuracy loss (a minimal sketch follows this list).
  • QAT (Quantization-Aware Training): teach the model during training to tolerate low precision. Better accuracy, more expensive.
  • GPTQ, AWQ, GGUF: practical algorithms and formats used in production.
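For example, PTQ can be a single call in PyTorch. This sketch uses dynamic quantization on a toy model (the architecture is illustrative):

```python
import torch
import torch.nn as nn

# Toy FP32 model standing in for a real network (illustrative only)
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantization: Linear weights stored as INT8,
# activations quantized on the fly at inference time
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)  # the Linear layers are now DynamicQuantizedLinear
```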

Result: a 70B Llama quantized to INT4 needs ~35 GB instead of 140 GB, putting it within reach of high-end consumer hardware; a 7B fits on a phone. Typical accuracy loss is ~1-2% at INT8 and ~3-5% at INT4, usually acceptable.
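The memory math behind those figures is simple: bytes ≈ parameters × bits / 8. A quick sanity check:

```python
def weight_gb(n_params: float, bits: int) -> float:
    """Weight-only footprint; ignores KV cache and activations."""
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"70B @ {bits:>2}-bit: {weight_gb(70e9, bits):>4.0f} GB")
# -> 280, 140, 70, 35 GB, matching the figures above
```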

Analogy

Like dropping a photo from 24-bit to 8-bit color: 256 colors instead of 16 million. The picture is still recognizable, with a slight loss of detail. Quantization works the same way: weight precision drops, but the model's "picture" mostly survives.
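The analogy is literal enough to run. With Pillow, palette conversion is the image version of quantization (the filename is hypothetical):

```python
from PIL import Image

img = Image.open("photo.jpg")  # 24-bit RGB: ~16.7M possible colors
img8 = img.convert("P", palette=Image.ADAPTIVE, colors=256)  # 8-bit palette
img8.save("photo_8bit.png")  # still recognizable, slightly less detail
```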

Real-world example

You want to run Llama-3-70B on your laptop. At FP16 it needs 140 GB of memory, which no laptop has. The solution:

1. Convert the model to GGUF with llama.cpp and quantize to Q4_K_M (4-bit): file size drops from 140 GB to ~40 GB.
2. A MacBook with 64 GB of unified memory (e.g., an M2 Max) can run it (see the loading sketch below).
3. Inference speed: ~12 tokens/sec, roughly human reading speed.
4. Accuracy loss: 3-4%, unnoticeable for most tasks.
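Once the GGUF file exists, loading it from Python via the llama-cpp-python bindings looks roughly like this (the model path is hypothetical; memory use depends on your machine):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="llama-3-70b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=4096,
    n_gpu_layers=-1,  # offload every layer to Metal/GPU if memory allows
)
out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```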

Before 2023, running a 70B model at home was out of reach; today quantization makes it standard practice. Nearly every popular model on Hugging Face has ready-made GGUF/AWQ variants.

When to use
  • Running big models on limited VRAM/GPU budget
  • Edge deployment (phone, embedded, browser)
  • Cutting inference cost (cloud GPU hours are pricey)
  • Fitting more concurrent users on the same GPU
When not to use
  • High-stakes domains (medical diagnosis, financial analysis) — small accuracy hit is costly
  • During training: the quantization described here is an inference-time optimization; training runs at higher precision (QAT simulates low precision but still computes in float)
  • Very small models (~1-3B): already cheap to run, and the quality hit can outweigh the modest savings
  • Edge cases (long context, multilingual) — quantized models can drop noticeably
Common pitfalls

Hard to measure accuracy loss

General benchmarks (MMLU, HumanEval) show small drops, but your specific use case may degrade more. Always test on your own eval set.
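A minimal pattern for that test: run both variants over your own prompts and measure how often they agree. Here ask_fp16 and ask_int4 are hypothetical wrappers around your serving stack, and exact-match is a crude but useful first signal:

```python
def agreement(prompts, ask_fp16, ask_int4) -> float:
    """Fraction of prompts where both model variants give the same answer."""
    same = sum(ask_fp16(p).strip() == ask_int4(p).strip() for p in prompts)
    return same / len(prompts)

# e.g. agreement(my_eval_prompts, ask_fp16, ask_int4) -> 0.97
# means the INT4 variant drifts from FP16 on 3% of your prompts
```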

Don't quantize all layers the same

Use mixed precision: keep sensitive layers (e.g., attention output projections) at FP16 and quantize the rest to INT4. Methods like AWQ and GPTQ account for layer sensitivity automatically; a naive "INT4 everywhere" pass yields poor quality.
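As a sketch of what such a plan looks like (layer names and bit choices are hypothetical, for illustration):

```python
# Hypothetical per-layer precision plan: sensitive layers stay at FP16,
# everything else drops to INT4
PRECISION_PLAN = {
    "attn.o_proj": "fp16",  # attention output: quality-sensitive
    "lm_head": "fp16",      # final projection often kept at high precision
}

def precision_for(layer_name: str) -> str:
    return PRECISION_PLAN.get(layer_name, "int4")  # default: INT4

print(precision_for("attn.o_proj"))  # fp16
print(precision_for("mlp.up_proj"))  # int4
```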

Format/tool incompatibility

GGUF (llama.cpp), GPTQ (Hugging Face), AWQ, EXL2 — different frameworks use different formats. Verify which format your serving stack supports before committing.