Quantization — Explained · AI Sözlüğü

Definition

A trained model's weights are typically stored as floating-point numbers, classically FP32 (32-bit). Quantization represents those numbers with fewer bits: FP16, INT8, or INT4. Going from 32 bits to 4 bits per weight cuts weight memory by a factor of 8.
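The arithmetic behind that ratio can be sketched in a few lines. This is a rough estimate that counts weights only (no activations, KV cache, or per-block scale factors); the function name `weight_memory_gb` and the 7B parameter count are illustrative choices, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Illustrative 7B-parameter model at each precision mentioned above.
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Real quantized files tend to be somewhat larger than this estimate, because formats store scale factors alongside the low-bit weights.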

Method types:
- PTQ (Post-Training Quantization): shrink the weights after training. Fast, but with a slight accuracy loss.
- QAT (Quantization-Aware Training): teach the model during training to tolerate low precision. Better accuracy, more expensive.
- GPTQ, AWQ, GGUF: what you meet in production. GPTQ and AWQ are post-training algorithms; GGUF is the file format used by llama.cpp.
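To make the post-training idea concrete, here is a minimal sketch of the simplest scheme, symmetric "absmax" INT8 quantization of a single list of weights. It is a toy illustration in pure Python, not how GPTQ or AWQ work internally; the function names and sample weights are made up for the example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.824, -1.27, 0.031, 0.5, -0.417]  # made-up sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", q)
print("max round-trip error:", max_err)
```

The round-trip error is bounded by half the scale, which is why a single large outlier weight (which inflates the scale) hurts every other weight in the same group.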

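Result: a 70B Llama, which needs about 140 GB at FP16, shrinks to roughly 35-40 GB at 4 bits. That is still more than a single 24 GB consumer GPU holds, but it fits across two such GPUs or on a machine with enough unified memory. A 7B model at 4 bits (about 3.5 GB) can run on a phone. Typical accuracy loss is often quoted as ~1-2% at INT8 and ~3-5% at INT4, which is usually acceptable, though it varies by model and task.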

Analogy

Like dropping a photo from 24-bit to 8-bit color. 256 colors instead of 16M. The picture is still recognizable, with slight detail loss. Quantization works the same way: weight precision drops but the model's "picture" mostly survives.

Real-world example

You want to run Llama-3-70B on your laptop. FP16 needs 140 GB VRAM — impossible. Solution:

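1. Convert the model to GGUF with llama.cpp and quantize it to Q4_K_M (4-bit): file size 140 GB → about 40 GB.
2. An M2 Max MacBook with 64 GB unified memory can run it; the whole ~40 GB file plus the KV cache has to fit in memory.
3. Inference speed: on the order of 10 tokens/sec (around human reading rate), depending on hardware and context length.
4. Accuracy loss: around 3-4%, unnoticeable for most tasks.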
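Whether a given machine can hold the quantized file is simple arithmetic. The sketch below assumes a fixed reserve for the operating system, KV cache, and buffers; the 8 GB default and the memory sizes in the loop are placeholder assumptions to adjust for your own setup, and `fits_in_memory` is a hypothetical helper.

```python
def fits_in_memory(model_gb: float, total_memory_gb: float, reserve_gb: float = 8.0) -> bool:
    """True if the model file plus an assumed reserve fits in total memory."""
    return model_gb + reserve_gb <= total_memory_gb


# A ~40 GB quantized file against a few illustrative memory sizes.
for memory_gb in (32, 64, 96):
    verdict = "fits" if fits_in_memory(40, memory_gb) else "does not fit"
    print(f"{memory_gb} GB: {verdict}")
```

Long contexts grow the KV cache, so the reserve that works for short prompts may be too small for long ones.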

Before 2023, running a 70B model at home was impractical; today quantization makes it routine. Most popular models on Hugging Face have GGUF or AWQ variants ready to download.

When to use

Running big models on limited VRAM/GPU budget
Edge deployment (phone, embedded, browser)
Cutting inference cost (cloud GPU hours are pricey)
Fitting more concurrent users on the same GPU

When not to use

High-stakes domains (medical diagnosis, financial analysis): even a small accuracy hit is costly
Training: post-training quantization is an inference technique; training itself normally runs at higher precision
Very small models (~1-3B): the memory saved is small and rarely worth the quality loss
Edge cases (long context, multilingual): quantized models can degrade noticeably

Common pitfalls

Hard to measure accuracy loss

General benchmarks (MMLU, HumanEval) show small drops, but your specific use case may degrade more. Always test on your own eval set.
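A minimal sketch of such a comparison follows. It assumes each model can be wrapped as a function from prompt to answer and that exact-match scoring suits your task; the toy eval set and the dictionary-backed "models" are stand-ins for real inference calls.

```python
def accuracy(model_fn, eval_set):
    """Fraction of (prompt, expected) pairs the model answers exactly."""
    hits = sum(1 for prompt, expected in eval_set if model_fn(prompt) == expected)
    return hits / len(eval_set)


def accuracy_drop(baseline_fn, quantized_fn, eval_set):
    """Baseline accuracy minus quantized accuracy on the same eval set."""
    return accuracy(baseline_fn, eval_set) - accuracy(quantized_fn, eval_set)


# Toy stand-ins: replace these with calls to your real models.
eval_set = [("2+2", "4"), ("capital of France", "Paris"),
            ("3*3", "9"), ("opposite of hot", "cold")]
baseline_answers = dict(eval_set)                     # answers everything correctly
quantized_answers = {**baseline_answers, "3*3": "6"}  # gets one answer wrong

drop = accuracy_drop(baseline_answers.get, quantized_answers.get, eval_set)
print(f"accuracy drop: {drop:.2%}")
```

With only a handful of examples a single miss swings the number a lot, so a real eval set should be large enough that the measured drop is not noise.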

Don't quantize all layers the same

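Mixed precision keeps sensitive layers (for example, attention output) at FP16 and the rest at INT4. Practical methods avoid naive uniform rounding in different ways: AWQ protects the most salient weights, GPTQ compensates for rounding error as it quantizes, and llama.cpp presets such as Q4_K_M store some tensors at higher precision. Naive 'INT4 everywhere' yields poor quality.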
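The memory cost of keeping a few layers at higher precision can be estimated with a per-layer plan. In this sketch the layer names, parameter counts, and the choice of which layer to protect are all hypothetical; real tools decide this from measured sensitivity rather than a hand-written table.

```python
def average_bits(layer_params, plan, default_bits=4):
    """Parameter-weighted average bit width under a per-layer precision plan."""
    total_params = sum(layer_params.values())
    total_bits = sum(n * plan.get(name, default_bits) for name, n in layer_params.items())
    return total_bits / total_params


# Hypothetical parameter counts (in millions) for one transformer block.
layer_params = {
    "attn.q_proj": 16, "attn.k_proj": 16, "attn.v_proj": 16, "attn.o_proj": 16,
    "mlp.up": 64, "mlp.down": 64,
}
plan = {"attn.o_proj": 16}  # assumed sensitive layer kept at 16 bits; the rest use 4

print("uniform INT4:", average_bits(layer_params, {}))
print("mixed plan:  ", average_bits(layer_params, plan))
```

Protecting a small layer raises the average only modestly, which is why mixed schemes can recover quality at a limited memory cost.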

Format/tool incompatibility

GGUF (llama.cpp), GPTQ (Hugging Face), AWQ, EXL2 — different frameworks use different formats. Verify which format your serving stack supports before committing.