Quantization
Model compression
Storing model weights at lower precision (FP16 → INT8 → INT4) to shrink size, memory, and cost — with a small accuracy hit.
A trained model's weights are usually stored as FP32 (32-bit floating point). Quantization represents those numbers with fewer bits: FP16, INT8, INT4. Going from 32 bits per weight to 4 cuts weight memory by 8× (4× relative to an FP16 checkpoint).
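The arithmetic above can be checked in a few lines. This is a minimal sketch that counts weight storage only; `weight_memory_gb` is a hypothetical helper name, and the estimate ignores activations, the KV cache, and any per-block scale overhead.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10**9 bytes), ignoring all overhead."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 70e9  # a 70B-parameter model
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(n_params, bits):.0f} GB")
```

Real quantized files come out somewhat larger than these nominal figures because they also store scales and keep a few tensors at higher precision.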
Method types:
- PTQ (Post-Training Quantization): shrink weights after training. Fast, but with a slight accuracy loss.
- QAT (Quantization-Aware Training): teach the model during training to tolerate low precision. Better accuracy, more expensive.
- GPTQ, AWQ, GGUF: what you meet in production. GPTQ and AWQ are PTQ algorithms; GGUF is the llama.cpp file format.
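To make the PTQ idea concrete, here is a minimal sketch of the simplest variant: symmetric round-to-nearest quantization with a single absmax scale per tensor. It is not GPTQ or AWQ, which add error compensation and activation-aware scaling on top; the function names and the toy Gaussian "layer" are assumptions for illustration.

```python
import random


def quantize_absmax(weights, bits=8):
    """Symmetric round-to-nearest quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8, 7 for INT4
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]  # toy "layer"

for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"INT{bits}: scale={scale:.6f}, max abs error={max_err:.6f}")
```

The rounding error per weight is bounded by half the scale, so fewer bits (a coarser scale) means a larger worst-case error. This measures weight reconstruction error, not task accuracy.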
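Result: a 4-bit 70B Llama needs roughly 35-40 GB for its weights, so it fits on a single 48 GB GPU or a pair of 24 GB consumer GPUs instead of a multi-GPU server. Or you can run a 7B on a phone. Typical accuracy loss is ~1-2% at INT8 and ~3-5% at INT4 on general benchmarks, which is usually acceptable.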
Like dropping a photo from 24-bit to 8-bit color. 256 colors instead of 16M. The picture is still recognizable, with slight detail loss. Quantization works the same way: weight precision drops but the model's "picture" mostly survives.
You want to run Llama-3-70B on your laptop. FP16 needs 140 GB VRAM — impossible. Solution:
1. Convert the model to GGUF with llama.cpp and quantize it to Q4_K_M (4-bit): file size 140 GB → 40 GB.
2. A MacBook with 64 GB of unified memory (for example a 64 GB M2 Max) can run it; the 40 GB file alone exceeds a 36 GB machine.
3. Inference speed: ~12 tokens/sec (close to human reading rate).
4. Accuracy loss: 3-4%, unnoticeable for most tasks.
Before 2023, running a 70B model at home was impractical; today quantization makes it routine. Most popular open-weight models on Hugging Face have community GGUF/AWQ variants ready.
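Whether a quantized file actually fits depends on more than its size, because the KV cache grows with context length. The sketch below is a rough estimate under stated assumptions: the Llama-3-70B shape (80 layers, 8 KV heads, head dimension 128), an FP16 cache, and a hypothetical `usable_fraction` of 0.75 standing in for memory the OS and other processes keep for themselves.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """KV cache size in GB: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / 1e9


def fits(model_file_gb, kv_gb, memory_gb, usable_fraction=0.75):
    """True if weights plus KV cache fit in the usable share of memory."""
    return model_file_gb + kv_gb <= memory_gb * usable_fraction


# Assumed Llama-3-70B shape: 80 layers, 8 KV heads, head dimension 128.
kv = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, context_len=8192)
print(f"KV cache at 8K context: {kv:.2f} GB")

for memory_gb in (32, 64):
    verdict = "fits" if fits(40, kv, memory_gb) else "does not fit"
    print(f"{memory_gb} GB machine, 40 GB model file: {verdict}")
```

Treat the result as a first filter only; actual headroom varies by runtime, OS settings, and context length.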
Format family comparison (for a 70B model; sizes are approximate):
- FP32: 280 GB, training default, wasted at inference
- FP16 / BF16: 140 GB, the inference baseline
- FP8: 70 GB, requires H100-class or newer hardware, ~1-2% accuracy hit
- INT8: 70 GB, broader hardware support
- AWQ (4-bit): ~35 GB, vLLM/HF favorite, preserves accuracy
- GPTQ (4-bit): ~35 GB, AWQ alternative
- GGUF Q8_0: ~75 GB, llama.cpp standard, near-FP16
- GGUF Q5_K_M: ~50 GB, very high quality, balanced
- GGUF Q4_K_M: ~40 GB, ★ recommended sweet spot
- GGUF Q3_K_M: ~34 GB, small but acceptable
- GGUF Q2_K / IQ2_XS: ~21-26 GB, experimental only
GGUF files run larger than the nominal bit width suggests because each block of weights also stores its own scale and some tensors are kept at higher precision.
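Sizes like these follow from bits per weight, and block-quantized formats carry a little extra for per-block scales. As a small worked example, the sketch below computes the effective bits per weight for the two simplest llama.cpp layouts, Q8_0 and Q4_0, which store 32 weights plus one FP16 scale per block. The K-quant formats (Q4_K_M and friends) use more elaborate layouts that this sketch does not model; `block_bits_per_weight` is a hypothetical helper.

```python
def block_bits_per_weight(block_size, weight_bits, scale_bits=16):
    """Effective bits per weight when each block stores one extra scale."""
    return (block_size * weight_bits + scale_bits) / block_size


def file_size_gb(n_params, bits_per_weight):
    """Approximate file size in GB for a given effective bit width."""
    return n_params * bits_per_weight / 8 / 1e9


q8_0 = block_bits_per_weight(block_size=32, weight_bits=8)
q4_0 = block_bits_per_weight(block_size=32, weight_bits=4)

print(f"Q8_0: {q8_0} bits/weight, 70B model = {file_size_gb(70e9, q8_0):.1f} GB")
print(f"Q4_0: {q4_0} bits/weight, 70B model = {file_size_gb(70e9, q4_0):.1f} GB")
```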
Ecosystem split: vLLM + HF = AWQ, GPTQ, FP8. llama.cpp + Ollama + LM Studio = GGUF (Q-formats). MLX = its own 4bit/8bit variants. Formats don't transfer — pull the right one for your runtime.
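One way to avoid pulling the wrong file is to encode this split as a lookup and check it before downloading. The table below simply transcribes the split described above; it is an illustrative sketch, not an authoritative compatibility matrix, and the runtime and format keys are names chosen for this example.

```python
# Runtime -> formats, transcribed from the ecosystem split described above.
SUPPORTED_FORMATS = {
    "vllm": {"awq", "gptq", "fp8"},
    "transformers": {"awq", "gptq", "fp8"},
    "llama.cpp": {"gguf"},
    "ollama": {"gguf"},
    "lm studio": {"gguf"},
    "mlx": {"mlx-4bit", "mlx-8bit"},
}


def is_compatible(runtime: str, fmt: str) -> bool:
    """True if the (assumed) table lists this format for this runtime."""
    return fmt.lower() in SUPPORTED_FORMATS.get(runtime.lower(), set())


print(is_compatible("vLLM", "AWQ"))        # listed for vLLM in the table
print(is_compatible("llama.cpp", "AWQ"))   # not listed for llama.cpp
```

Runtimes add formats over time, so check the current documentation of your serving stack rather than relying on a static table.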
# 1. HF safetensors → GGUF FP16
python3 convert_hf_to_gguf.py \
    /path/to/Llama-3.1-8B-Instruct \
    --outtype f16 \
    --outfile model-f16.gguf

# 2. FP16 → Q4_K_M (the most common)
./build/bin/llama-quantize \
    model-f16.gguf model-Q4_K_M.gguf Q4_K_M

# Resulting size:
#   f16    = 16.0 GB
#   Q8_0   = 8.5 GB (~1% accuracy hit)
#   Q5_K_M = 5.7 GB (~2%)
#   Q4_K_M = 4.6 GB (~3-4%) ★
#   Q3_K_M = 3.8 GB (~5-7%)
#   Q2_K   = 3.0 GB (testing only)

# casperhansen/llama-3-70b-instruct-awq → 35 GB
# Fits on a single A100 80GB (FP16 needs 2× A100)
from vllm import LLM, SamplingParams

llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",
    quantization="awq",
    tensor_parallel_size=1,
    max_model_len=8192,
    gpu_memory_utilization=0.90,
)
params = SamplingParams(temperature=0.3, max_tokens=512)
out = llm.generate(["Hello!"], params)
print(out[0].outputs[0].text)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",        # NormalFloat-4
    bnb_4bit_use_double_quant=True,   # also quantizes the scales, saving ~0.4 bits per weight
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)
# 70B FP16: 140 GB → NF4: ~35 GB
# This is the substrate for QLoRA fine-tuning.

When to use:
- Running big models on limited VRAM/GPU budget
- Edge deployment (phone, embedded, browser)
- Cutting inference cost (cloud GPU hours are pricey)
- Fitting more concurrent users on the same GPU
When not to use:
- High-stakes domains (medical diagnosis, financial analysis): a small accuracy hit is costly
- Training from scratch: the quantization described here targets inference (QAT and QLoRA are separate techniques)
- Very small models (~1-3B): overhead not worth the savings
- Edge cases (long context, multilingual): quantized models can drop noticeably
Hard to measure accuracy loss
General benchmarks (MMLU, HumanEval) show small drops, but your specific use case may degrade more. Always test on your own eval set.
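A minimal sketch of such a check: run the same prompts through the baseline and the quantized model and compare scores. Here `baseline` and `quantized` are stand-in lookups rather than real model calls, the four-item eval set is invented for illustration, and exact match is only one possible metric.

```python
def exact_match_accuracy(generate, eval_set):
    """Fraction of prompts whose output matches the expected answer exactly."""
    hits = sum(generate(prompt).strip() == expected for prompt, expected in eval_set)
    return hits / len(eval_set)


def compare(baseline, quantized, eval_set):
    """Score both models on the same eval set and report the accuracy drop."""
    base = exact_match_accuracy(baseline, eval_set)
    quant = exact_match_accuracy(quantized, eval_set)
    return {"baseline": base, "quantized": quant, "drop": base - quant}


# Toy eval set and stand-in "models" (dict lookups instead of real inference).
eval_set = [("2+2=", "4"), ("Capital of France?", "Paris"),
            ("3*3=", "9"), ("Opposite of hot?", "cold")]
baseline_answers = dict(eval_set)
quantized_answers = {**baseline_answers, "3*3=": "6"}  # one simulated regression

report = compare(baseline_answers.get, quantized_answers.get, eval_set)
print(report)
```

In practice the two callables would wrap your FP16 and quantized deployments, and the eval set would come from real traffic or labeled examples in your domain.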
Don't quantize all layers the same
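Sensitivity varies across layers, so practical schemes mix precisions or protect the weights that matter most. GGUF K-quants such as Q4_K_M keep some tensors at higher precision, AWQ rescales salient weight channels before quantizing, and GPTQ compensates for rounding error layer by layer. Naive 'INT4 everywhere' yields poor quality.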
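One reason a naive pass does poorly is outliers: a single large weight stretches the scale and flattens everything else to zero. The sketch below illustrates a related remedy, group-wise scaling (one scale per 128 weights, a group size commonly used with AWQ and GPTQ), rather than per-layer mixed precision. The toy weights and the two injected outliers are synthetic assumptions.

```python
import random


def fake_quantize(values, bits=4):
    """Quantize to signed integers with one absmax scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def fake_quantize_grouped(values, group_size=128, bits=4):
    """Same quantizer, but with a separate scale for every group of weights."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(fake_quantize(values[start:start + group_size], bits))
    return out


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
weights[10], weights[2000] = 1.5, -1.2              # two synthetic outliers

per_tensor_err = rmse(weights, fake_quantize(weights))
per_group_err = rmse(weights, fake_quantize_grouped(weights))
print(f"one scale for the tensor:  RMSE = {per_tensor_err:.4f}")
print(f"one scale per 128 weights: RMSE = {per_group_err:.4f}")
```

With group-wise scales the outliers only degrade the groups that contain them, which is why the grouped error comes out lower on this toy tensor.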
Format/tool incompatibility
GGUF (llama.cpp), GPTQ (Hugging Face), AWQ, EXL2 — different frameworks use different formats. Verify which format your serving stack supports before committing.