n_gpu_layers
GPU layer offload count
A llama.cpp parameter that controls how many layers run on GPU vs CPU. A lifesaver when VRAM is tight.
llama.cpp (and tools built on top of it, like Ollama and LM Studio) can load model layers piecewise onto the GPU. -ngl N (Number of GPU Layers): the first N layers run on the GPU; the rest stay on the CPU in system RAM.
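A minimal sketch of the same knob in llama-cpp-python, which exposes -ngl as the n_gpu_layers constructor argument (the model path here is hypothetical):

```python
from llama_cpp import Llama

# n_gpu_layers mirrors llama.cpp's -ngl flag:
#   0 -> everything on CPU, -1 -> all layers on GPU, N -> first N layers on GPU.
llm = Llama(
    model_path="./llama-3.1-8b-q4_k_m.gguf",  # hypothetical path
    n_gpu_layers=20,  # hybrid: 20 layers on GPU, the rest on CPU
)
```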
Practical values:
- -ngl 0: everything on CPU. Very slow, but uses zero VRAM.
- -ngl 20 (for example): the first 20 layers of a 33-layer 8B model on GPU, the rest on CPU. Hybrid.
- -ngl 99 or -ngl -1: all layers on GPU (fastest).
Speed scaling is non-linear. Full GPU = full speed; even one layer on CPU forces a CPU↔GPU transfer per token at that layer — major bottleneck. Rule of thumb: all or nothing is best; hybrid is acceptable for "make it work at all."
Math: VRAM per layer ≈ model size / num_layers. Llama 3.1 8B (Q4): ~5 GB / 33 layers ≈ 152 MB per layer. With 4 GB of VRAM you can fit ~26 layers (-ngl 26); the remaining 7 stay on CPU. Approximate speed: ~30-50% of full GPU.
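The same arithmetic as a sketch, using the rough average layer size above (it ignores the KV cache, which also needs VRAM; see the pitfalls below):

```python
# Rough estimate of how many layers fit in VRAM. Per-layer sizes vary in
# practice, so treat the result as a starting point, not a guarantee.
def layers_that_fit(model_size_gb: float, num_layers: int, vram_gb: float) -> int:
    per_layer_gb = model_size_gb / num_layers  # ~0.152 GB for 8B Q4
    return min(num_layers, int(vram_gb / per_layer_gb))

print(layers_that_fit(5.0, 33, 4.0))  # -> 26, i.e. -ngl 26
```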
A restaurant kitchen. All cooks in the kitchen (GPU) → max speed. One cook in a side building (CPU) → every dish must travel through there for that step → the line jams. n_gpu_layers controls "how many cooks are in the main kitchen." Full or empty — anything in between is slow but possible.
You have an RTX 4080 with 16 GB VRAM and want to run Llama 3.1 70B Q4 (~40 GB on disk; it doesn't fit). Options:
- -ngl 20: 20 of 80 layers on GPU (~10 GB), 60 on CPU. Speed: ~3 tok/s. Slow, but it runs.
- -ngl 0: all on CPU. Speed: ~0.8 tok/s. Works if you have 64 GB of RAM.
- Drop to a smaller model: 8B Q4 (~5 GB) at -ngl 99, fully on GPU → ~140 tok/s. Same hardware, a 47× speed difference.
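Sanity-checking those numbers with the per-layer arithmetic from above (all figures are the rough estimates used in this entry):

```python
# 70B Q4: ~40 GB across 80 layers -> ~0.5 GB per layer.
per_layer_gb = 40 / 80
print(f"-ngl 20 uses ~{20 * per_layer_gb:.0f} GB VRAM")     # ~10 GB of 16 GB
print(f"8B full-GPU vs 70B hybrid: {140 / 3:.0f}x faster")  # ~47x
```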
Ollama usually auto-tries the equivalent of -ngl 99; OLLAMA_NUM_GPU overrides it. LM Studio's GUI has a "GPU Offload" slider for this.
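In Ollama the same layer count is also a per-request option called num_gpu. A sketch with the official Python client, assuming the llama3.1 model is already pulled:

```python
import ollama  # official Ollama Python client

resp = ollama.generate(
    model="llama3.1",               # assumes this model is already pulled
    prompt="Say hi in five words.",
    options={"num_gpu": 20},        # number of layers to offload to GPU
)
print(resp["response"])
```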
Use it when:
- VRAM is tight but you can't swap models — hybrid offload
- You're sharing one GPU across multiple models — budget each one
- Quantization wasn't enough and you still need some offload
- You're optimizing mixed CPU + GPU hardware
Skip it when:
- The whole model fits on GPU — set -ngl 99 and move on
- You need high-throughput production — hybrid always tanks throughput
- It's a tiny model (1-3B) — the overhead isn't worth it
Forgetting to set ngl
If a GPU is present but -ngl 0 is set, you're running 100% on CPU. The answer to "why is my M2 Max only doing 5 tok/s" is usually this. Defaults vary between tools, so set it explicitly and check the load log, which reports how many layers were actually offloaded.
Cache isn't offloaded
n_gpu_layers budgets weights only, not the KV cache. At 32K context the cache is already ~4 GB. Plan for ngl × layer_size + cache to fit on the GPU.
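A back-of-envelope version of that budget, assuming Llama 3.1 8B's geometry (32 transformer layers, 8 KV heads of dimension 128, fp16 cache); the numbers are assumptions, not measurements:

```python
# KV cache bytes = 2 (K and V) x layers x context x KV heads x head dim x 2 (fp16).
def kv_cache_gb(n_layers=32, n_ctx=32_768, n_kv_heads=8, head_dim=128, bytes_per=2):
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per / 1e9

def vram_budget_gb(ngl: int, layer_gb: float = 0.152) -> float:
    return ngl * layer_gb + kv_cache_gb()  # offloaded weights + cache

print(f"cache at 32K context: {kv_cache_gb():.1f} GB")         # ~4.3 GB
print(f"-ngl 26 plus that cache: {vram_budget_gb(26):.1f} GB") # ~8.2 GB
```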
Hybrid performance illusion
'50% GPU offload' ≠ 50% of full-GPU speed; it's usually worse, because each token incurs CPU↔GPU transfer latency. Benchmark to find what 'X ngl gets Y tok/s' looks like on your specific setup.
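A minimal benchmark sketch with llama-cpp-python (hypothetical model path): it times generation at a few ngl values so you can see the curve on your own hardware:

```python
import time
from llama_cpp import Llama

def tok_per_sec(ngl: int, n_tokens: int = 128) -> float:
    llm = Llama(model_path="./llama-3.1-8b-q4_k_m.gguf",  # hypothetical
                n_gpu_layers=ngl, n_ctx=2048, verbose=False)
    t0 = time.perf_counter()
    out = llm("Write a haiku about GPUs.", max_tokens=n_tokens)
    generated = out["usage"]["completion_tokens"]  # may stop early at EOS
    return generated / (time.perf_counter() - t0)

for ngl in (0, 13, 26, 33):  # none, ~40%, ~80%, all layers
    print(f"ngl={ngl:>2}: {tok_per_sec(ngl):.1f} tok/s")
```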