AI Dictionary
Intermediate · ~2 min read · #n_gpu_layers #gpu-offload #inference

n_gpu_layers

GPU layer offload count

A llama.cpp parameter that controls how many layers run on GPU vs CPU. A lifesaver when VRAM is tight.

[Diagram: how many layers on GPU vs CPU. With -ngl 5, layers 1-5 run on the GPU while layers 6-8 stay on CPU + RAM. Either all on GPU (-ngl 99) or hybrid; even one CPU layer is a bottleneck.]
Definition

llama.cpp (and tools built on top of it, such as Ollama and LM Studio) can load model layers onto the GPU piecewise. -ngl N (number of GPU layers): the first N layers run on the GPU; the rest stay on CPU + RAM.

Practical values:
  • -ngl 0: everything on CPU. Very slow, but zero VRAM.
  • -ngl 20 (for example): the first 20 of a 33-layer 8B model on GPU. Hybrid.
  • -ngl 99 or -ngl -1: all layers on GPU (fastest).
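
A minimal sketch using the llama-cpp-python binding, where the flag appears verbatim as n_gpu_layers (the model path and layer count here are placeholders):

    from llama_cpp import Llama

    # n_gpu_layers maps 1:1 to llama.cpp's -ngl flag:
    #   0 = everything on CPU, -1 = offload all layers.
    llm = Llama(
        model_path="llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=20,  # hybrid: first 20 layers on GPU, rest on CPU
        n_ctx=4096,       # context size; also affects KV-cache VRAM (see pitfalls)
    )

    out = llm("Why is the sky blue?", max_tokens=64)
    print(out["choices"][0]["text"])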

Speed scaling is non-linear. Full GPU = full speed; even one layer on CPU forces a CPU↔GPU transfer per token at that layer — major bottleneck. Rule of thumb: all or nothing is best; hybrid is acceptable for "make it work at all."

Math: VRAM per layer ≈ model size / num_layers. Llama 3.1 8B (Q4): ~5 GB / 33 layers ≈ 152 MB/layer. With 4 GB VRAM you can fit ~26 layers (-ngl 26); the remaining 7 stay on CPU. Approx speed: ~30-50% of full GPU.
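
The same back-of-envelope math as a small helper (a sketch; real layers aren't perfectly uniform, so treat the result as a starting point, not a guarantee):

    def max_ngl(model_gb: float, n_layers: int, vram_gb: float) -> int:
        """Rough estimate of how many layers fit in the given VRAM."""
        layer_gb = model_gb / n_layers  # assume roughly uniform layer size
        return min(n_layers, int(vram_gb // layer_gb))

    # Llama 3.1 8B Q4: ~5 GB over 33 layers, 4 GB of VRAM available
    print(max_ngl(5.0, 33, 4.0))  # -> 26, i.e. -ngl 26
    # Tip: leave some headroom for the KV cache (see common pitfalls below)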

Analogy

A restaurant kitchen. All cooks in the kitchen (GPU) → max speed. One cook in a side building (CPU) → every dish must travel through there for that step → the line jams. n_gpu_layers controls "how many cooks are in the main kitchen." Full or empty — anything in between is slow but possible.

Real-world example

You have an RTX 4080 with 16 GB VRAM. You want to run Llama 3.1 70B Q4 (~40 GB on disk, doesn't fit).

-ngl 20: 20 of 80 layers on GPU (~10 GB), 60 on CPU. Speed: ~3 tok/s. Slow but it runs.

-ngl 0: all CPU. Speed: ~0.8 tok/s. Works if you have 64 GB RAM.

Drop to a smaller model: 8B Q4 (~5 GB), -ngl 99 full GPU → ~140 tok/s. Same hardware, 47× speed difference.

Ollama usually attempts full offload (-ngl 99) automatically; its num_gpu option overrides the layer count. LM Studio's GUI exposes the same control as a "GPU Offload" slider.
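
Through Ollama's HTTP API the same knob is the per-request num_gpu option; a sketch, assuming a local server on Ollama's default port:

    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.1:8b",
            "prompt": "Why is the sky blue?",
            "stream": False,
            "options": {"num_gpu": 20},  # Ollama's name for the GPU layer count
        },
    )
    print(resp.json()["response"])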

When to use
  • VRAM is tight but you can't swap models — hybrid offload
  • Sharing one GPU across multiple models — budget each one
  • Quantization wasn't enough, still need some offload
  • Optimizing mixed CPU + GPU hardware
When not to use
  • Whole model fits on GPU — -ngl 99 and move on
  • High-throughput production — hybrid always tanks throughput
  • Tiny model (1-3B) — overhead not worth it
Common pitfalls

Forgetting to set ngl

If a GPU is present but -ngl 0 is set, you're running 100% on CPU. "Why does my M2 Max only get 5 tok/s?" usually comes down to this. Defaults vary between tools, so set it explicitly.

Cache isn't offloaded

n_gpu_layers offloads the weights; the KV cache needs VRAM on top of that. At 32K context, an 8B model's cache is already ~4 GB. Budget what must fit on the GPU as ngl × layer_size + cache.
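
For the cache side of that budget, the standard sizing formula is 2 × layers × kv_heads × head_dim × context × bytes per element. A sketch with Llama 3.1 8B's shape (32 layers, 8 KV heads, head dim 128) and an f16 cache:

    def kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128,
                    ctx=32768, bytes_per_elem=2):
        # factor 2 = keys and values; f16 = 2 bytes per element
        return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

    print(kv_cache_gb())  # -> 4.0 GB at 32K context, matching the figure above
    # VRAM budget: ngl * layer_size + kv_cache (+ some runtime overhead)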

Hybrid performance illusion

"50% GPU offload" ≠ 50% speed; it's usually much worse, because each token incurs CPU↔GPU transfer latency. Benchmark it: find out what tok/s each ngl value actually gives on your specific setup.
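
A crude benchmark loop along those lines, again with llama-cpp-python (model path and candidate ngl values are placeholders):

    import time
    from llama_cpp import Llama

    MODEL = "llama-3.1-8b-instruct.Q4_K_M.gguf"  # placeholder

    for ngl in (0, 13, 26, 33):  # candidate offload counts for a 33-layer model
        llm = Llama(model_path=MODEL, n_gpu_layers=ngl, verbose=False)
        t0 = time.time()
        out = llm("Count to twenty:", max_tokens=128)
        n_tokens = out["usage"]["completion_tokens"]
        print(f"-ngl {ngl}: {n_tokens / (time.time() - t0):.1f} tok/s")
        del llm  # release the model (and its VRAM) before the next configuration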