Batch Size
Sequences processed in parallel
How many sequences are processed in parallel. Dramatically boosts throughput at the cost of VRAM and latency.
GPUs are parallel matrix machines; using one for a single request leaves most of that capacity idle. With batch size N, the engine runs N different sequences through the same forward pass: every weight matrix multiplies N activations in one matmul, so total throughput jumps.
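A minimal sketch of why the extra sequences are nearly free: the same weight matrix multiplies one activation row or thirty-two in a single matmul call (the hidden size below is illustrative, not any particular model's):

```python
import torch

d_model = 4096                       # illustrative hidden size
W = torch.randn(d_model, d_model)    # one weight matrix of a layer

x_single = torch.randn(1, d_model)   # batch size 1: one sequence's activation
x_batch  = torch.randn(32, d_model)  # batch size 32: 32 sequences stacked

y_single = x_single @ W   # (1, 4096) @ (4096, 4096) -> (1, 4096)
y_batch  = x_batch @ W    # (32, 4096) @ (4096, 4096) -> (32, 4096)

# Same kernel, weights read once; the 32-row version does ~32x the useful work per pass.
print(y_single.shape, y_batch.shape)
```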
Training vs. inference:
- Training batch size: examples per gradient step. 32-512 is typical. Bigger = smoother gradients, more VRAM.
- Inference batch size: concurrent requests. Critical for throughput in production. vLLM defaults to 256.
Two flavors of inference batching:
- Static batching: wait for a group of requests, start them all together, hold the GPU until all finish; slow requests stall fast ones.
- Continuous batching: vLLM and other modern engines. As one request finishes, a new one slots in immediately; the GPU never idles and latency drops. Throughput goes up 5-10×.
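A toy scheduler comparison, not vLLM's actual implementation: it only counts decode steps, but it shows why refilling a freed slot immediately beats holding the group until its slowest request finishes.

```python
from collections import deque
import random

def static_batching(lengths, batch_size):
    # Fixed groups: each group holds the GPU until its longest request finishes.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching(lengths, batch_size):
    # Slot-based: the moment a request finishes, the next one takes its place.
    queue, slots, steps = deque(lengths), [], 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.popleft())
        steps += 1                                  # one decode step for every active slot
        slots = [s - 1 for s in slots if s > 1]     # drop requests that just finished
    return steps

random.seed(0)
lengths = [random.randint(10, 400) for _ in range(256)]   # decode lengths in tokens
print("static steps:    ", static_batching(lengths, 32))
print("continuous steps:", continuous_batching(lengths, 32))
```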
KV cache scales linearly with batch: N concurrent requests means N copies of the cache. Production deployments cap it via max_num_seqs (vLLM).
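A sketch of setting that cap with vLLM's offline Python API; the model name and values are illustrative, and defaults shift between versions:

```python
from vllm import LLM

# Cap concurrent sequences so the KV cache fits in VRAM.
# 64 is an illustrative value; vLLM's default max_num_seqs is 256.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_num_seqs=64,
    gpu_memory_utilization=0.90,   # fraction of VRAM the engine may claim for weights + cache
)
```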
A baker bakes 20 loaves at once instead of one. Each loaf doesn't bake 20× faster, but the oven turns out 20× more bread per cycle. Same with GPUs: one matmul can carry 20 sequences in parallel.
vLLM serving Llama 3.1 8B.
- Batch=1 (single user): ~95 tok/s.
- Batch=8: GPU starts to fill, ~620 tok/s total (~78 per user).
- Batch=32: ~2400 tok/s total (~75 per user), about 25× the single-stream rate.
- Batch=64: VRAM tops out; risk of OOM or swapping. Typical ceiling is 32-64.
Continuous batching swaps a finished request for a new one immediately, so no slot sits idle. Static batching would wait for all 32 to finish, letting long requests block fast ones (head-of-line blocking).
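A rough way to measure aggregate throughput yourself with vLLM's offline API; the model, prompt, and cap below are illustrative, and real numbers depend on hardware and engine version:

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_num_seqs=32)
params = SamplingParams(max_tokens=256, temperature=0.8)

# 128 prompts submitted at once; the engine's continuous batching schedules them.
prompts = ["Summarize the history of the transformer architecture."] * 128

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} tok/s aggregate")
```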
When to use it
- Production serving: throughput is non-negotiable
- Offline batch inference — process 1000 prompts at once
- Multi-user serving — vLLM, TGI, SGLang
- Embedding generation — embed many texts in one call (sketch below)
When to skip it
- Single-user local play (Ollama's default is single-stream, which is already fast enough)
- Latency-sensitive single request — batching adds wait
- Tight VRAM — bumping batch causes OOM
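For the embedding case, batching is usually a single argument. A minimal sketch with sentence-transformers, assuming the all-MiniLM-L6-v2 model; batch_size plays the same throughput-versus-VRAM role as in generation:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = [f"document {i}" for i in range(10_000)]
# encode() chunks the input internally; batch_size controls how many texts
# share each forward pass.
embeddings = model.encode(texts, batch_size=256, show_progress_bar=True)
print(embeddings.shape)   # (10000, 384) for this model
```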
Picking static batching
Old setups (a naive HF Transformers serving loop) use static batching, so the slowest request in each group blocks the rest. Use a modern engine (vLLM, TGI) with continuous batching instead.
Confusing the batch cap with real throughput
Setting batch=128 when only 8 users are active unlocks nothing: effective batch size equals actual concurrency. For capacity planning, look at real concurrent load, not the configured cap.
Skipping the VRAM math
Total VRAM ≈ model weights + batch × cache_per_seq; the weights load once, but the KV cache scales linearly with batch. 32 seqs × 4 GB cache = 128 GB of cache alone. PagedAttention trims the real footprint to roughly 75% of the naive figure by cutting fragmentation.
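A back-of-the-envelope calculator for that rule; the Llama 3.1 8B shape (32 layers, 8 KV heads via GQA, head dim 128) and the 32K context are assumptions, so plug in your own model's config:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # 2 tensors (K and V) per layer, one head_dim vector per KV head per token, fp16 = 2 bytes.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 1e9

# Llama-3.1-8B-shaped config, 32K context (assumed values).
per_seq = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)
weights_gb = 16                         # ~8B params in fp16; loaded once, not per sequence
batch = 32
total = weights_gb + batch * per_seq    # only the cache scales with batch
print(f"cache/seq: {per_seq:.1f} GB, total at batch={batch}: {total:.0f} GB")
```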