AI Dictionary

Batch Size

Sequences processed in parallel

How many sequences are processed in parallel. Dramatically boosts throughput at the cost of VRAM and latency.

[Figure] N sequences in parallel on one GPU: six concurrent requests fill a batch of 6 in one forward pass, for roughly 6× throughput. With vLLM continuous batching, a new request fills any vacant slot.
Definition

GPUs are parallel matrix machines; using one for a single request usually wastes capacity. With batch size N, the engine processes N different sequences in the same forward pass. The GPU does N matmuls in one shot, and total throughput jumps.
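A minimal NumPy sketch of the idea (toy sizes, chosen only for illustration): stacking N inputs turns N separate matmuls into one batched matmul with identical results.

```python
import numpy as np

# Toy illustration with hypothetical sizes: one shared weight matrix, 6 sequences.
d_model, batch = 64, 6
W = np.random.randn(d_model, d_model)          # shared model weights

# One sequence at a time: 6 separate matmuls.
xs = [np.random.randn(1, d_model) for _ in range(batch)]
singles = [x @ W for x in xs]

# Batched: stack the 6 inputs and do ONE matmul over the whole batch.
X = np.vstack(xs)                              # shape (6, d_model)
batched = X @ W                                # one pass, same math

# Same numbers either way; the hardware just does them all at once.
assert np.allclose(np.vstack(singles), batched)
```

On a GPU the batched form is far faster per sequence, because the cost of reading the weights is amortized across all N rows.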

Training vs. inference:
  • Training batch size: examples per gradient step. 32-512 is typical; bigger means smoother gradients but more VRAM.
  • Inference batch size: concurrent requests. Critical for throughput in production; vLLM defaults to 256.

Two flavors of inference batching:
  • Static batching: wait for a group of requests, start them all together, and hold the GPU until all finish; slow requests stall fast ones.
  • Continuous batching: used by vLLM and other modern engines. As one request finishes, a new one slots in immediately; the GPU never idles, latency drops, and throughput goes up 5-10×.
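The difference can be shown with a toy scheduler simulation (request lengths and slot count are hypothetical): static batching pays the max length of each group, while continuous batching refills vacant slots immediately.

```python
import heapq

# Toy simulation: 6 requests with different decode lengths,
# a GPU that runs up to 3 sequences at once, 1 time unit per step.
lengths = [10, 3, 8, 2, 9, 4]
slots = 3

def static_time(lengths, slots):
    """Static batching: fixed groups; each group holds the GPU
    until its LONGEST request finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])   # slow request stalls the group
    return total

def continuous_time(lengths, slots):
    """Continuous batching: a finished request's slot is refilled
    immediately by the next waiting request."""
    heap = list(lengths[:slots])             # finish times of active requests
    heapq.heapify(heap)
    for n in lengths[slots:]:
        start = heapq.heappop(heap)          # earliest vacant slot
        heapq.heappush(heap, start + n)
    return max(heap)

print(static_time(lengths, slots))      # 19: group maxes 10 + 9
print(continuous_time(lengths, slots))  # 14: no slot sits idle waiting
```

Even in this tiny example continuous batching finishes the same work in 14 time units instead of 19; the gap widens as request lengths get more uneven.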

The KV cache scales with batch size: N requests means N × the per-sequence cache. Production deployments cap concurrency via max_num_seqs (vLLM).
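A back-of-envelope sizing sketch, using hypothetical Llama-3.1-8B-like numbers (32 layers, 8 KV heads, head dim 128, fp16); the real figures depend on your model config and context length.

```python
def kv_cache_bytes_per_seq(n_layers=32, n_kv_heads=8, head_dim=128,
                           seq_len=4096, bytes_per_elem=2):  # fp16
    # 2 tensors (K and V) per layer, per token of context.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

per_seq_gb = kv_cache_bytes_per_seq() / 1e9
print(f"{per_seq_gb:.2f} GB per sequence")   # grows linearly with batch

# Cap concurrent sequences so the cache fits in leftover VRAM
# (hypothetical: 24 GB card, ~16 GB of weights).
free_vram_gb = 24 - 16
max_num_seqs = int(free_vram_gb // per_seq_gb)
print(max_num_seqs)
```

This is the arithmetic behind a max_num_seqs cap: pick the largest batch whose cache still fits in VRAM left over after the weights.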

Analogy

A baker bakes 20 loaves at once instead of one. Total time per loaf isn't 1/20 — but you produce 20× more bread per oven cycle. GPUs: one matmul can carry 20 sequences in parallel.

Real-world example

vLLM serving Llama 3.1 8B.

  • Batch=1 (single user): ~95 tok/s.
  • Batch=8: the GPU starts to fill; ~620 tok/s total (~78 per user).
  • Batch=32: ~2400 tok/s total (~75 per user), 25× the single-stream rate.
  • Batch=64: VRAM tops out; risk of OOM or swapping. The typical ceiling is 32-64.
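A quick sanity check of those scaling numbers (total vs. per-user tokens/sec, taken from the figures above):

```python
# batch -> total tok/s, from the example measurements above
measurements = {1: 95, 8: 620, 32: 2400}

for batch, total in measurements.items():
    per_user = total / batch
    speedup = total / measurements[1]
    print(f"batch={batch:>2}: {total} tok/s total, "
          f"{per_user:.0f}/user, {speedup:.1f}x single-stream")
```

Note the pattern: total throughput climbs ~25×, while per-user speed drops only slightly (95 to ~75 tok/s). That trade is why batching dominates production serving.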

Continuous batching swaps a finished request for a new one immediately, so no slot sits idle. Static batching would wait for all 32 to finish, letting long requests block short ones (head-of-line blocking).

When to use
  • Production serving — throughput is non-negotiable
  • Offline batch inference — process 1000 prompts at once
  • Multi-user serving — vLLM, TGI, SGLang
  • Embedding generation — embed many texts in one call
When not to use
  • Single-user local play (Ollama default is single-stream — already fast enough)
  • Latency-sensitive single request — batching adds wait
  • Tight VRAM — bumping batch causes OOM
Common pitfalls

Picking static batching

Older setups (naive HF Transformers serving) use static batching, so the slowest request blocks the rest. Use a modern engine (vLLM, TGI) with continuous batching.

Batch size = throughput limit illusion

Setting batch=128 with only 8 active users wastes the cap: effective batch equals active concurrency. For capacity planning, measure real concurrency, not the configured limit.
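The rule as a one-liner (numbers hypothetical):

```python
def effective_batch(max_num_seqs, active_requests):
    # The cap only binds when traffic exceeds it.
    return min(max_num_seqs, active_requests)

print(effective_batch(128, 8))    # light traffic: effective batch is 8
print(effective_batch(128, 300))  # heavy traffic: capped at 128
```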

Skipping the VRAM math

Total VRAM ≈ model weights + batch × cache_per_seq (the weights are loaded once, not per sequence; the cache scales linearly with batch). 32 seqs × 4 GB cache = 128 GB of cache alone. PagedAttention reduces this to roughly 75% by eliminating fragmentation.
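The budget check in code, using the hypothetical figures above (16 GB of weights, 4 GB of cache per sequence):

```python
model_gb = 16          # weights: loaded once, not per sequence
cache_per_seq_gb = 4   # KV cache per sequence at this context length
batch = 32

total_gb = model_gb + batch * cache_per_seq_gb
print(total_gb)        # 16 + 128 = 144 GB, far beyond a single GPU
```

Running this arithmetic before raising batch size is the cheapest OOM prevention there is.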