Inference
The model's runtime phase
The phase where a trained model is actually used to answer queries. Each inference call is far shorter and cheaper than training, but it happens on every single request.
AI models have two distinct life phases: training and inference. Training takes weeks, burns millions in GPU costs, happens once. Inference is measured in milliseconds, costs cents per request — but happens millions of times a day.
During inference, model weights are frozen (read-only). Your job: take input tokens, run a forward pass, produce output tokens. No backprop, no gradients.
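That loop can be sketched with a toy stand-in for the frozen weights. The token table and its probabilities below are invented for illustration; in a real model they come from a forward pass over billions of frozen parameters, but the control flow is the same: read-only weights in, next token out, repeat.

```python
# Toy "frozen model": a fixed, read-only table mapping a context to
# next-token probabilities. A real model computes these with a forward pass.
NEXT_TOKEN_PROBS = {
    ("What", "is", "the", "capital", "of", "France", "?"): {"Paris": 0.95, "Lyon": 0.05},
    ("What", "is", "the", "capital", "of", "France", "?", "Paris"): {"<eos>": 1.0},
}

def generate(prompt_tokens, max_new_tokens=8):
    """Greedy decoding: repeatedly pick the highest-probability next token.
    No backprop, no gradients -- the table (the weights) is never updated."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS.get(tuple(tokens), {"<eos>": 1.0})
        next_token = max(probs, key=probs.get)  # greedy = argmax over the distribution
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens[len(prompt_tokens):]

print(generate(["What", "is", "the", "capital", "of", "France", "?"]))  # ['Paris']
```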
"Deploying a model" really means "setting up inference infra": tools like vLLM, TGI, TensorRT-LLM load weights into GPUs, expose an HTTP API, and optimize latency and throughput.
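As a sketch of what that HTTP layer looks like: servers like vLLM can expose an OpenAI-compatible endpoint, so a client just builds a JSON payload and POSTs it. The URL, model name, and parameter values below are placeholders for a hypothetical local deployment, not a real service.

```python
import json

# Placeholder endpoint of a self-hosted, OpenAI-compatible inference server
# (e.g. one started by vLLM). Swap in your actual host and model name.
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "my-model",  # whatever name the server registered at startup
    "messages": [{"role": "user", "content": "What's the capital of France?"}],
    "max_tokens": 32,
    "temperature": 0.0,   # deterministic: always take the argmax token
}
body = json.dumps(payload)
# A client would now POST `body` to `url` with the header
# Content-Type: application/json and read the generated tokens from the response.
```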
Memorizing a math book (training) vs using it in an exam (inference). Memorizing takes months, done once. The exam is in seconds, repeatedly. During the exam you don't update the book — you just produce answers.
What happens when you ask OpenAI "what's the capital of France?": 1. Your question becomes ~5 tokens. 2. GPT-4's weights are already loaded on H100 GPUs. 3. Tokens go in; the model generates each next token by probability. 4. ~10 output tokens come out ("Paris is the capital of France."), takes ~800ms, costs ~$0.0001.
The same model serves tens of thousands of queries per second worldwide. Training GPT-4 cost $100M+; each query costs a fraction of a cent.
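Back-of-the-envelope arithmetic with the illustrative figures above (these are the document's ballpark numbers, not official pricing):

```python
# Ballpark economics, using the figures from the walkthrough above.
training_cost = 100_000_000   # ~$100M one-time training run
cost_per_query = 0.0001       # ~$0.0001 per inference query

# Queries needed before cumulative inference spend matches the training bill:
breakeven_queries = training_cost / cost_per_query
print(f"{breakeven_queries:,.0f}")  # 1,000,000,000,000

# At tens of thousands of queries per second, daily volume and spend:
qps = 20_000
queries_per_day = qps * 86_400
daily_inference_cost = queries_per_day * cost_per_query
print(f"{queries_per_day:,}")            # 1,728,000,000
print(f"${daily_inference_cost:,.0f}")   # $172,800
```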
- When pushing a model to production — inference infra choice (managed API vs self-host)
- Latency optimization — batching, KV cache, speculative decoding
- Cost planning — tokens per request × price
- Edge deployment — running the model on user devices (mobile, browser)
- Confusing it with training — different infrastructure, different problems
- If cold start is OK — a small model on CPU may be enough, no GPU needed
- One-off analysis — a Colab notebook is fine, no need for a service
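For the cost-planning bullet above, a minimal model of "tokens per request × price". The per-1K-token prices are placeholders you would swap for your provider's actual rates:

```python
# Assumed prices for illustration only -- substitute your provider's real rates.
PRICE_PER_1K_INPUT = 0.01    # $ per 1,000 input tokens
PRICE_PER_1K_OUTPUT = 0.03   # $ per 1,000 output tokens

def monthly_cost(input_tokens, output_tokens, requests_per_day, days=30):
    """Tokens per request x price, scaled to a month of traffic."""
    per_request = (input_tokens / 1000) * PRICE_PER_1K_INPUT \
                + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return per_request * requests_per_day * days

# e.g. 500 input + 200 output tokens per request, at 10k requests/day:
print(f"${monthly_cost(500, 200, 10_000):,.2f}/month")  # $3,300.00/month
```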
GPU memory (VRAM) ceiling
Running a 70B model in FP16 needs ~140GB of VRAM. Quantization (INT8, INT4) cuts this 2-4× but at some accuracy loss. Plan hardware up front.
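The ~140GB figure falls straight out of parameter count × bytes per parameter. A quick sketch, counting weights only (activations and KV cache need extra headroom on top):

```python
def weight_vram_gb(n_params_billion, bits_per_param):
    """VRAM needed for model weights alone: params x bits / 8, in GB."""
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

print(weight_vram_gb(70, 16))  # FP16: 140.0 GB
print(weight_vram_gb(70, 8))   # INT8:  70.0 GB (2x smaller)
print(weight_vram_gb(70, 4))   # INT4:  35.0 GB (4x smaller)
```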
Measuring latency by TTFT alone
Time-To-First-Token matters but isn't the only metric. Tokens/sec throughput, p99 latency, and prompt-cache hit rate matter too.
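A sketch of how those extra metrics fall out of request logs. The latency samples are invented, and the percentile helper uses a simple nearest-rank rule (production monitoring stacks use their own interpolation):

```python
def percentile(values, p):
    """Nearest-rank percentile over a sorted copy of the samples."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Fake per-request latencies: mostly fast, one slow straggler.
latencies_ms = [120, 130, 110, 900, 125, 140, 115, 135, 128, 122]
print(percentile(latencies_ms, 50))  # median looks healthy
print(percentile(latencies_ms, 99))  # 900 -- the straggler dominates p99

# Throughput is just tokens generated over wall-clock time:
tokens_generated = 480
wall_seconds = 2.0
print(tokens_generated / wall_seconds)  # 240.0 tokens/sec
```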
Skipping batching
Running one request at a time wastes the GPU. Continuous batching with 8-32 simultaneous requests boosts throughput 5-10×.
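A toy model of why batching pays off, under one stated assumption: at small batch sizes, a decode step is memory-bandwidth-bound, so advancing 16 sequences takes roughly the same wall time as advancing one. The 30 ms step time is invented; only the scaling argument matters.

```python
# Assumption: one GPU decode step costs ~the same whether it advances
# 1 sequence or 16, because small batches are memory-bandwidth-bound.
step_ms = 30.0                # assumed wall time per decode step
tokens_per_step_solo = 1      # one request at a time: 1 token per step
tokens_per_step_batched = 16  # 16 concurrent requests: 16 tokens per step

solo_tps = tokens_per_step_solo / (step_ms / 1000)
batched_tps = tokens_per_step_batched / (step_ms / 1000)
print(solo_tps, batched_tps, batched_tps / solo_tps)  # ~16x in this idealized model
```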