AI Dictionary
Intermediate · ~2 min read · #inference #deployment #runtime

Inference

The model's runtime phase

The phase where a trained model is actually used to answer queries. Much shorter and cheaper than training — but it happens on every single request.

TRAINING ↔ INFERENCE
  • Training: weeks; millions of $ in GPU; weights are updated; done once
  • Inference: milliseconds; ~$0.001 per request; weights are frozen; runs millions of times
Expensive to make, cheap to run — that's a model.
Definition

AI models have two distinct life phases: training and inference. Training takes weeks, burns millions in GPU costs, and happens once. Inference is measured in milliseconds and costs a fraction of a cent per request — but it happens millions of times a day.

During inference, model weights are frozen (read-only). Your job: take input tokens, run a forward pass, produce output tokens. No backprop, no gradients.
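A minimal sketch of that forward pass, assuming PyTorch and Hugging Face transformers, with GPT-2 standing in for a production model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference mode: disables dropout and other training-only behavior

inputs = tokenizer("The capital of France is", return_tensors="pt")

# No gradient tracking: weights stay read-only, no backprop graph is built.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=10)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```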

"Deploying a model" really means "setting up inference infra": tools like vLLM, TGI, TensorRT-LLM load weights into GPUs, expose an HTTP API, and optimize latency and throughput.

Analogy

Memorizing a math book (training) vs using it in an exam (inference). Memorizing takes months, done once. The exam is in seconds, repeatedly. During the exam you don't update the book — you just produce answers.

Real-world example

What happens when you ask OpenAI "what's the capital of France?":
  1. Your question becomes ~5 tokens.
  2. GPT-4's weights are already loaded on H100 GPUs.
  3. Tokens go in; the model generates each next token by probability.
  4. ~10 output tokens come out ("Paris is the capital of France."), taking ~800ms and costing ~$0.0001.

The same model serves tens of thousands of queries per second worldwide. Training GPT-4 cost $100M+; each query costs a fraction of a cent.
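You can sanity-check the token counts in step 1 with OpenAI's tiktoken library (a rough check, not the production pipeline; exact counts vary by tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family of GPT-4-era models
question = "What's the capital of France?"
answer = "Paris is the capital of France."

print(len(enc.encode(question)), "input tokens")   # high single digits
print(len(enc.encode(answer)), "output tokens")    # high single digits
```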

When to use
  • When pushing a model to production — inference infra choice (managed API vs self-host)
  • Latency optimization — batching, KV cache, speculative decoding
  • Cost planning — tokens per request × price (see the sketch after this list)
  • Edge deployment — running the model on user devices (mobile, browser)
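The cost-planning bullet is plain arithmetic. A back-of-the-envelope sketch with placeholder prices (check your provider's current rates):

```python
# Placeholder prices, not current rates.
PRICE_PER_1M_INPUT = 2.50    # $ per 1M input tokens (assumed)
PRICE_PER_1M_OUTPUT = 10.00  # $ per 1M output tokens (assumed)

def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_1M_INPUT
            + output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000

# 500-token prompt, 200-token answer, 1M requests/day:
per_request = cost_per_request(500, 200)
print(f"${per_request:.5f} per request, ${per_request * 1_000_000:,.0f} per day")
```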
When not to use
  • Confusing it with training — different infrastructure, different problems
  • If a cold start is acceptable — a small model on CPU may be enough, no GPU needed
  • One-off analysis — a Colab notebook is fine, no need for a service
Common pitfalls

GPU memory (VRAM) ceiling

Running a 70B model in FP16 needs ~140GB of VRAM (70B parameters × 2 bytes each). Quantization (INT8, INT4) cuts this 2-4× but at some accuracy loss. Plan hardware up front.
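The arithmetic is worth keeping at hand. A sketch for the weights alone (KV cache and activations need extra headroom on top):

```python
# Rough VRAM for the weights: GB ≈ billions of params × bytes per param.
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param

for fmt, bpp in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"70B in {fmt}: ~{weight_vram_gb(70, bpp):.0f} GB")
```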

Measuring latency by TTFT alone

Time-To-First-Token matters but isn't the only metric. Tokens/sec throughput, p99 latency, and prompt-cache hit rate matter too.
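A sketch of summarizing those metrics from per-request logs (the samples below are made up; wire this to your own serving logs):

```python
import statistics

# Made-up per-request samples.
ttft_ms = [120, 95, 110, 480, 105, 98, 1200, 102, 97, 115]
tokens_per_s = [42.0, 39.5, 44.1, 18.2, 41.0, 40.3, 9.7, 43.2, 41.8, 40.9]

p99_ttft = statistics.quantiles(ttft_ms, n=100)[98]  # 99th percentile
print(f"median TTFT: {statistics.median(ttft_ms):.0f} ms")
print(f"p99 TTFT: ~{p99_ttft:.0f} ms")
print(f"mean decode throughput: {statistics.fmean(tokens_per_s):.1f} tok/s")
```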

Skipping batching

Running one request at a time wastes the GPU. Continuous batching with 8-32 simultaneous requests boosts throughput 5-10×.
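Engines like vLLM give you continuous batching for free: hand over many requests and the scheduler interleaves them on the GPU. A sketch with the offline API (model id is an example):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64)

prompts = [f"Summarize document {i} in one sentence." for i in range(32)]

# One call instead of a 32-iteration loop: vLLM batches and interleaves
# these sequences on the GPU, which is where the 5-10x throughput comes from.
outputs = llm.generate(prompts, params)
```

The `vllm serve` HTTP endpoint does the same thing automatically across concurrent clients.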