Inference
The model's runtime phase
The phase where a trained model is actually used to answer queries. Each inference call is far shorter and cheaper than training, but it happens on every single request.
AI models have two distinct life phases: training and inference. Training takes weeks, burns millions in GPU costs, happens once. Inference is measured in milliseconds, costs cents per request — but happens millions of times a day.
During inference, model weights are frozen (read-only). Your job: take input tokens, run a forward pass, produce output tokens. No backprop, no gradients.
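That loop can be sketched with a toy stand-in for the frozen weights. The token table and its probabilities below are invented for illustration; in a real model they come from a forward pass over billions of frozen parameters, but the control flow is the same: read-only weights in, next token out, repeat.

```python
# Toy "frozen model": a fixed, read-only table mapping a context to
# next-token probabilities. A real model computes these with a forward pass.
NEXT_TOKEN_PROBS = {
    ("What", "is", "the", "capital", "of", "France", "?"): {"Paris": 0.95, "Lyon": 0.05},
    ("What", "is", "the", "capital", "of", "France", "?", "Paris"): {"<eos>": 1.0},
}

def generate(prompt_tokens, max_new_tokens=8):
    """Greedy decoding: repeatedly pick the highest-probability next token.
    No backprop, no gradients -- the table (the weights) is never updated."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS.get(tuple(tokens), {"<eos>": 1.0})
        next_token = max(probs, key=probs.get)  # greedy = argmax over the distribution
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens[len(prompt_tokens):]

print(generate(["What", "is", "the", "capital", "of", "France", "?"]))  # ['Paris']
```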
"Deploying a model" really means "setting up inference infra": tools like vLLM, TGI, TensorRT-LLM load weights into GPUs, expose an HTTP API, and optimize latency and throughput.
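As a sketch of what that HTTP layer looks like: servers like vLLM can expose an OpenAI-compatible endpoint, so a client just builds a JSON payload and POSTs it. The URL, model name, and parameter values below are placeholders for a hypothetical local deployment, not a real service.

```python
import json

# Placeholder endpoint of a self-hosted, OpenAI-compatible inference server
# (e.g. one started by vLLM). Swap in your actual host and model name.
url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "my-model",  # whatever name the server registered at startup
    "messages": [{"role": "user", "content": "What's the capital of France?"}],
    "max_tokens": 32,
    "temperature": 0.0,   # deterministic: always take the argmax token
}
body = json.dumps(payload)
# A client would now POST `body` to `url` with the header
# Content-Type: application/json and read the generated tokens from the response.
```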
Memorizing a math book (training) vs using it in an exam (inference). Memorizing takes months, done once. The exam is in seconds, repeatedly. During the exam you don't update the book — you just produce answers.
What happens when you ask OpenAI "what's the capital of France?": 1. Your question becomes ~5 tokens. 2. GPT-4's weights are already loaded on H100 GPUs. 3. Tokens go in; the model generates each next token by probability. 4. ~10 output tokens come out ("Paris is the capital of France."), takes ~800ms, costs ~$0.0001.
The same model serves tens of thousands of queries per second worldwide. Training GPT-4 cost $100M+; each query costs a fraction of a cent.
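Back-of-the-envelope arithmetic with the illustrative figures above (these are the document's ballpark numbers, not official pricing):

```python
# Ballpark economics, using the figures from the walkthrough above.
training_cost = 100_000_000   # ~$100M one-time training run
cost_per_query = 0.0001       # ~$0.0001 per inference query

# Queries needed before cumulative inference spend matches the training bill:
breakeven_queries = training_cost / cost_per_query
print(f"{breakeven_queries:,.0f}")  # 1,000,000,000,000

# At tens of thousands of queries per second, daily volume and spend:
qps = 20_000
queries_per_day = qps * 86_400
daily_inference_cost = queries_per_day * cost_per_query
print(f"{queries_per_day:,}")            # 1,728,000,000
print(f"${daily_inference_cost:,.0f}")   # $172,800
```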
- When pushing a model to production — inference infra choice (managed API vs self-host)
- Latency optimization — batching, KV cache, speculative decoding
- Cost planning — tokens per request × price
- Edge deployment — running the model on user devices (mobile, browser)
- Confusing it with training — different infrastructure, different problems
- If cold start is OK — a small model on CPU may be enough, no GPU needed
- One-off analysis — a Colab notebook is fine, no need for a service
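For the cost-planning bullet above, a minimal model of "tokens per request × price". The per-1K-token prices are placeholders you would swap for your provider's actual rates:

```python
# Assumed prices for illustration only -- substitute your provider's real rates.
PRICE_PER_1K_INPUT = 0.01    # $ per 1,000 input tokens
PRICE_PER_1K_OUTPUT = 0.03   # $ per 1,000 output tokens

def monthly_cost(input_tokens, output_tokens, requests_per_day, days=30):
    """Tokens per request x price, scaled to a month of traffic."""
    per_request = (input_tokens / 1000) * PRICE_PER_1K_INPUT \
                + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return per_request * requests_per_day * days

# e.g. 500 input + 200 output tokens per request, at 10k requests/day:
print(f"${monthly_cost(500, 200, 10_000):,.2f}/month")  # $3,300.00/month
```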
GPU memory (VRAM) ceiling
Running a 70B model in FP16 needs ~140GB of VRAM. Quantization (INT8, INT4) cuts this 2-4× but at some accuracy loss. Plan hardware up front.
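The ~140GB figure falls straight out of parameter count × bytes per parameter. A quick sketch, counting weights only (activations and KV cache need extra headroom on top):

```python
def weight_vram_gb(n_params_billion, bits_per_param):
    """VRAM needed for model weights alone: params x bits / 8, in GB."""
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

print(weight_vram_gb(70, 16))  # FP16: 140.0 GB
print(weight_vram_gb(70, 8))   # INT8:  70.0 GB (2x smaller)
print(weight_vram_gb(70, 4))   # INT4:  35.0 GB (4x smaller)
```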
Measuring latency by TTFT alone
Time-To-First-Token matters but isn't the only metric. Tokens/sec throughput, p99 latency, and prompt-cache hit rate matter too.
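A sketch of how those extra metrics fall out of request logs. The latency samples are invented, and the percentile helper uses a simple nearest-rank rule (production monitoring stacks use their own interpolation):

```python
def percentile(values, p):
    """Nearest-rank percentile over a sorted copy of the samples."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Fake per-request latencies: mostly fast, one slow straggler.
latencies_ms = [120, 130, 110, 900, 125, 140, 115, 135, 128, 122]
print(percentile(latencies_ms, 50))  # median looks healthy
print(percentile(latencies_ms, 99))  # 900 -- the straggler dominates p99

# Throughput is just tokens generated over wall-clock time:
tokens_generated = 480
wall_seconds = 2.0
print(tokens_generated / wall_seconds)  # 240.0 tokens/sec
```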
Skipping batching
Running one request at a time wastes the GPU. Continuous batching with 8-32 simultaneous requests boosts throughput 5-10×.
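A toy model of why batching pays off, under one stated assumption: at small batch sizes, a decode step is memory-bandwidth-bound, so advancing 16 sequences takes roughly the same wall time as advancing one. The 30 ms step time is invented; only the scaling argument matters.

```python
# Assumption: one GPU decode step costs ~the same whether it advances
# 1 sequence or 16, because small batches are memory-bandwidth-bound.
step_ms = 30.0                # assumed wall time per decode step
tokens_per_step_solo = 1      # one request at a time: 1 token per step
tokens_per_step_batched = 16  # 16 concurrent requests: 16 tokens per step

solo_tps = tokens_per_step_solo / (step_ms / 1000)
batched_tps = tokens_per_step_batched / (step_ms / 1000)
print(solo_tps, batched_tps, batched_tps / solo_tps)  # ~16x in this idealized model
```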