Streaming
Token-by-token output
The LLM streams its response as tokens are generated rather than waiting for the full answer. Users see the first words almost instantly.
An LLM produces tokens one at a time. In a classic HTTP response, the full answer is sent only after every token has been generated (a ~1-30s wait). With streaming, each token is shipped to the user as it's produced — the UI types it out token by token.
Stack: Server-Sent Events (SSE) is the standard. The OpenAI, Anthropic, and Google APIs all return SSE chunks when you pass `stream: true`. Each chunk holds one or a few tokens.
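On the wire, a stream is a series of `data:` events separated by blank lines. A simplified OpenAI-style exchange looks like this (field names vary by provider; the stream ends with a `[DONE]` sentinel):

```
data: {"choices":[{"delta":{"content":"Hel"}}]}

data: {"choices":[{"delta":{"content":"lo"}}]}

data: [DONE]
```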
Benefit: perceived latency drops dramatically. Time-To-First-Token (TTFT) is usually 200-500ms, while the full answer can take 5-10s. With streaming, the user sees the model working within ~200ms — no blank-screen wait.
It's like waiting in line at a butcher shop. Classic: the butcher prepares every sausage in the back and hands you the whole package at the end. Streaming: each sausage hits the counter as it's ready — you start eating the first one while the last is still being made.
You ask ChatGPT for a long analysis. Without streaming: 8 seconds of blank screen, then 300 words drop at once — the user wonders whether it froze. With streaming: first token in ~200ms, then a steady 50-80 tokens/sec. The user starts reading immediately and never feels the wait.
Minimal OpenAI Python SDK example:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain streaming briefly."}],  # example prompt
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small delta; content is None on role-only chunks
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
```
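To check the TTFT numbers above against your own setup, here is a small sketch (same SDK as above; the prompt and the 300-word task are illustrative) that timestamps the first non-empty delta:

```python
import time

from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a 300-word analysis of streaming."}],
    stream=True,
)

ttft = None
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    if ttft is None and delta:
        ttft = time.perf_counter() - start  # time to first visible token
    print(delta, end="", flush=True)

total = time.perf_counter() - start
print(f"\nTTFT: {ttft:.2f}s, total: {total:.2f}s")
```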
On the frontend, consume the stream with fetch + the ReadableStream API, or with an SSE EventSource (note that EventSource only supports GET, so POST chat endpoints typically use fetch).
When to stream:
- Chat interfaces — reduces perceived wait time
- Endpoints producing long answers (summaries, essays, code)
- Reasoning models — long thinking times make for terrible UX without streaming
- Mobile UIs — keeps users on slow networks engaged
When to skip streaming:
- Structured JSON output — you can't naively parse partial JSON
- Function-calling responses — you need the complete tool-call arguments before you can act
- Very short answers — the overhead isn't worth it
- Backend-to-backend integrations — no human watching, so streaming buys nothing
Error handling gets complex
What happens when a stream cuts out midway? Do you save the partial response? Retry from scratch? There are far more edge cases than with classic request/response.
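One workable policy, as a hedged sketch (the retry-from-scratch strategy and the `stream_with_retry` helper are illustrative assumptions, not a library feature): accumulate deltas as they arrive, and on a mid-stream failure either retry or surface the partial text.

```python
from openai import OpenAI

client = OpenAI()

def stream_with_retry(messages, max_retries=1):
    """Collect a streamed answer; on a mid-stream error, retry from scratch."""
    for attempt in range(max_retries + 1):
        parts = []
        try:
            stream = client.chat.completions.create(
                model="gpt-4o", messages=messages, stream=True
            )
            for chunk in stream:
                parts.append(chunk.choices[0].delta.content or "")
            return "".join(parts)  # stream completed cleanly
        except Exception:
            # The connection dropped mid-stream. Decide: discard the partial
            # text and retry, or surface what arrived so far.
            if attempt == max_retries:
                return "".join(parts)  # out of retries: return the partial text
    return ""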
Cancellation and billing
You can cancel a stream client-side, but the backend may already have generated, and billed, those tokens by the time the cancellation lands. Use AbortController correctly, and make sure your server closes the upstream connection so generation actually stops.
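AbortController lives in the browser; on a Python backend the analogue is to stop iterating and close the stream. A sketch, where `user_clicked_stop` is a hypothetical hook for your real cancellation signal (recent openai-python versions expose `close()` on the stream object):

```python
from openai import OpenAI

client = OpenAI()

def user_clicked_stop() -> bool:
    return False  # placeholder: wire this to your real cancellation signal

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain SSE at length."}],
    stream=True,
)

received = []
for chunk in stream:
    received.append(chunk.choices[0].delta.content or "")
    if user_clicked_stop():
        stream.close()  # drop the connection; tokens generated so far are still billed
        break

print("".join(received))
```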
Don't try to parse JSON mid-stream
Stream + JSON works cleanly only with JSONL (line-delimited JSON): each completed line is a valid object you can parse on arrival. A single big JSON object can't be parsed until it's complete, unless you bring in a dedicated incremental JSON parser.
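A minimal sketch of the JSONL pattern, assuming the model has been prompted to emit one JSON object per line: buffer incoming fragments and parse a line only once its newline has arrived.

```python
import json

def iter_jsonl(deltas):
    """Yield parsed objects from a stream of text fragments, one JSON object per line."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)  # a full line is a complete JSON object
    if buffer.strip():
        yield json.loads(buffer)  # trailing object without a final newline

# Usage with simulated stream fragments:
fragments = ['{"item": 1}\n{"it', 'em": 2}\n', '{"item": 3}']
for obj in iter_jsonl(fragments):
    print(obj)
```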