Streaming
Token-by-token output
The LLM streams its response as tokens are generated rather than waiting for the full answer. Users see the first words almost instantly.
An LLM produces tokens one at a time. In a classic HTTP response, the full answer is sent only after every token has been generated (a ~1-30s wait). With streaming, each token is shipped to the user as it's produced — the UI types it out token by token.
Stack: Server-Sent Events (SSE) is the standard. The OpenAI, Anthropic, and Google APIs all return SSE chunks when you pass `stream: true`. Each chunk holds one or a few tokens.
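On the wire, a stream is a series of `data:` events separated by blank lines. A simplified OpenAI-style exchange looks like this (field names vary by provider; the stream ends with a `[DONE]` sentinel):

```
data: {"choices":[{"delta":{"content":"Hel"}}]}

data: {"choices":[{"delta":{"content":"lo"}}]}

data: [DONE]
```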
Benefit: perceived latency drops dramatically. Time-To-First-Token (TTFT) is usually 200-500ms, while the full answer can take 5-10s. With streaming, the user sees the model working within ~200ms — no blank-screen wait.
It's like waiting in line at a butcher shop. Classic: the butcher prepares every sausage in the back and hands you the whole package at the end. Streaming: each sausage hits the counter as it's ready — you start eating the first one while the last is still being made.
You ask ChatGPT for a long analysis. Without streaming: 8 seconds of blank screen, then 300 words drop at once — the user wonders whether it froze. With streaming: first token in ~200ms, then a steady 50-80 tokens/sec. The user starts reading immediately and never feels the wait.
Minimal OpenAI Python SDK example:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain streaming briefly."}],  # example prompt
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small delta; content is None on role-only chunks
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
```
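To check the TTFT numbers above against your own setup, here is a small sketch (same SDK as above; the prompt and the 300-word task are illustrative) that timestamps the first non-empty delta:

```python
import time

from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a 300-word analysis of streaming."}],
    stream=True,
)

ttft = None
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    if ttft is None and delta:
        ttft = time.perf_counter() - start  # time to first visible token
    print(delta, end="", flush=True)

total = time.perf_counter() - start
print(f"\nTTFT: {ttft:.2f}s, total: {total:.2f}s")
```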
On the frontend, consume the stream with fetch + the ReadableStream API, or with an SSE EventSource (note that EventSource only supports GET, so POST chat endpoints typically use fetch).
When to stream:
- Chat interfaces — reduces perceived wait time
- Endpoints producing long answers (summaries, essays, code)
- Reasoning models — long thinking times make for terrible UX without streaming
- Mobile UIs — keeps users on slow networks engaged
When to skip streaming:
- Structured JSON output — you can't naively parse partial JSON
- Function-calling responses — you need the complete tool-call arguments before you can act
- Very short answers — the overhead isn't worth it
- Backend-to-backend integrations — no human watching, so streaming buys nothing
Error handling gets complex
What happens when a stream cuts out midway? Do you save the partial response? Retry from scratch? There are far more edge cases than with classic request/response.
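One workable policy, as a hedged sketch (the retry-from-scratch strategy and the `stream_with_retry` helper are illustrative assumptions, not a library feature): accumulate deltas as they arrive, and on a mid-stream failure either retry or surface the partial text.

```python
from openai import OpenAI

client = OpenAI()

def stream_with_retry(messages, max_retries=1):
    """Collect a streamed answer; on a mid-stream error, retry from scratch."""
    for attempt in range(max_retries + 1):
        parts = []
        try:
            stream = client.chat.completions.create(
                model="gpt-4o", messages=messages, stream=True
            )
            for chunk in stream:
                parts.append(chunk.choices[0].delta.content or "")
            return "".join(parts)  # stream completed cleanly
        except Exception:
            # The connection dropped mid-stream. Decide: discard the partial
            # text and retry, or surface what arrived so far.
            if attempt == max_retries:
                return "".join(parts)  # out of retries: return the partial text
    return ""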
Cancellation and billing
You can cancel a stream client-side, but the backend may already have generated, and billed, those tokens by the time the cancellation lands. Use AbortController correctly, and make sure your server closes the upstream connection so generation actually stops.
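AbortController lives in the browser; on a Python backend the analogue is to stop iterating and close the stream. A sketch, where `user_clicked_stop` is a hypothetical hook for your real cancellation signal (recent openai-python versions expose `close()` on the stream object):

```python
from openai import OpenAI

client = OpenAI()

def user_clicked_stop() -> bool:
    return False  # placeholder: wire this to your real cancellation signal

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain SSE at length."}],
    stream=True,
)

received = []
for chunk in stream:
    received.append(chunk.choices[0].delta.content or "")
    if user_clicked_stop():
        stream.close()  # drop the connection; tokens generated so far are still billed
        break

print("".join(received))
```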
Don't try to parse JSON mid-stream
Stream + JSON works cleanly only with JSONL (line-delimited JSON): each completed line is a valid object you can parse on arrival. A single big JSON object can't be parsed until it's complete, unless you bring in a dedicated incremental JSON parser.
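A minimal sketch of the JSONL pattern, assuming the model has been prompted to emit one JSON object per line: buffer incoming fragments and parse a line only once its newline has arrived.

```python
import json

def iter_jsonl(deltas):
    """Yield parsed objects from a stream of text fragments, one JSON object per line."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)  # a full line is a complete JSON object
    if buffer.strip():
        yield json.loads(buffer)  # trailing object without a final newline

# Usage with simulated stream fragments:
fragments = ['{"item": 1}\n{"it', 'em": 2}\n', '{"item": 3}']
for obj in iter_jsonl(fragments):
    print(obj)
```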