AI Dictionary
Beginner · ~2 min read · #streaming #sse #ux

Streaming

Token-by-token output

The LLM streams its response as tokens are generated rather than waiting for the full answer. Users see the first words almost instantly.

[Interactive demo: the response streams token by token. 0.2s: "Yapay" · 0.4s: "Yapay zeka" · 0.6s: "Yapay zeka geleceği" · 0.9s: "Yapay zeka geleceği şekillendiriyor." (Turkish: "AI is shaping the future.") The user doesn't wait for the whole answer; the UI updates as each token arrives.]
Definition

An LLM produces tokens one by one. In a classic HTTP response, the full answer is sent only after every token has been generated (a wait of roughly 1-30 s). With streaming, each token is shipped to the user as it's produced, and the UI types it out token by token.

Stack: Server-Sent Events (SSE) is the standard. OpenAI, Anthropic, Google APIs all return SSE chunks when you pass stream: true. Each chunk holds one or a few tokens.

Benefit: perceived latency drops dramatically. Time-to-first-token (TTFT) is usually 200-500 ms, while the full answer can take 5-10 s. With streaming, the user sees the model working within a few hundred milliseconds instead of staring at a blank screen.
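The TTFT-versus-total-time gap is easy to see with a toy generator standing in for the model (a sketch: the per-token delay here is invented for illustration, not a real model's speed):

```python
import time

def fake_llm_stream(tokens, delay_per_token=0.02):
    """Toy generator standing in for an LLM: yields one token at a time."""
    for tok in tokens:
        time.sleep(delay_per_token)  # simulated generation time per token
        yield tok

tokens = "Streaming cuts perceived latency dramatically".split()

start = time.monotonic()
ttft = None
for tok in fake_llm_stream(tokens):
    if ttft is None:
        ttft = time.monotonic() - start  # time-to-first-token
    print(tok, end=" ", flush=True)
total = time.monotonic() - start

print(f"\nTTFT: {ttft:.3f}s, total: {total:.3f}s")
```

Without streaming, the user waits `total`; with streaming, the perceived wait is only `ttft`.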

Analogy

Like waiting in line at a butcher. Classic: the butcher hides every sausage in the back and hands you everything packaged at the end. Streaming: each sausage hits the counter as it's ready — you start eating the first one while the last is being made.

Real-world example

You ask ChatGPT for a long analysis. Without streaming: 8 seconds of blank screen, then 300 words drop at once. The user thinks "is it frozen?" With streaming: 200ms first token, 50-80 tokens/sec flow. The user starts reading immediately, doesn't feel the wait.

Minimal OpenAI Python SDK example:

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a small delta of the answer; print it immediately.
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
```

On the frontend, use fetch with the ReadableStream API, or EventSource for SSE.
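Under the hood, those SSE chunks are just text lines prefixed with `data:`. A minimal parser sketch in Python (the wire excerpt below is made up, modeled on the OpenAI-style chunk shape; real SSE also has `event:`/`id:` fields and multi-line data, which this skips):

```python
import json

def parse_sse(raw: str):
    """Yield the JSON payload of each `data:` line in an SSE body."""
    for line in raw.splitlines():
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # OpenAI-style end-of-stream sentinel
            return
        yield json.loads(payload)

# Illustrative wire excerpt in the shape OpenAI-style APIs use:
raw = (
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n'
    'data: {"choices":[{"delta":{"content":"lo"}}]}\n'
    'data: [DONE]\n'
)
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse(raw))
print(text)  # Hello
```

In practice the SDKs and EventSource do this parsing for you; the point is that the format is simple, line-oriented text.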

When to use
  • Chat interfaces — reduces perceived wait time
  • Endpoints producing long answers (summaries, essays, code)
  • Reasoning models — long thinking time, terrible UX without streaming
  • Mobile UI — keeps users on slow networks
When not to use
  • When structured JSON output is required (can't parse partial JSON)
  • Function-calling responses — need an atomic response
  • Very short answers — overhead not worth it
  • Backend-to-backend integrations — no user, streaming is pointless
Common pitfalls

Error handling gets complex

What happens when a stream cuts out midway? Do you save the partial response, or retry from scratch? There are many more edge cases than in classic request/response.
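One defensive pattern is to accumulate tokens as they arrive so a mid-stream failure doesn't lose everything (a sketch; `consume_stream` and `flaky_stream` are invented names, and the simulated failure stands in for a dropped connection):

```python
def consume_stream(stream):
    """Accumulate tokens so a mid-stream failure keeps the partial text.
    Returns (text_so_far, completed_flag)."""
    parts = []
    try:
        for token in stream:
            parts.append(token)
    except ConnectionError:
        # Stream died midway: return what we have and let the caller
        # decide whether to show it, retry, or resume.
        return "".join(parts), False
    return "".join(parts), True

def flaky_stream():
    yield "The answer "
    yield "is 42"
    raise ConnectionError("connection reset")  # simulated mid-stream cut

text, done = consume_stream(flaky_stream())
print(done, repr(text))  # False 'The answer is 42'
```

Whether you then retry from scratch or prompt the model to continue from the partial text is a product decision; both need the buffer this pattern keeps.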

Counting tokens and billing

You can cancel a stream client-side, but the backend may already have generated (and billed) those tokens. In the browser, cancel properly with AbortController so the connection actually closes.
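The same idea in a backend Python consumer (a sketch; `consume_with_budget` is an invented helper, and whether the provider stops generating and billing promptly on close depends on the API):

```python
def consume_with_budget(stream, max_tokens=100):
    """Stop reading once a client-side token budget is hit, and close the
    stream explicitly so the server can stop generating as early as
    possible. Tokens already generated may still be billed."""
    parts = []
    try:
        for i, token in enumerate(stream):
            parts.append(token)
            if i + 1 >= max_tokens:
                break  # client-side cancel
    finally:
        # Generators (and most SDK stream objects) expose close();
        # calling it signals we won't consume any more.
        close = getattr(stream, "close", None)
        if close:
            close()
    return "".join(parts)

tokens = (t for t in ["a", "b", "c", "d", "e"])
print(consume_with_budget(tokens, max_tokens=3))  # abc
```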

Don't try to parse JSON mid-stream

Stream + JSON is straightforward only with JSONL (line-delimited JSON), where every complete line parses on its own. Don't feed a single big JSON object to a regular parser while it's still arriving; parsing it mid-stream requires a dedicated incremental JSON parser.
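With JSONL, the safe pattern is to buffer and parse only the lines that have fully arrived (a sketch; the chunk below is made up, with the last line deliberately cut off):

```python
import json

jsonl_chunk = '{"item": 1}\n{"item": 2}\n{"item":'  # last line still incomplete

def parse_complete_lines(buffer: str):
    """Parse only complete lines; return (objects, leftover partial line)."""
    *complete, leftover = buffer.split("\n")
    return [json.loads(line) for line in complete if line.strip()], leftover

objects, leftover = parse_complete_lines(jsonl_chunk)
print(objects)          # [{'item': 1}, {'item': 2}]
print(repr(leftover))   # '{"item":'

# Feeding the partial line to a regular parser fails, as expected:
try:
    json.loads(leftover)
except json.JSONDecodeError:
    print("partial JSON is unparseable")
```

The leftover fragment is kept and prepended to the next chunk; only newline-terminated lines ever reach `json.loads`.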