LLM
Large Language Model
A large AI model, trained on vast text corpora, that learns to predict the next token and can hold conversations.
An LLM is a neural network with billions of parameters. Its single core job: probabilistically predict the next token given some text. When it does that well enough, capabilities like answering, writing, and coding emerge.
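The "predict the next token" job can be sketched as a toy conditional-probability table. This is an illustration of the idea, not a real model; actual LLMs learn these distributions implicitly in billions of parameters.

```python
import random

# Toy stand-in for a language model: P(next_token | context) as a lookup
# table. The contexts and probabilities here are made up for illustration.
NEXT_TOKEN_PROBS = {
    "the cat": {"sat": 0.6, "ran": 0.3, "slept": 0.1},
    "cat sat": {"on": 0.9, "quietly": 0.1},
}

def predict_next(context: str) -> str:
    """Sample the next token from the model's distribution for this context."""
    dist = NEXT_TOKEN_PROBS[context]
    tokens, probs = zip(*dist.items())
    return random.choices(tokens, weights=probs)[0]

print(predict_next("the cat"))  # most often "sat", sometimes "ran" or "slept"
```

Because generation samples from a distribution, the same prompt can produce different outputs, which is also why determinism is hard to guarantee.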
Almost all are built on the Transformer architecture. Training is typically three-stage: pretraining on trillions of tokens (learning language structure), supervised fine-tuning on human-written examples, then RLHF (reinforcement learning from human feedback) to align the model with human preferences.
Major families: GPT (OpenAI), Claude (Anthropic), Gemini (Google), Llama (Meta, open weights), DeepSeek, Mistral, Qwen. Sizes range from about 2B to a reported ~2T parameters; performance depends on size, data quality, and training technique.
Think of phone keyboard autocomplete — that "next word" suggestion. An LLM is autocomplete a million times more powerful, with much deeper context. It can complete not just words, but paragraphs and entire functions. But it's still guessing — it has no real beliefs or facts.
A law firm sends a contract to Claude for summarization. The API call:
system: "You are a legal assistant. Reply in Turkish, numbered."
user: [10-page contract text] + "Summarize this in 5 bullets."
Claude tokenizes the input (~8000 tokens), generates each next token by probability, and produces a 5-bullet Turkish summary. There's no "understanding" — just statistical prediction. But the prediction is so good the output reads like a human's.
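The scenario above maps onto a request body shaped like Anthropic's Messages API. This is a sketch of the payload only (no network call); the model name is an assumption, and a real call would go through the official SDK with an API key.

```python
# Sketch of the contract-summary request. The contract text is elided;
# "claude-sonnet-4" is an assumed/illustrative model identifier.
contract_text = "..."  # the 10-page contract would go here

request = {
    "model": "claude-sonnet-4",
    "max_tokens": 1024,
    "system": "You are a legal assistant. Reply in Turkish, numbered.",
    "messages": [
        {
            "role": "user",
            "content": contract_text + "\n\nSummarize this in 5 bullets.",
        },
    ],
}
```

Note the split: persistent instructions live in `system`, while the document and the task go in the `user` message.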
Good for:
- Natural-language tasks: summary, translation, rewriting, explanation
- Code generation/explanation (Copilot, Cursor)
- Prototyping classification/extraction (LLM first, dedicated model later)
- Chat assistants, support bots
- Turning structured data into prose (reports, emails)
Not suitable for:
- Precise math/numeric computation (use a calculator/code-execution tool)
- Real-time data lookups (need RAG or function calling)
- When fully deterministic output is required (even T=0 isn't 100% guaranteed)
- Low-latency critical paths (LLM responses are ~1–30s)
Common pitfalls:
Treating LLMs as fact sources
An LLM reproduces statistical patterns from its training data; it has no notion of truth. 'What happened in year X?' can yield a confident wrong answer. For critical information, use RAG or web search.
Underestimating token cost
1000 output tokens ≈ 750 words. 10K users × 5 messages × 1000 tokens = 50M tokens daily. At GPT-4 prices that's serious money. Plan for caching and smaller models.
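The arithmetic above is worth making explicit. The per-token price below is an assumed placeholder; plug in the current rate for whatever model you use.

```python
# Back-of-envelope daily token cost for the scenario above.
users = 10_000
messages_per_user = 5
tokens_per_message = 1_000

daily_tokens = users * messages_per_user * tokens_per_message
print(daily_tokens)  # 50_000_000 tokens/day

price_per_million = 30.0  # USD per 1M output tokens -- ASSUMED, check current pricing
daily_cost = daily_tokens / 1_000_000 * price_per_million
print(f"${daily_cost:,.0f}/day")  # $1,500/day at the assumed rate
```

Even a rough model like this makes the case for caching and routing to cheaper models before launch, not after the first invoice.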
Believing one model fits all
Use a cheap, fast model for classification, a mid-tier model for chat, and a premium model only for hard tasks. Hybrid routing can cut costs substantially (often cited at ~70%).
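Hybrid routing can be as simple as a lookup from task type to model tier. The model names and task taxonomy below are illustrative placeholders, not real identifiers.

```python
# Sketch of hybrid model routing: pick the cheapest model that can
# handle the task; fall back to the premium tier for unknown/hard tasks.
ROUTES = {
    "classification": "small-fast-model",
    "chat": "mid-tier-model",
    "reasoning": "premium-model",
}

def route(task_type: str) -> str:
    """Return the model tier for a task, defaulting to premium."""
    return ROUTES.get(task_type, "premium-model")

print(route("classification"))  # small-fast-model
print(route("contract-review"))  # premium-model (unknown task -> safe default)
```

In production the routing decision itself is often made by a cheap classifier, so the expensive model only sees the traffic that needs it.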