Transformer
The architecture behind LLMs
The neural-network architecture behind modern LLMs that learns how much each word should attend to every other word.
Introduced in Google's 2017 paper "Attention Is All You Need," the Transformer uses a self-attention mechanism. To understand a word, the model decides for itself which other words in the sentence to look at — and how strongly.
Earlier architectures (RNN, LSTM) processed text word by word, in order. Transformers run in parallel — every word "looks at" every other word simultaneously. This is fast on GPUs and captures long-range context.
Two main parts: an encoder that understands the input (the basis of BERT) and a decoder that generates output (the basis of GPT). Most modern LLMs are decoder-only. Thanks to multi-head attention, the model learns multiple relationship types at once (syntactic, semantic, referential).
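To make the mechanism concrete, here is a minimal NumPy sketch of a single attention head; the array sizes and random weight matrices are illustrative, not taken from any real model.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    x:          (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_head) projection matrices
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # project tokens into query/key/value spaces
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # one score per (word, other-word) pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                               # each output mixes all value vectors, weighted by attention

# Toy usage: 5 tokens, 8-dim embeddings, one 8-dim head (sizes are arbitrary).
# A multi-head layer runs several such heads with different Wq/Wk/Wv and concatenates the results.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one context-aware vector per token
```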
When reading a sentence your eyes jump back to earlier words: who is "they"? Which "thing" is "that"? Self-attention does exactly that — for every word, the model decides how much to look at every other word to pin down meaning. Multi-head = looking at the same sentence with different glasses (grammar lens, meaning lens, reference lens).
"Ali threw the ball to Mehmet because he was tired." Who is "he"? Ali or Mehmet?
A human: infers from context that the receiver, not the thrower, was tired. An RNN: carries prior words one at a time and can lose the antecedent by the time it reaches "he." A Transformer: at "he," spreads attention over all prior words, weights "Mehmet" more strongly, and resolves the reference correctly.
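Just to illustrate the weighting step: the scores below are invented for this example (a real model computes them from learned query and key projections), but the softmax behaves the same way.

```python
import numpy as np

# Hypothetical attention logits for the query token "he" over the prior tokens.
# In a trained model these come from Q·K dot products; here they are made up.
tokens = ["Ali", "threw", "the", "ball", "to", "Mehmet", "because"]
logits = np.array([2.1, 0.3, -1.0, 0.5, -0.8, 3.4, 0.1])

weights = np.exp(logits - logits.max())
weights = weights / weights.sum()            # softmax: weights sum to 1

for tok, w in sorted(zip(tokens, weights), key=lambda p: -p[1]):
    print(f"{tok:8s} {w:.2f}")               # "Mehmet" gets the largest share of attention
```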
Multiply this trivial case by 96 attention heads across 96 layers (GPT-3's scale) and you see why language understanding suddenly took off.
Where it shines
- Pretty much all language tasks (LLM foundation)
- Vision modeling (Vision Transformer — ViT)
- Code, music generation, protein folding (AlphaFold)
- General sequence-to-sequence problems
Where it's overkill
- Very short, simple patterns (logistic regression is enough)
- Pure sequential time-series forecasting (LSTM is still competitive)
- Low-resource environments — Transformers eat GPU
Quadratic complexity
Attention scales with the square of sequence length: at 100K tokens that is roughly 10 billion pairwise scores per attention head. Long contexts demand optimizations: FlashAttention (IO-efficient exact attention), sliding-window attention, sparse attention.
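A back-of-the-envelope sketch of why a sliding window helps; the window size of 512 is an arbitrary choice for illustration.

```python
def scored_pairs(seq_len, window=None):
    """How many (query, key) pairs get an attention score.

    window=None -> full attention: every token attends to every token, O(n^2).
    window=w    -> sliding-window attention: each token attends only to the
                   w tokens on either side of it (plus itself), O(n * w).
    """
    if window is None:
        return seq_len * seq_len
    return sum(min(seq_len, i + window + 1) - max(0, i - window) for i in range(seq_len))

print(f"{scored_pairs(100_000):,}")              # 10,000,000,000 scores at 100K tokens
print(f"{scored_pairs(100_000, window=512):,}")  # ~102 million: grows linearly with length
```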
Position info isn't automatic
Self-attention doesn't see order! Without positional information (sinusoidal encodings, RoPE, ALiBi), 'cat dog' and 'dog cat' look identical to the model.
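A minimal sketch of the sinusoidal encoding from the original paper, which adds a position-dependent vector to each token embedding; the sequence length and model width below are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Positional encodings from "Attention Is All You Need".

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Added to the token embeddings so 'cat dog' and 'dog cat' no longer
    look identical to the (order-blind) attention layers.
    """
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2) even dimensions
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=16, d_model=32)
print(pe.shape)  # (16, 32): one unique position vector per slot
```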
Treating it as the final word
Transformers aren't the best tool for everything. State-space models (Mamba) are active alternatives, and Mixture-of-Experts variants rework the Transformer from the inside. The architecture is still evolving.