Transformer
The architecture behind LLMs
The neural-network architecture behind modern LLMs that learns how much each word should attend to every other word.
Introduced in Google's 2017 paper "Attention Is All You Need," the Transformer uses a self-attention mechanism. To understand a word, the model decides for itself which other words in the sentence to look at — and how strongly.
Earlier architectures (RNN, LSTM) processed text word by word, in order. Transformers run in parallel — every word "looks at" every other word simultaneously. This is fast on GPUs and captures long-range context.
Two main parts: an encoder that understands the input (the basis of BERT) and a decoder that generates output (the basis of GPT). Most modern LLMs are decoder-only. Thanks to multi-head attention, the model learns multiple relationship types at once (syntactic, semantic, referential).
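To make the mechanism concrete, here is a minimal NumPy sketch of a single attention head; the array sizes and random weight matrices are illustrative, not taken from any real model.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    x:          (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_head) projection matrices
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # project tokens into query/key/value spaces
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # one score per (word, other-word) pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                               # each output mixes all value vectors, weighted by attention

# Toy usage: 5 tokens, 8-dim embeddings, one 8-dim head (sizes are arbitrary).
# A multi-head layer runs several such heads with different Wq/Wk/Wv and concatenates the results.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one context-aware vector per token
```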
When reading a sentence your eyes jump back to earlier words: who is "they"? Which "thing" is "that"? Self-attention does exactly that — for every word, the model decides how much to look at every other word to pin down meaning. Multi-head = looking at the same sentence with different glasses (grammar lens, meaning lens, reference lens).
"Ali threw the ball to Mehmet because he was tired." Who is "he"? Ali or Mehmet?
A human: infers from context that the receiver, not the thrower, was tired. An RNN: carries prior words one at a time and can lose the antecedent by the time it reaches "he." A Transformer: at "he," spreads attention over all prior words, weights "Mehmet" more strongly, and resolves the reference correctly.
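Just to illustrate the weighting step: the scores below are invented for this example (a real model computes them from learned query and key projections), but the softmax behaves the same way.

```python
import numpy as np

# Hypothetical attention logits for the query token "he" over the prior tokens.
# In a trained model these come from Q·K dot products; here they are made up.
tokens = ["Ali", "threw", "the", "ball", "to", "Mehmet", "because"]
logits = np.array([2.1, 0.3, -1.0, 0.5, -0.8, 3.4, 0.1])

weights = np.exp(logits - logits.max())
weights = weights / weights.sum()            # softmax: weights sum to 1

for tok, w in sorted(zip(tokens, weights), key=lambda p: -p[1]):
    print(f"{tok:8s} {w:.2f}")               # "Mehmet" gets the largest share of attention
```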
Multiply this trivial case by 96 attention heads across 96 layers (GPT-3's scale) and you see why language understanding suddenly took off.
Where it shines
- Pretty much all language tasks (LLM foundation)
- Vision modeling (Vision Transformer — ViT)
- Code, music generation, protein folding (AlphaFold)
- General sequence-to-sequence problems
Where it's overkill
- Very short, simple patterns (logistic regression is enough)
- Pure sequential time-series forecasting (LSTM is still competitive)
- Low-resource environments — Transformers eat GPU
Quadratic complexity
Attention scales with the square of sequence length: at 100K tokens that is roughly 10 billion pairwise scores per attention head. Long contexts demand optimizations: FlashAttention (IO-efficient exact attention), sliding-window attention, sparse attention.
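A back-of-the-envelope sketch of why a sliding window helps; the window size of 512 is an arbitrary choice for illustration.

```python
def scored_pairs(seq_len, window=None):
    """How many (query, key) pairs get an attention score.

    window=None -> full attention: every token attends to every token, O(n^2).
    window=w    -> sliding-window attention: each token attends only to the
                   w tokens on either side of it (plus itself), O(n * w).
    """
    if window is None:
        return seq_len * seq_len
    return sum(min(seq_len, i + window + 1) - max(0, i - window) for i in range(seq_len))

print(f"{scored_pairs(100_000):,}")              # 10,000,000,000 scores at 100K tokens
print(f"{scored_pairs(100_000, window=512):,}")  # ~102 million: grows linearly with length
```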
Position info isn't automatic
Self-attention doesn't see order! Without positional information (sinusoidal encodings, RoPE, ALiBi), 'cat dog' and 'dog cat' look identical to the model.
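A minimal sketch of the sinusoidal encoding from the original paper, which adds a position-dependent vector to each token embedding; the sequence length and model width below are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Positional encodings from "Attention Is All You Need".

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Added to the token embeddings so 'cat dog' and 'dog cat' no longer
    look identical to the (order-blind) attention layers.
    """
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2) even dimensions
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=16, d_model=32)
print(pe.shape)  # (16, 32): one unique position vector per slot
```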
Treating it as the final word
Transformers aren't the best tool for everything. State-space models (Mamba) are active alternatives, and Mixture-of-Experts variants rework the Transformer from the inside. The architecture is still evolving.