AI Dictionary
Beginner · ~2 min read · #token #tokenization

Token

The atomic unit of text for an LLM

The smallest unit of text an LLM reads — usually a word, sub-word piece, or punctuation.

TEXT → TOKENS"Tokenization is how LLMs read text"Token#1024ization#1161is#1298how#1435LLMs#1572read#1709text#1846each token gets a numeric ID — that's what the LLM actually reads
Definition

LLMs don't process letters or whole words; they process tokens. A token can be a whole word ("cat"), a word piece ("token", "ization"), a single punctuation mark, or a space. This splitting process is called tokenization.

Rule of thumb: in English, ~1 token ≈ 0.75 words, or ~4 characters. Turkish and other agglutinative languages tokenize into more pieces (a single word can become 3–4 tokens). Chinese and Japanese behave differently again: a common character is often one token, while a rarer one can split into several byte-level pieces.
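A minimal sketch of that rule of thumb (the function name and the 4-characters-per-token constant are illustrative assumptions, useful only for rough budgeting):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough English-only estimate based on the ~4 characters per token rule.

    This is a budgeting heuristic, not a real tokenizer; for exact counts
    use the model's own tokenizer (see the tiktoken examples below).
    """
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("Tokenization is how LLMs read text"))  # 8 (a real tokenizer counts 7)
```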

Modern tokenizers use BPE (Byte-Pair Encoding) or SentencePiece. They merge frequent sub-strings into single tokens and split rare words into pieces. Vocab sizes typically range 32K–256K.
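To watch that merge behavior in action, here is a quick check with the tiktoken library mentioned below (a sketch using the cl100k_base encoding; the sample words are arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 / GPT-3.5 encoding

for word in ["cat", "tokenization", "floccinaucinihilipilification"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")

# Frequent strings tend to survive as single tokens, while rare words
# get split into several sub-word pieces.
```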

Analogy

Like splitting a sentence into LEGO bricks. Each brick is a token. The LLM first chops the sentence into bricks, then converts each brick to a numeric ID and works from there. Depending on the model, even a short word like "merhaba" can break into several bricks (e.g. "mer", "ha", "ba").

Real-world example

"ChatGPT is amazing!" → GPT-4 tokenizer: ["Chat", "G", "PT", " is", " amazing", "!"] = 6 tokens.

"Yapay zeka harika!" → ["Yap", "ay", " zeka", " harika", "!"] = 5 tokens.

Roughly the same message, a different token count. Turkish typically uses 30–50% more tokens for the same content, which means your API bill is 30–50% higher for the same prose.

See it for yourself in the OpenAI Tokenizer Playground, or count programmatically with the tiktoken library.
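To count programmatically, a sketch with tiktoken (this reflects OpenAI's encodings only, and exact counts may differ slightly from the figures above depending on the encoding and library version):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # resolves to the cl100k_base encoding

for sentence in ["ChatGPT is amazing!", "Yapay zeka harika!"]:
    ids = enc.encode(sentence)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{sentence!r}: {len(ids)} tokens -> {pieces}")
```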

When to use
  • Estimating API cost (every token = money)
  • Measuring whether the context window is full (see the sketch after this list)
  • Prompt optimization: trim tokens to cut cost
  • Managing chunked output in streaming applications, where responses arrive token by token
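A minimal sketch of such a budget check, assuming a hypothetical 128K-token context window and a reserve for the reply (both constants are illustrative, not any model's official limits):

```python
import tiktoken

CONTEXT_WINDOW = 128_000    # assumed limit; check your model's documentation
RESERVED_FOR_REPLY = 4_000  # room you want to keep for the model's answer

def fits_in_context(prompt: str, model: str = "gpt-4") -> bool:
    """True if the prompt still leaves RESERVED_FOR_REPLY tokens free."""
    enc = tiktoken.encoding_for_model(model)
    used = len(enc.encode(prompt))
    return used + RESERVED_FOR_REPLY <= CONTEXT_WINDOW

print(fits_in_context("Summarize the following report: ..."))  # True for a short prompt
```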
When not to use
  • Confusing character count with token count (1 token ≠ 1 character)
  • Equating word count with token count (especially wrong in non-English)
  • Thinking of the tokenizer separately from the model — each model has its own
Common pitfalls

Non-English cost surprises

Same content: 1000 tokens English, 1500 tokens Turkish. Estimate before the API bill arrives. For some projects, translating TR → EN → answer → translate back is actually cheaper.
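As back-of-the-envelope arithmetic (the per-token price below is a made-up placeholder; substitute your provider's actual rate):

```python
PRICE_PER_1K_TOKENS = 0.01  # hypothetical placeholder rate, not a real price

def monthly_cost(tokens_per_request: int, requests: int) -> float:
    return tokens_per_request * requests / 1000 * PRICE_PER_1K_TOKENS

en = monthly_cost(1_000, 100_000)  # English prompt: ~1,000 tokens
tr = monthly_cost(1_500, 100_000)  # same content in Turkish: ~1,500 tokens
print(f"EN: ${en:,.2f}  TR: ${tr:,.2f}  (+{tr / en - 1:.0%})")
```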

Tokenizer mismatches

GPT-4 and Claude use different tokenizers. "This prompt is 4000 tokens": measured with which tokenizer? Always count with the tokenizer of the model you are actually calling.
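Claude's tokenizer isn't available through tiktoken, but even two OpenAI encodings disagree on the same text, which makes the point (o200k_base requires a reasonably recent tiktoken release):

```python
import tiktoken

text = "Tokenization is how LLMs read text"

for name in ["cl100k_base", "o200k_base"]:  # GPT-4 vs GPT-4o encodings
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")

# Different vocabularies give different counts for identical text,
# so count with the tokenizer of the model you are billing against.
```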

Whitespace and special characters

Whitespace, newlines, and emoji all consume tokens (a single emoji can even be several tokens). When asking for JSON, the {, }, and : burn tokens too. Keep the output format lean.
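A quick comparison of the same payload pretty-printed versus minified (a sketch using the GPT-4 encoding; the payload itself is an arbitrary example):

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
payload = {"name": "Ada", "role": "engineer", "skills": ["python", "llms"]}

pretty = json.dumps(payload, indent=2)
minified = json.dumps(payload, separators=(",", ":"))

print("pretty:  ", len(enc.encode(pretty)), "tokens")
print("minified:", len(enc.encode(minified)), "tokens")
# Indentation and newlines are billed like any other text,
# so a compact output format trims the token count.
```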