Token
The atomic unit of text for an LLM
The smallest unit of text an LLM reads — usually a word, sub-word piece, or punctuation.
LLMs don't process letters or whole words; they process tokens. A token can be a whole word ("cat"), a word piece ("token", "ization"), a single punctuation mark, or a space. The splitting process is called tokenization.
Rule of thumb: in English, ~1 token ≈ 0.75 words, or ~4 characters. Turkish and other agglutinative languages tokenize into more pieces (a single word can become 3–4 tokens). Chinese and Japanese behave differently again: common characters often map to one token each, rarer ones to several.
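A quick sanity check on that rule of thumb is a character-based estimate; a minimal sketch, with the caveat that the 4-characters-per-token figure is only a heuristic for English prose:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic only: ~4 characters per token holds for typical English
    # prose. Non-English text (e.g. Turkish) usually needs more tokens than this.
    return max(1, round(len(text) / 4))

print(estimate_tokens("ChatGPT is amazing!"))  # -> 5, close to the real count of 6
```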
Modern tokenizers use BPE (Byte-Pair Encoding) or SentencePiece. They merge frequent sub-strings into single tokens and split rare words into pieces. Vocab sizes typically range 32K–256K.
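To make the merging idea concrete, here is a toy sketch of the core BPE training loop over a tiny word list. Real tokenizers operate on raw bytes, train on huge corpora, and build merge tables with tens of thousands of entries; this only shows the principle.

```python
from collections import Counter

def toy_bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Repeatedly merge the most frequent adjacent symbol pair (toy BPE)."""
    corpus = [list(w) for w in words]  # start with each word as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the chosen merge everywhere in the corpus.
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges

# Frequent substrings like "ok" / "en" get merged into single symbols first.
print(toy_bpe_merges(["token", "tokens", "tokenization", "broken"], 4))
```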
Like splitting a sentence into LEGO bricks. Each brick is a token. The LLM first chops the sentence into bricks, then converts each brick to a numeric ID and works from there. "merhaba" (Turkish for "hello") might split into pieces like "mer", "ha", "ba", depending on the model.
"ChatGPT is amazing!" → GPT-4 tokenizer: ["Chat", "G", "PT", " is", " amazing", "!"] = 6 tokens.
"Yapay zeka harika!" → ["Yap", "ay", " zeka", " harika", "!"] = 5 tokens.
Same meaning, different token count. Turkish typically uses 30–50% more tokens for the same content — meaning your API bill is 30–50% higher for the same prose.
See it for yourself in the OpenAI Tokenizer Playground, or count programmatically with the tiktoken library.
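For example, a minimal tiktoken sketch; cl100k_base is the encoding used by GPT-4-era models, and the exact splits and counts can differ from the examples above depending on the encoding you pick:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "ChatGPT is amazing!"
ids = enc.encode(text)

print(len(ids))                        # number of tokens
print([enc.decode([i]) for i in ids])  # the individual token strings
```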
- Estimating API cost (every token = money; see the cost sketch after this list)
- Measuring whether the context window is full
- Prompt optimization — trim tokens to cut cost
- Managing chunked output in streaming applications
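A minimal cost-estimation sketch along those lines. The price constant is a hypothetical placeholder, not a real rate; look up the current per-token pricing for the model you actually call.

```python
import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.0025  # hypothetical USD figure, not a real price

def estimate_prompt_cost(prompt: str, encoding_name: str = "cl100k_base") -> float:
    # Count tokens with the tokenizer, then scale by the per-1K-token price.
    enc = tiktoken.get_encoding(encoding_name)
    n_tokens = len(enc.encode(prompt))
    return n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

print(estimate_prompt_cost("Summarize the following article: ..."))
```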
- Confusing character count with token count (1 token ≠ 1 character)
- Equating word count with token count (especially wrong in non-English)
- Treating the tokenizer as separate from the model; each model family ships its own tokenizer
Non-English cost surprises
The same content can be ~1000 tokens in English and ~1500 in Turkish. Estimate before the API bill arrives. For some projects it is actually cheaper to translate the input to English, get the answer, and translate the answer back.
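A sketch of how to measure that gap up front; the sentences below are arbitrary examples and the exact counts depend on the encoding, so treat the relative difference as the signal, not the absolute numbers.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
english = "Artificial intelligence is transforming software development."
turkish = "Yapay zeka yazılım geliştirmeyi dönüştürüyor."

# The Turkish sentence typically needs noticeably more tokens for the same meaning.
print(len(enc.encode(english)), len(enc.encode(turkish)))
```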
Tokenizer mismatches
GPT-4 and Claude use different tokenizers. 'This prompt is 4000 tokens' — on which model? Always count with the correct tokenizer.
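A sketch of the same point. Claude's tokenizer is not available through tiktoken, so two tiktoken encodings stand in to show that counts diverge between tokenizers; o200k_base requires a reasonably recent tiktoken release.

```python
import tiktoken

text = "Tokenization differs between models."
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    # The same text yields different token counts under different encodings.
    print(name, len(enc.encode(text)))
```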
Whitespace and special characters
Spaces, newlines, and emojis are separate tokens. When you ask for JSON output, the braces, colons, and quotes also burn tokens, as the sketch below shows. Keep the output format lean.
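A sketch comparing compact and pretty-printed JSON for the same payload, counted with cl100k_base; the payload itself is an arbitrary example.

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
data = {"name": "Ada", "scores": [90, 85, 77], "active": True}

compact = json.dumps(data, separators=(",", ":"))  # no extra whitespace
pretty = json.dumps(data, indent=2)                # indentation and newlines cost tokens

print(len(enc.encode(compact)), len(enc.encode(pretty)))
```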