
Token Reduction Techniques

Practical ways to cut token cost when using LLMs — prompt compression, caching, structured output, smart RAG, and more.

Tokens · Cost · Optimization

[Illustration: a 5,000-token prompt cut to 1,100 tokens (~78% saved) via caching, smaller models, structured output, and smart RAG; lower cost, faster responses.]

Why bother?

Every LLM-using product hits a token bill at scale. Worse, more tokens = slower responses + more "context fatigue" (the model skims long prompts). Reducing them protects both your budget and your output quality.

Ten practical techniques follow, each with examples. You don't need to apply all of them; start with the highest ROI.

1. Use prompt caching

Anthropic, OpenAI, and the rest now offer prompt caching. If you reuse the same prefix (system prompt, fixed instructions, retrieval template) across calls, caching cuts that part's cost up to 10×.

from anthropic import Anthropic

client = Anthropic()

SYSTEM_PROMPT = """[5,000-token product doc + detailed rules...]"""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Customer question..."}],
)
# Subsequent calls within TTL hit the cache at a fraction of cost.

Catch: the default cache TTL is ~5 minutes, refreshed each time the cached prefix is reused. High-traffic endpoints win big; low-traffic endpoints barely benefit.

2. Use structured output

A JSON schema kills the "sure, here's your answer" preamble the model adds to free text. Output shrinks and parsing becomes reliable.

# Wrong: free text → long, hard to parse
prompt = "Tell me the sentiment of this review."
# Output: "Sure, the review is quite positive because..."  → 80 tokens

# Right: structured
import json

response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    system='Reply with JSON only. Schema: {"sentiment": "pos"|"neg"|"neu"}',
    messages=[{"role": "user", "content": review}],
    max_tokens=20,  # hard cap
)
result = json.loads(response.content[0].text)
# Output: {"sentiment": "pos"}  → 8 tokens

A tight max_tokens is a hard cap the model cannot exceed; set it deliberately for every endpoint.

3. Smaller model first, escalate as needed

Not every request needs the strongest model. Cascade: try Haiku first; if quality is insufficient, escalate to Sonnet, then Opus.

def smart_route(question, context):
    # Easy, short → Haiku ($0.80/$4 per 1M tokens)
    if len(question) < 200 and not needs_reasoning(question):
        return call_model("claude-haiku-4-5-20251001", question, context)

    # Only complex, multi-step reasoning → Opus ($15/$75 per 1M);
    # add an explicit branch here if your workload needs it.

    # Medium → Sonnet ($3/$15 per 1M)
    return call_model("claude-sonnet-4-6", question, context)

Roughly 70% of production traffic is easy queries; routing them to Haiku can halve the monthly bill.

4. Stop sending the entire history

Resending every prior message every turn means re-paying for the same tokens. Three strategies:

a) Summarize the conversation

After N turns, replace the history with a single summary:

if len(messages) > 10:
    summary = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,  # cap the summary itself
        system="Summarize as bullets. Preserve decisions, names, numbers.",
        messages=messages[:8] + [{"role": "user", "content": "Summarize."}],
    ).content[0].text

    messages = [{"role": "user", "content": f"<summary>\n{summary}\n</summary>"}] + messages[8:]

b) Sliding window

Keep only the last N turns; persist older state in your DB and pull via RAG when needed.
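
A minimal sketch; WINDOW is an assumed tuning knob, not a recommendation:

WINDOW = 12  # last 12 messages ≈ 6 user/assistant exchanges

def sliding_window(messages):
    # Older turns live in your DB; fetch them via RAG only when relevant.
    return messages[-WINDOW:]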

c) Semantic selection

Out of 50 past messages, pick the 5 most relevant to the current question via embedding similarity. The rest don't enter the prompt.
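
A rough sketch, assuming an embed() helper that turns text into a vector (an embeddings API call, for example):

import numpy as np

def select_relevant(messages, query, embed, k=5):
    # Score every past message against the current question by cosine similarity.
    q = np.array(embed(query))
    scored = []
    for msg in messages:
        v = np.array(embed(msg["content"]))
        sim = float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)
        scored.append((sim, msg))
    # Keep only the k most relevant; the rest never enter the prompt.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [msg for _, msg in scored[:k]]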

5. Smart RAG: fewer, better chunks

Default RAG setups often pull the top-10 chunks. Usually 3-5 suffice; add a reranker to discard noise.

# Pull top-20, rerank to top-3
candidates = vector_db.search(query, top_k=20)
ranked = reranker.rank(query, candidates)
top_chunks = ranked[:3]

# 3 chunks × 500 tokens = 1500 tokens
# vs naive top-10: 5000 tokens
# 70% savings, often better answers

6. Compress the system prompt

Replace verbose prose with terse rule lists. 40–60% shrinkage is typical:

# Before (245 tokens):
You are a customer service assistant for our e-commerce store. Please
make sure to be polite and professional at all times. When responding
to customers, you should always reply in Turkish unless they ask
otherwise. Your responses should be concise but helpful, and you
should never offer discounts that are not officially approved by our
management team. If a customer asks about something you don't know,
politely inform them and offer to escalate to a human agent.

# After (89 tokens):
ROLE: e-commerce support assistant.
LANGUAGE: Turkish.
TONE: concise, professional.
RULE: never offer unapproved discounts. Escalate unknowns to humans.

Same information, 2.7× fewer tokens. The "please", "at all times", "you should always" filler is gone.

7. Caveman / compact response mode

In Claude Code the /caveman skill cuts output tokens by roughly 75%. With the API you can replicate it:

RESPONSE STYLE:
Cut filler. Drop articles ("the", "a"). Drop pleasantries.
Use fragments where clear. Code unchanged. Keep technical
substance. One word when one word enough.
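
A sketch of wiring this style in via the API; bug_report stands in for your actual content:

CAVEMAN_STYLE = """RESPONSE STYLE:
Cut filler. Drop articles ("the", "a"). Drop pleasantries.
Use fragments where clear. Code unchanged. Keep technical
substance. One word when one word enough."""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=CAVEMAN_STYLE,  # or append it to your existing system prompt
    messages=[{"role": "user", "content": bug_report}],
)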

Great for bug reports, debug sessions, internal APIs. Bad for customer-facing replies — too telegraphic.

8. Selective tool exposure

Each function-calling tool definition adds tokens. With 30 tools defined, dynamically filter by intent:

def select_tools(user_message):
    intent = classify_intent(user_message)

    tool_groups = {
        "files": [read_file, write_file, list_files],
        "github": [create_pr, list_issues, comment_pr],
        "data": [query_db, run_analytics],
    }

    # Fall back to every tool if the intent isn't recognized.
    return tool_groups.get(intent, [t for group in tool_groups.values() for t in group])

response = client.messages.create(
    model="claude-sonnet-4-6",
    tools=select_tools(user_message),  # 3 tools instead of 30
    messages=[...],
)

9. Batch API

Anthropic's Message Batches API and OpenAI's Batch API offer ~50% off for bulk async work (results within 24 hours):

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"req-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 256,
                "messages": [{"role": "user", "content": doc}],
            },
        }
        for i, doc in enumerate(documents)
    ]
)

# Results in hours, half off
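
Collecting the output is a separate call once processing ends; a rough sketch against the Anthropic Python SDK (handle() is a placeholder; check the current docs for the exact result shape):

import time

# Poll until the batch finishes processing.
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(60)

# Stream per-request results; match them back to inputs via custom_id.
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        handle(entry.custom_id, entry.result.message)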

Content moderation, sentiment analysis, document classification — all great fits.

10. Know your tokenization

The same data tokenizes differently:

Form              Tokens (tiktoken cl100k_base)
id: 12345         4
"id":12345        5
id=12345          5
<id>12345</id>    8

Flat JSON typically tokenizes 20-30% more compactly than the equivalent XML. UUIDs and long string IDs are extremely expensive; prefer short integer IDs when possible.
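
To check counts for your own payload shapes (the table uses tiktoken's cl100k_base; Claude's tokenizer differs, so read the numbers as relative, not exact):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for form in ["id: 12345", '"id":12345', "id=12345", "<id>12345</id>"]:
    print(f"{form:<16} {len(enc.encode(form))} tokens")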

ROI-ordered priority

You don't need everything overnight. Order:

  1. Prompt caching (5-10× on hot endpoints)
  2. Smaller model (Haiku for easy queries)
  3. Structured output + max_tokens (shorter outputs)
  4. System prompt compression (every call shrinks)
  5. Smart RAG (top-3 + reranker)
  6. Conversation summarization (long chats)
  7. Batch API (async work)

Most products cut 50-70% of monthly cost following this order.

Measure first

Don't optimize blind. Establish a baseline first: the Anthropic Console and the OpenAI dashboard both show usage and cost. Track:

  • Average input tokens per request
  • Average output tokens per request
  • Cost per endpoint
  • Most-used system prompts

Without these, you can't tell which optimization actually paid off.
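
A minimal way to capture the first two, assuming the Anthropic Python SDK; log_metric and question are placeholders for your metrics backend and request:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": question}],
)

# Every response reports its own usage; ship it to your metrics store per endpoint.
log_metric("llm.input_tokens", response.usage.input_tokens, tags={"endpoint": "support"})
log_metric("llm.output_tokens", response.usage.output_tokens, tags={"endpoint": "support"})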

Continue reading

  • Token — the atomic unit the model sees and why it's the unit of savings.
  • Context Window — the maximum tokens a model can hold at once.
  • System Prompt Guide — patterns for compact, effective system prompts.
  • KV Cache — the mechanism that makes long-context inference affordable.
  • MCP Server Catalog — selectively enabling tools to keep prompts small.