AI Dictionary

BM25

Best Matching 25 — classic relevance score

A classic formula that scores how relevant a document is to a query in keyword search. An evolved version of TF × IDF.

[Diagram: BM25, the TF × IDF classic. TF (Term Frequency): "GDPR" appears 4 times; more occurrences → more relevant. × IDF (Inverse Document Frequency): "GDPR" appears in 12 of 10K docs; rare = distinctive → more valuable. = relevance score 8.4 → rank near top. Elasticsearch, Lucene, OpenSearch are all built on BM25 (1994).]
Definition

BM25 (Best Matching 25) is a relevance-scoring formula developed by Stephen Robertson et al. in 1994, and still the core of traditional search engines. Even in the AI era, it's a key piece of hybrid search and RAG systems.

Built from two components:
  • TF (Term Frequency): how many times does the query word appear in the document? More = more relevant.
  • IDF (Inverse Document Frequency): how rare is the word across all documents? Rarer = more distinctive.
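To make the two components concrete, here is a minimal sketch in plain Python (the smoothed IDF variant and the toy corpus are illustrative choices, not any particular engine's implementation):

```python
import math

def tf(term, doc_tokens):
    # Raw term frequency: how often the term appears in this document.
    return doc_tokens.count(term)

def idf(term, corpus):
    # Inverse document frequency: terms that appear in few documents
    # score higher. This is a common smoothed variant.
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return math.log((n_docs + 1) / (df + 1)) + 1

corpus = [
    ["gdpr", "article", "17", "erasure"],
    ["gdpr", "compliance", "report"],
    ["weather", "forecast", "today"],
]
# "weather" occurs in only one document, so it is more distinctive
# (higher IDF) than "gdpr", which occurs in two.
score = tf("gdpr", corpus[0]) * idf("gdpr", corpus)
```

Multiplying the two gives the naive TF × IDF score that BM25 refines.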

BM25 adds two refinements over naive TF×IDF: term saturation (repeating the same word yields diminishing returns rather than an unbounded score) and length normalization (a match in a short document counts for more than the same match buried in a long one).
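Both refinements are visible in a compact, self-contained sketch of the Okapi BM25 formula. This is a toy scorer for illustration, not a production implementation; k1 and b are the standard tuning parameters, and the corpus is invented:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Toy Okapi BM25. k1 tunes term-frequency saturation,
    b tunes document-length normalization."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)
        if tf == 0:
            continue  # absent term contributes nothing
        df = sum(1 for d in corpus if term in d)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization: matches in long docs are worth less.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        # Saturation: tf * (k1 + 1) / (tf + norm) flattens as tf grows.
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score

docs = [["gdpr", "law"], ["gdpr", "gdpr"], ["tax", "law"]]
one_hit = bm25_score(["gdpr"], docs[0], docs)
two_hits = bm25_score(["gdpr"], docs[1], docs)
# Saturation in action: the second occurrence adds less than the first,
# so two_hits is higher than one_hit but less than double it.
```

Raising k1 makes repeated terms matter more; setting b to 0 switches length normalization off entirely.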

Elasticsearch, OpenSearch, Lucene, Solr — all use BM25 as the default. Postgres tsvector uses a similar TF-IDF approach.

Analogy

The mathematical way to gauge how relevant a book is to your query. TF: how many pages mention "cancer"? Many → that's probably the topic. IDF: does "cancer" appear in nearly every book in the library, or only a few? If it's everywhere, it doesn't differentiate. Together they yield a relevance score.

Real-world example

Query: "GDPR Article 17 right to erasure"
Corpus: 10,000 legal docs.

BM25 (simplified):
  • Doc A: "GDPR" appears (TF=8, high IDF), "Article 17" appears (TF=3, very high IDF), "erasure" missing → score: 12.4
  • Doc B: "GDPR" present (TF=2), "erasure" present (TF=15), no "Article 17" → score: 6.8
  • Doc C: all terms present, but it is a 500-page general report (long) → score: 9.1 after length adjustment

Order: A > C > B. Doc A wins thanks to exact-term match + rare-IDF bonus + reasonable length.

When to use
  • Hybrid search — always pair BM25 with vector search
  • Exact-term search (code, SKUs, legal article numbers)
  • Low-resource environments — no embedding model required
  • Explainable search — you can show why a result ranked
  • Multilingual — independent of vector-model language
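Reciprocal rank fusion (RRF) is one common way to combine a BM25 ranking with a vector-search ranking in hybrid setups. A minimal sketch, assuming two pre-computed result lists with hypothetical document ids:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists (best first) into one.
    k=60 is the conventional constant from the original RRF paper;
    it damps the influence of any single list's top ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_b", "doc_c"]    # keyword ranking (hypothetical)
vector_hits = ["doc_c", "doc_a", "doc_d"]  # semantic ranking (hypothetical)
# doc_a ranks well in both lists, so it rises to the top of the fusion.
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF only needs ranks, not raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.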
When not to use
  • Need semantic similarity only (synonyms, paraphrase)
  • Word order matters (BM25 treats words as a bag)
  • Very short docs (TF becomes meaningless)
  • Users with typos (BM25 has no tolerance)
Common pitfalls

Insufficient on its own (in modern context)

BM25 matches tokens, not meanings: a query for "conformity" will never surface a document that only says "compliance". Hybrid retrieval (BM25 + vector search) is standard practice; relying on BM25 alone means ignoring 30 years of progress.

Forgetting stop words

English 'the' and 'a', or Turkish 'bir' and 'ile', have high TF and low IDF. BM25's IDF component largely neutralizes them, but a bad analyzer configuration still produces noise. Language-specific stopword lists are needed.

Skipping stemming

If 'run', 'running', and 'ran' are counted as different tokens, TF is split across them and the score drops. Stemmers (Snowball for English, Zemberek for Turkish) are essential; without them you miss relevant docs.
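The effect is easy to see with an illustration-only suffix stripper (a deliberately naive stand-in for a real stemmer such as Snowball or Zemberek):

```python
def naive_stem(token):
    # Toy suffix stripper for illustration only; production systems
    # use real stemmers (Snowball for English, Zemberek for Turkish).
    for suffix in ("ning", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["run", "running", "runs", "ran"]
stemmed = [naive_stem(t) for t in tokens]
# "run", "running", "runs" collapse to one token, tripling its TF.
# The irregular form "ran" survives untouched: suffix stripping cannot
# handle it, which is why lemmatization exists as a separate step.
```

Without this collapse, each surface form carries a TF of 1 and relevant documents sink in the ranking.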