Naive Bayes
Probabilistic classifier with the independence trick
Applies Bayes' theorem assuming features are conditionally independent — simple, but a strong baseline for text classification.
Naive Bayes flips the question via Bayes' theorem: instead of modeling P(class | features) directly, it estimates how likely each feature is under each class and combines them: P(class | features) ∝ P(class) × ∏ P(feature | class). The "naive" part: features are assumed conditionally independent given the class. That's almost never literally true, but it makes the math dramatically simpler — and the model still works.
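A minimal sketch of that product rule with made-up numbers (the priors and per-class word probabilities below are illustrative, not learned from any real corpus): score each class as log prior plus the sum of log conditionals, then pick the larger.

import numpy as np

# Toy two-class spam model (numbers are illustrative only).
priors = {"spam": 0.3, "ham": 0.7}                     # P(class)
cond = {                                               # P(word | class)
    "spam": {"free": 0.050, "invoice": 0.001},
    "ham":  {"free": 0.005, "invoice": 0.020},
}
observed = ["free", "invoice"]                         # words in the incoming email

# Work in log space so a long product of small probabilities doesn't underflow.
scores = {c: np.log(priors[c]) + sum(np.log(cond[c][w]) for w in observed)
          for c in priors}
print(max(scores, key=scores.get), scores)             # highest posterior wins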
Three flavors. Multinomial Naive Bayes suits count data (word counts) — spam filtering, language ID, news topic classification. Bernoulli Naive Bayes treats each feature as present/absent. Gaussian Naive Bayes assumes numeric features are normally distributed.
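A quick sketch of the three sklearn estimators on synthetic data (the shapes and distributions here are made up, just to show which kind of input each variant expects):

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

X_counts = rng.poisson(2.0, size=(200, 50))    # non-negative counts, e.g. word counts
MultinomialNB().fit(X_counts, y)

X_binary = (X_counts > 0).astype(int)          # presence/absence of each feature
BernoulliNB().fit(X_binary, y)

X_numeric = rng.normal(size=(200, 5))          # real-valued features, assumed Gaussian per class
GaussianNB().fit(X_numeric, y)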
Training is fast: estimate the conditional probabilities by counting. Prediction is fast too. Speed plus simplicity made it the standard text classification baseline for decades. Modern transformers do better, but Naive Bayes often gets close with a fraction of the data and compute.
Think of a doctor reasoning "given fever + cough + fatigue, how likely is flu?" while treating each symptom as independent evidence. In reality fever and cough aren't independent, but the simplification works most of the time. Naive Bayes is the math version of that line of thought.
A mail service has to score "spam?" within milliseconds. Each email is represented by its word counts. Training data: 2M labeled emails.
Multinomial Naive Bayes computes per-word probabilities conditioned on class. "viagra" is 200× more likely in spam, "invoice" 50× more likely in ham. New email arrives → multiply the conditional probabilities of its words → pick the class with the higher posterior.
Accuracy 96%+, training in 30 seconds, inference in microseconds. A transformer might hit 98% but at 100× cost and latency. Most production systems still keep Naive Bayes as a first-line filter.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Assumes texts_train/texts_test (raw email strings) and y_train/y_test
# (0 = ham, 1 = spam) have already been split from the labeled corpus.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=2)),
    ("clf", MultinomialNB(alpha=0.1)),  # additive (Laplace/Lidstone) smoothing
])
pipe.fit(texts_train, y_train)
preds = pipe.predict(texts_test)
print(classification_report(y_test, preds))

# Top spam-ish words: compare per-class log conditional probabilities
# (index 1 = spam, index 0 = ham, following the order of clf.classes_).
feature_names = pipe.named_steps["tfidf"].get_feature_names_out()
log_prob_diff = (
    pipe.named_steps["clf"].feature_log_prob_[1]
    - pipe.named_steps["clf"].feature_log_prob_[0]
)
top_spam = log_prob_diff.argsort()[-20:][::-1]
print("Most spam-indicative words:")
for i in top_spam:
    print(f" {feature_names[i]}: {log_prob_diff[i]:+.3f}")

Use it when
- Text classification (spam, topic, language) — industry-standard baseline
- You need very fast training and inference
- Limited training data with many features (words)
- A probabilistic, soft decision is needed
Avoid it when
- Inter-feature relationships matter — the independence assumption hurts
- Nonlinear interactions among numeric features — gradient boosting wins
- You need calibrated probabilities — Naive Bayes pushes them toward 0/1
Zero-probability problem
A word never seen with a given class during training → zero conditional probability → the entire product collapses to zero. Laplace smoothing (the alpha parameter) fixes it; keep alpha > 0.
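A tiny sketch of what additive smoothing does to the estimate (the counts, alpha, and vocabulary size below are purely illustrative):

import numpy as np

counts = np.array([40, 10, 0])     # made-up word counts for one class; the last word was never seen
alpha = 1.0                        # any alpha > 0 avoids zero probabilities
V = counts.size                    # vocabulary size

unsmoothed = counts / counts.sum()                        # assigns 0 to the unseen word
smoothed = (counts + alpha) / (counts.sum() + alpha * V)  # P(word | class) with additive smoothing
print(unsmoothed, smoothed)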
Trusting probabilities literally
Naive Bayes is overconfident due to the independence trick. Use scores for ranking, not as calibrated probabilities.
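If downstream code does need probabilities it can trust, one common fix is to recalibrate the scores on held-out folds. A sketch with sklearn's CalibratedClassifierCV, mirroring the pipeline above (texts_train/y_train are assumed as before):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

calibrated = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # Platt-style sigmoid calibration fit on 5 cross-validation folds;
    # "isotonic" is an alternative when there is plenty of data.
    ("clf", CalibratedClassifierCV(MultinomialNB(alpha=0.1), method="sigmoid", cv=5)),
])
calibrated.fit(texts_train, y_train)
probs = calibrated.predict_proba(texts_test)[:, 1]   # closer to true frequencies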
Ignoring the independence assumption
Highly correlated features (price + cost) double-count evidence. Drop one or use feature selection.
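A quick way to see the double-counting on synthetic data (GaussianNB used here for simplicity): duplicate a single informative column and the predicted probabilities get more extreme, even though no new information was added.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
x = rng.normal(loc=y, scale=1.0)           # one informative feature, shifted by class

X_single = x.reshape(-1, 1)
X_doubled = np.column_stack([x, x])        # the same evidence, counted twice

p1 = GaussianNB().fit(X_single, y).predict_proba(X_single)[:, 1]
p2 = GaussianNB().fit(X_doubled, y).predict_proba(X_doubled)[:, 1]

# Mean distance from 0.5: the duplicated column pushes scores toward 0/1.
print(np.abs(p1 - 0.5).mean(), np.abs(p2 - 0.5).mean())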