Naive Bayes
Probabilistic classifier with the independence trick
Applies Bayes' theorem assuming features are conditionally independent — simple, but a strong baseline for text classification.
Naive Bayes flips the question via Bayes' theorem: instead of modeling P(class | features) directly, it estimates how likely each feature is under each class and combines them: P(class | features) ∝ P(class) × ∏ P(feature | class). The "naive" part: features are assumed conditionally independent given the class. That's almost never literally true, but it makes the math dramatically simpler — and the model still works.
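A minimal sketch of that product rule with made-up numbers (the priors and per-class word probabilities below are illustrative, not learned from any real corpus): score each class as log prior plus the sum of log conditionals, then pick the larger.

import numpy as np

# Toy two-class spam model (numbers are illustrative only).
priors = {"spam": 0.3, "ham": 0.7}                     # P(class)
cond = {                                               # P(word | class)
    "spam": {"free": 0.050, "invoice": 0.001},
    "ham":  {"free": 0.005, "invoice": 0.020},
}
observed = ["free", "invoice"]                         # words in the incoming email

# Work in log space so a long product of small probabilities doesn't underflow.
scores = {c: np.log(priors[c]) + sum(np.log(cond[c][w]) for w in observed)
          for c in priors}
print(max(scores, key=scores.get), scores)             # highest posterior wins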
Three flavors. Multinomial Naive Bayes suits count data (word counts) — spam filtering, language ID, news topic classification. Bernoulli Naive Bayes treats each feature as present/absent. Gaussian Naive Bayes assumes numeric features are normally distributed.
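A quick sketch of the three sklearn estimators on synthetic data (the shapes and distributions here are made up, just to show which kind of input each variant expects):

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

X_counts = rng.poisson(2.0, size=(200, 50))    # non-negative counts, e.g. word counts
MultinomialNB().fit(X_counts, y)

X_binary = (X_counts > 0).astype(int)          # presence/absence of each feature
BernoulliNB().fit(X_binary, y)

X_numeric = rng.normal(size=(200, 5))          # real-valued features, assumed Gaussian per class
GaussianNB().fit(X_numeric, y)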
Training is fast: estimate the conditional probabilities by counting. Prediction is fast too. Speed plus simplicity made it the standard text classification baseline for decades. Modern transformers do better, but Naive Bayes often gets close with a fraction of the data and compute.
Think of a doctor reasoning "given fever + cough + fatigue, how likely is flu?" while treating each symptom as independent evidence. In reality fever and cough aren't independent, but the simplification works most of the time. Naive Bayes is the math version of that line of thought.
A mail service has to score "spam?" within milliseconds. Each email is represented by its word counts. Training data: 2M labeled emails.
Multinomial Naive Bayes computes per-word probabilities conditioned on class. "viagra" is 200× more likely in spam, "invoice" 50× more likely in ham. New email arrives → multiply the conditional probabilities of its words → pick the class with the higher posterior.
Accuracy 96%+, training in 30 seconds, inference in microseconds. A transformer might hit 98% but at 100× cost and latency. Most production systems still keep Naive Bayes as a first-line filter.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Assumes texts_train/texts_test (raw email strings) and y_train/y_test
# (0 = ham, 1 = spam) have already been split from the labeled corpus.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=2)),
    ("clf", MultinomialNB(alpha=0.1)),  # additive (Laplace/Lidstone) smoothing
])
pipe.fit(texts_train, y_train)
preds = pipe.predict(texts_test)
print(classification_report(y_test, preds))

# Top spam-ish words: compare per-class log conditional probabilities
# (index 1 = spam, index 0 = ham, following the order of clf.classes_).
feature_names = pipe.named_steps["tfidf"].get_feature_names_out()
log_prob_diff = (
    pipe.named_steps["clf"].feature_log_prob_[1]
    - pipe.named_steps["clf"].feature_log_prob_[0]
)
top_spam = log_prob_diff.argsort()[-20:][::-1]
print("Most spam-indicative words:")
for i in top_spam:
    print(f" {feature_names[i]}: {log_prob_diff[i]:+.3f}")

Use it when
- Text classification (spam, topic, language) — industry-standard baseline
- You need very fast training and inference
- Limited training data with many features (words)
- A probabilistic, soft decision is needed
Avoid it when
- Inter-feature relationships matter — the independence assumption hurts
- Nonlinear interactions among numeric features — gradient boosting wins
- You need calibrated probabilities — Naive Bayes pushes them toward 0/1
Zero-probability problem
A word never seen with a given class during training → zero conditional probability → the entire product collapses to zero. Laplace smoothing (the alpha parameter) fixes it; keep alpha > 0.
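A tiny sketch of what additive smoothing does to the estimate (the counts, alpha, and vocabulary size below are purely illustrative):

import numpy as np

counts = np.array([40, 10, 0])     # made-up word counts for one class; the last word was never seen
alpha = 1.0                        # any alpha > 0 avoids zero probabilities
V = counts.size                    # vocabulary size

unsmoothed = counts / counts.sum()                        # assigns 0 to the unseen word
smoothed = (counts + alpha) / (counts.sum() + alpha * V)  # P(word | class) with additive smoothing
print(unsmoothed, smoothed)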
Trusting probabilities literally
Naive Bayes is overconfident due to the independence trick. Use scores for ranking, not as calibrated probabilities.
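If downstream code does need probabilities it can trust, one common fix is to recalibrate the scores on held-out folds. A sketch with sklearn's CalibratedClassifierCV, mirroring the pipeline above (texts_train/y_train are assumed as before):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

calibrated = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # Platt-style sigmoid calibration fit on 5 cross-validation folds;
    # "isotonic" is an alternative when there is plenty of data.
    ("clf", CalibratedClassifierCV(MultinomialNB(alpha=0.1), method="sigmoid", cv=5)),
])
calibrated.fit(texts_train, y_train)
probs = calibrated.predict_proba(texts_test)[:, 1]   # closer to true frequencies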
Ignoring the independence assumption
Highly correlated features (price + cost) double-count evidence. Drop one or use feature selection.
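A quick way to see the double-counting on synthetic data (GaussianNB used here for simplicity): duplicate a single informative column and the predicted probabilities get more extreme, even though no new information was added.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
x = rng.normal(loc=y, scale=1.0)           # one informative feature, shifted by class

X_single = x.reshape(-1, 1)
X_doubled = np.column_stack([x, x])        # the same evidence, counted twice

p1 = GaussianNB().fit(X_single, y).predict_proba(X_single)[:, 1]
p2 = GaussianNB().fit(X_doubled, y).predict_proba(X_doubled)[:, 1]

# Mean distance from 0.5: the duplicated column pushes scores toward 0/1.
print(np.abs(p1 - 0.5).mean(), np.abs(p2 - 0.5).mean())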