Overfitting
Memorizing instead of learning
When a model memorizes the training data and fails on unseen data — the most common, most dangerous trap in ML.
Overfitting happens when the model learns the noise and edge cases of the training set instead of the underlying pattern. Training accuracy keeps climbing while test accuracy drops — the classic signature. The model is overconfident, and reality won't reward that confidence.
Common causes: an over-flexible model (too many parameters, deep trees, too many iterations), insufficient or homogeneous training data, or distribution differences between train and test that were ignored. When model capacity exceeds what the data actually says, the model treats noise as pattern.
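A minimal sketch of capacity outrunning the data: an unconstrained decision tree fit on pure noise. The dataset below is synthetic random labels, so there is genuinely nothing to learn, yet the tree still "learns" it perfectly.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Pure noise: random features, random labels -- no real pattern exists.
rng = np.random.RandomState(0)
X_noise = rng.randn(800, 20)
y_noise = rng.randint(0, 2, size=800)

X_tr, X_te, y_tr, y_te = train_test_split(X_noise, y_noise, test_size=0.25, random_state=0)

# An unconstrained tree has enough capacity to memorize every training row.
tree = DecisionTreeClassifier(max_depth=None, random_state=0)
tree.fit(X_tr, y_tr)

print("train accuracy:", tree.score(X_tr, y_tr))  # close to 1.0: memorization
print("test accuracy:", tree.score(X_te, y_te))   # around 0.5: chance level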
Standard defenses: validation set, cross-validation, regularization (L1, L2, dropout), early stopping, data augmentation, and getting more data.
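A minimal sketch of a few of those defenses together, using scikit-learn's SGDClassifier; the dataset is a synthetic placeholder and the alpha value and split sizes are illustrative, not recommendations.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Placeholder data; substitute your own features and labels.
X_demo, y_demo = make_classification(n_samples=2000, n_features=30, random_state=0)

# Hold out a test set that is only touched at final evaluation.
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# L2 regularization (penalty, alpha) plus early stopping: training halts
# once the score on an internal validation split stops improving.
clf = SGDClassifier(
    loss="log_loss",
    penalty="l2", alpha=1e-3,
    early_stopping=True, validation_fraction=0.1, n_iter_no_change=5,
    max_iter=1000, random_state=0,
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))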
A student who memorizes the exact exam questions. They can answer every drilled question perfectly. On the final, the wording changes slightly — they freeze. They never understood the topic, only memorized question-answer pairs. Training accuracy 100%, test accuracy 50%. An overfit model is exactly that student.
A team trains a fraud detector on a small 800-row dataset. Random forest, no max_depth, 1000 trees. Training accuracy 99.8%. Excitement; ship it. Week one in production: real accuracy is 62%. Customer complaints explode.
The diagnosis: the model memorized 800 specific examples. "Tuesday 23:14, restaurant category, 487 TRY = ham" — meaningless memorization. Retraining with max_depth=8, 100 trees, 5-fold CV: training drops to 88% but test holds at 85%. The product succeeds. Letting go of memorization raised real accuracy.
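A rough sketch of that retraining step, where X and y stand in for the 800-row fraud table; feature preparation is omitted.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Constrained forest, roughly matching the fix described above.
model = RandomForestClassifier(max_depth=8, n_estimators=100, random_state=0)

# 5-fold CV scores the model on folds it was not trained on,
# instead of trusting training accuracy.
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

The learning-curve diagnostic below tells the same story visually.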
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib.pyplot as plt

# X, y: the feature matrix and labels of your dataset (not defined here).
# Score the same model on growing subsets of the training data.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(max_depth=None, n_estimators=300),
    X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 8),
    scoring="f1_macro",
)

# Plot both curves: a train curve sitting far above the validation curve
# is the overfitting signature.
plt.plot(train_sizes, train_scores.mean(axis=1), label="Train")
plt.plot(train_sizes, val_scores.mean(axis=1), label="Validation")
plt.xlabel("Training samples")
plt.ylabel("F1")
plt.legend()
plt.savefig("learning_curve.png")

# Size of the gap at the largest training size.
gap = train_scores.mean(axis=1)[-1] - val_scores.mean(axis=1)[-1]
print(f"Train-validation gap: {gap:.3f}")- Concept — to be checked in every ML workflow
- Concept — never stop checking
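Read the result like the story above: a gap that stays large even as the training set grows means the model is memorizing; a gap that is still shrinking suggests more data will help.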
Looking only at training accuracy
Seeing 99% on the training set and concluding 'the model is great' is the oldest mistake in the book. Without a held-out set, you know nothing about how it performs on unseen data.
Peeking at the test set
Tweaking hyperparameters by checking the test set leaks information. Keep three splits — train, validation, test — and only touch the test set at final evaluation.
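One way to get the three splits is two chained train_test_split calls; X and y are your full dataset, and the 80/20 and 75/25 ratios are only an example.

from sklearn.model_selection import train_test_split

# Carve off the test set first, then split the remainder into train and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0,  # 0.25 of 80% = 20% of the total
)
# Tune hyperparameters against X_val/y_val; score X_test/y_test once, at the very end.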
Misreading cross-validation
CV gives a sane estimate but doesn't protect you from domain shift. Production data might come from a different distribution.
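If the data has a time dimension, one sanity check is to hold out the most recent slice instead of a random shuffle, which mimics the train-then-deploy gap. The sketch below assumes the rows of X and y are already ordered oldest to newest.

from sklearn.ensemble import RandomForestClassifier

# Train on the older 80%, evaluate on the most recent 20%.
cutoff = int(len(X) * 0.8)
model = RandomForestClassifier(max_depth=8, n_estimators=100, random_state=0)
model.fit(X[:cutoff], y[:cutoff])

# A score drop here, compared to shuffled cross-validation,
# is an early warning that the data distribution is drifting.
print("recent-slice accuracy:", model.score(X[cutoff:], y[cutoff:]))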