Overfitting
Memorizing instead of learning
When a model memorizes the training data and fails on unseen data — the most common, most dangerous trap in ML.
Overfitting happens when the model learns the noise and edge cases of the training set instead of the underlying pattern. Training accuracy keeps climbing while test accuracy drops — the classic signature. The model is overconfident, and reality won't reward that confidence.
Common causes: an over-flexible model (too many parameters, deep trees, too many iterations), insufficient or homogeneous training data, or distribution differences between train and test that were ignored. When model capacity exceeds what the data actually says, the model treats noise as pattern.
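A minimal sketch of capacity outrunning the data: an unconstrained decision tree fit on pure noise. The dataset below is synthetic random labels, so there is genuinely nothing to learn, yet the tree still "learns" it perfectly.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Pure noise: random features, random labels -- no real pattern exists.
rng = np.random.RandomState(0)
X_noise = rng.randn(800, 20)
y_noise = rng.randint(0, 2, size=800)

X_tr, X_te, y_tr, y_te = train_test_split(X_noise, y_noise, test_size=0.25, random_state=0)

# An unconstrained tree has enough capacity to memorize every training row.
tree = DecisionTreeClassifier(max_depth=None, random_state=0)
tree.fit(X_tr, y_tr)

print("train accuracy:", tree.score(X_tr, y_tr))  # close to 1.0: memorization
print("test accuracy:", tree.score(X_te, y_te))   # around 0.5: chance level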
Standard defenses: validation set, cross-validation, regularization (L1, L2, dropout), early stopping, data augmentation, and getting more data.
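A minimal sketch of a few of those defenses together, using scikit-learn's SGDClassifier; the dataset is a synthetic placeholder and the alpha value and split sizes are illustrative, not recommendations.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Placeholder data; substitute your own features and labels.
X_demo, y_demo = make_classification(n_samples=2000, n_features=30, random_state=0)

# Hold out a test set that is only touched at final evaluation.
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# L2 regularization (penalty, alpha) plus early stopping: training halts
# once the score on an internal validation split stops improving.
clf = SGDClassifier(
    loss="log_loss",
    penalty="l2", alpha=1e-3,
    early_stopping=True, validation_fraction=0.1, n_iter_no_change=5,
    max_iter=1000, random_state=0,
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))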
A student who memorizes the exact exam questions. They can answer every drilled question perfectly. On the final, the wording changes slightly — they freeze. They never understood the topic, only memorized question-answer pairs. Training accuracy 100%, test accuracy 50%. An overfit model is exactly that student.
A team trains a fraud detector on a small 800-row dataset. Random forest, no max_depth, 1000 trees. Training accuracy 99.8%. Excitement; ship it. Week one in production: real accuracy is 62%. Customer complaints explode.
The diagnosis: the model memorized 800 specific examples. "Tuesday 23:14, restaurant category, 487 TRY = ham" — meaningless memorization. Retraining with max_depth=8, 100 trees, 5-fold CV: training drops to 88% but test holds at 85%. The product succeeds. Letting go of memorization raised real accuracy.
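A rough sketch of that retraining step, where X and y stand in for the 800-row fraud table; feature preparation is omitted.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Constrained forest, roughly matching the fix described above.
model = RandomForestClassifier(max_depth=8, n_estimators=100, random_state=0)

# 5-fold CV scores the model on folds it was not trained on,
# instead of trusting training accuracy.
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

The learning-curve diagnostic below tells the same story visually.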
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib.pyplot as plt

# X, y: the feature matrix and labels of your dataset (not defined here).
# Score the same model on growing subsets of the training data.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(max_depth=None, n_estimators=300),
    X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 8),
    scoring="f1_macro",
)

# Plot both curves: a train curve sitting far above the validation curve
# is the overfitting signature.
plt.plot(train_sizes, train_scores.mean(axis=1), label="Train")
plt.plot(train_sizes, val_scores.mean(axis=1), label="Validation")
plt.xlabel("Training samples")
plt.ylabel("F1")
plt.legend()
plt.savefig("learning_curve.png")

# Size of the gap at the largest training size.
gap = train_scores.mean(axis=1)[-1] - val_scores.mean(axis=1)[-1]
print(f"Train-validation gap: {gap:.3f}")- Concept — to be checked in every ML workflow
- Concept — never stop checking
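Read the result like the story above: a gap that stays large even as the training set grows means the model is memorizing; a gap that is still shrinking suggests more data will help.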
Looking only at training accuracy
Seeing 99% on the training set and concluding 'the model is great' is the oldest mistake in the book. Without a held-out set, you know nothing about how it performs on unseen data.
Peeking at the test set
Tweaking hyperparameters by checking the test set leaks information. Keep three splits — train, validation, test — and only touch the test set at final evaluation.
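One way to get the three splits is two chained train_test_split calls; X and y are your full dataset, and the 80/20 and 75/25 ratios are only an example.

from sklearn.model_selection import train_test_split

# Carve off the test set first, then split the remainder into train and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0,  # 0.25 of 80% = 20% of the total
)
# Tune hyperparameters against X_val/y_val; score X_test/y_test once, at the very end.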
Misreading cross-validation
CV gives a sane estimate but doesn't protect you from domain shift. Production data might come from a different distribution.
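If the data has a time dimension, one sanity check is to hold out the most recent slice instead of a random shuffle, which mimics the train-then-deploy gap. The sketch below assumes the rows of X and y are already ordered oldest to newest.

from sklearn.ensemble import RandomForestClassifier

# Train on the older 80%, evaluate on the most recent 20%.
cutoff = int(len(X) * 0.8)
model = RandomForestClassifier(max_depth=8, n_estimators=100, random_state=0)
model.fit(X[:cutoff], y[:cutoff])

# A score drop here, compared to shuffled cross-validation,
# is an early warning that the data distribution is drifting.
print("recent-slice accuracy:", model.score(X[cutoff:], y[cutoff:]))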