Cross-Validation
Honest performance estimation
Splits the data into K parts and rotates which one is used for validation, averaging the scores — a more reliable estimate than a single train/test split.
A single train/test split can get lucky or unlucky. Cross-validation (CV) lowers that randomness by splitting the data into K equal folds. Each round uses K-1 folds for training and 1 for validation. After K rounds every example has appeared in validation once; the mean of the K scores is your performance estimate.
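Concretely, the rotation looks like this. A minimal sketch of the loop that cross_val_score automates further down, with make_classification standing in for your own X and y:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1_000, random_state=42)  # placeholder data
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, val_idx in kf.split(X):
    clf = LogisticRegression(max_iter=1_000)
    clf.fit(X[train_idx], y[train_idx])               # train on K-1 folds
    scores.append(clf.score(X[val_idx], y[val_idx]))  # validate on the held-out fold
print(f"mean accuracy: {np.mean(scores):.3f} ± {np.std(scores):.3f}")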
The most common variant is K-Fold (typically K=5 or 10). With class imbalance, Stratified K-Fold preserves class ratios in each fold. For temporal data, TimeSeriesSplit walks forward without leaking the future. With user/group structure, GroupKFold keeps a group's rows on the same side of the split.
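As an example of the group-aware variant, a minimal GroupKFold sketch; user_ids is a hypothetical array with one group label per row:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_classification(n_samples=1_000, random_state=0)       # placeholder data
user_ids = np.random.default_rng(0).integers(0, 100, size=1_000)  # hypothetical group labels

# all rows of a given user land on the same side of every split
gkf = GroupKFold(n_splits=5)
scores = cross_val_score(LogisticRegression(max_iter=1_000), X, y,
                         cv=gkf, groups=user_ids)
print(scores.mean())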
CV has two main uses. Hyperparameter search: GridSearchCV and RandomizedSearchCV compute CV scores for each candidate and pick the best. Honest performance estimation: the fold-to-fold spread shows how much the model's score swings, not just its value on one test set.
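Only GridSearchCV appears in the worked example further down; a minimal sketch of the randomized variant, where param_dist and its sampling ranges are illustrative placeholders, not tuned recommendations:

from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=1_000, random_state=42)  # placeholder data

param_dist = {
    "n_estimators": randint(100, 600),    # sampled, not enumerated
    "learning_rate": uniform(0.01, 0.2),  # uniform over [0.01, 0.21]
}
rs = RandomizedSearchCV(
    GradientBoostingClassifier(), param_dist, n_iter=20,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc", n_jobs=-1, random_state=42,
)
rs.fit(X, y)
print(rs.best_params_, round(rs.best_score_, 3))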
A teacher who wants to gauge a student's true grasp gives them five questions on the same topic. A single answer can be a lucky guess or an off day; the five-question average paints a fairer picture. CV does the same with a model, never trusting a single test fold.
An insurer trains a claims model on 12,000 customers. First, the classic 80/20 split: train AUC 0.91, test AUC 0.83. Is that overfitting, or just an unlucky split? A single split can't say.
5-fold cross-validation: 0.82, 0.84, 0.79, 0.85, 0.83 — mean 0.826, std 0.022. The real performance hovers around 0.82, swinging ±0.02. The single 0.83 was typical; no overfit. The team plans on the realistic ~0.82 with a 0.79 worst case.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, GridSearchCV,
)

# placeholder data so the example runs; substitute your own 12,000-customer dataset
X, y = make_classification(n_samples=12_000, random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingClassifier()
# one AUC per fold; the mean is the headline number, the spread its stability
scores = cross_val_score(model, X, y, cv=skf, scoring="roc_auc", n_jobs=-1)
print(f"AUC: {scores.mean():.3f} ± {scores.std():.3f}")
print(f"Fold scores: {scores}")
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}
# every parameter combination is scored with the same stratified 5-fold CV
gs = GridSearchCV(model, param_grid, cv=skf, scoring="roc_auc", n_jobs=-1)
gs.fit(X, y)
print(f"Best: {gs.best_params_}, AUC={gs.best_score_:.3f}")- Honestly estimating model performance
- Small data — a single test fold isn't trustworthy
- Hyperparameter search — to avoid overfitting to one split
- Comparing models — statistically meaningful comparison
When not to use
- Massive data and expensive training — a single split is enough
- Sequential data — plain K-fold leaks the future; use TimeSeriesSplit
- User/group records spread across folds → leakage; use GroupKFold
Preprocessing leakage
Fitting a scaler or feature selector on the whole dataset before CV leaks validation information into training. Wrap preprocessing in a Pipeline so each fold re-fits it on its own training data.
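A minimal sketch of the safe pattern, with StandardScaler standing in for whatever fitted preprocessing you use:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, random_state=0)  # placeholder data

# the scaler is re-fit on each fold's training portion, never on validation rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
print(cross_val_score(pipe, X, y, cv=5).mean())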
Skipping stratification
Plain K-fold with a 1% positive class can leave some folds with zero positives. Use StratifiedKFold.
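A quick synthetic demonstration: with 10 positives in 1,000 rows, stratification guarantees two positives per fold, while plain K-fold makes no such promise:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([1] * 10 + [0] * 990)  # 1% positive class
X = np.zeros((1_000, 1))            # placeholder features

for name, cv in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    print(name, [int(y[val].sum()) for _, val in cv.split(X, y)])
# StratifiedKFold puts exactly 2 of the 10 positives in every fold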
K-fold on time series
Training on the future and validating on the past inflates results. Use TimeSeriesSplit (or rolling/expanding windows).
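A minimal sketch, assuming rows are already sorted by time:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # placeholder: 100 rows already sorted by time

for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    # every training index precedes every validation index: no future leakage
    print(f"train up to row {train_idx[-1]}, validate rows {val_idx[0]}-{val_idx[-1]}")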