Cross-Validation
Honest performance estimation
Splits the data into K parts and rotates which one is used for validation, averaging the scores — a more reliable estimate than a single train/test split.
A single train/test split can get lucky or unlucky. Cross-validation (CV) lowers that randomness by splitting the data into K equal folds. Each round uses K-1 folds for training and 1 for validation. After K rounds every example has appeared in validation once; the mean of the K scores is your performance estimate.
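Concretely, the rotation looks like this. A minimal sketch of the loop that cross_val_score automates further down, with make_classification standing in for your own X and y:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1_000, random_state=42)  # placeholder data
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, val_idx in kf.split(X):
    clf = LogisticRegression(max_iter=1_000)
    clf.fit(X[train_idx], y[train_idx])               # train on K-1 folds
    scores.append(clf.score(X[val_idx], y[val_idx]))  # validate on the held-out fold
print(f"mean accuracy: {np.mean(scores):.3f} ± {np.std(scores):.3f}")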
The most common variant is K-Fold (typically K=5 or 10). With class imbalance, Stratified K-Fold preserves class ratios in each fold. For temporal data, TimeSeriesSplit walks forward without leaking the future. With user/group structure, GroupKFold keeps a group's rows on the same side of the split.
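As an example of the group-aware variant, a minimal GroupKFold sketch; user_ids is a hypothetical array with one group label per row:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_classification(n_samples=1_000, random_state=0)       # placeholder data
user_ids = np.random.default_rng(0).integers(0, 100, size=1_000)  # hypothetical group labels

# all rows of a given user land on the same side of every split
gkf = GroupKFold(n_splits=5)
scores = cross_val_score(LogisticRegression(max_iter=1_000), X, y,
                         cv=gkf, groups=user_ids)
print(scores.mean())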
CV has two main uses. Hyperparameter search: GridSearchCV and RandomizedSearchCV compute CV scores for each candidate and pick the best. Honest performance estimation: the fold-to-fold spread shows how much the model's score swings, not just its value on one test set.
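Only GridSearchCV appears in the worked example further down; a minimal sketch of the randomized variant, where param_dist and its sampling ranges are illustrative placeholders, not tuned recommendations:

from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=1_000, random_state=42)  # placeholder data

param_dist = {
    "n_estimators": randint(100, 600),    # sampled, not enumerated
    "learning_rate": uniform(0.01, 0.2),  # uniform over [0.01, 0.21]
}
rs = RandomizedSearchCV(
    GradientBoostingClassifier(), param_dist, n_iter=20,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc", n_jobs=-1, random_state=42,
)
rs.fit(X, y)
print(rs.best_params_, round(rs.best_score_, 3))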
A teacher who wants to gauge a student's true grasp gives them five questions on the same topic. A single answer can be a lucky guess or an off day; the five-question average paints a fairer picture. CV does the same with a model, never trusting a single test fold.
An insurer trains a claims model on 12,000 customers. First, the classic 80/20 split: train AUC 0.91, test AUC 0.83. Is that overfitting, or just an unlucky split? A single split can't say.
5-fold cross-validation: 0.82, 0.84, 0.79, 0.85, 0.83 — mean 0.826, std 0.022. The real performance hovers around 0.82, swinging ±0.02. The single 0.83 was typical; no overfit. The team plans on the realistic ~0.82 with a 0.79 worst case.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, GridSearchCV,
)

# placeholder data so the example runs; substitute your own 12,000-customer dataset
X, y = make_classification(n_samples=12_000, random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingClassifier()
# one AUC per fold; the mean is the headline number, the spread its stability
scores = cross_val_score(model, X, y, cv=skf, scoring="roc_auc", n_jobs=-1)
print(f"AUC: {scores.mean():.3f} ± {scores.std():.3f}")
print(f"Fold scores: {scores}")
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}
# every parameter combination is scored with the same stratified 5-fold CV
gs = GridSearchCV(model, param_grid, cv=skf, scoring="roc_auc", n_jobs=-1)
gs.fit(X, y)
print(f"Best: {gs.best_params_}, AUC={gs.best_score_:.3f}")- Honestly estimating model performance
- Small data — a single test fold isn't trustworthy
- Hyperparameter search — to avoid overfitting to one split
- Comparing models — statistically meaningful comparison
When not to use
- Massive data and expensive training — a single split is enough
- Sequential data — plain K-fold leaks the future; use TimeSeriesSplit
- User/group records spread across folds → leakage; use GroupKFold
Preprocessing leakage
Fitting a scaler or feature selector on the whole dataset before CV leaks validation information into training. Wrap preprocessing in a Pipeline so each fold re-fits it on its own training data.
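A minimal sketch of the safe pattern, with StandardScaler standing in for whatever fitted preprocessing you use:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, random_state=0)  # placeholder data

# the scaler is re-fit on each fold's training portion, never on validation rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
print(cross_val_score(pipe, X, y, cv=5).mean())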
Skipping stratification
Plain K-fold with a 1% positive class can leave some folds with zero positives. Use StratifiedKFold.
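A quick synthetic demonstration: with 10 positives in 1,000 rows, stratification guarantees two positives per fold, while plain K-fold makes no such promise:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([1] * 10 + [0] * 990)  # 1% positive class
X = np.zeros((1_000, 1))            # placeholder features

for name, cv in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    print(name, [int(y[val].sum()) for _, val in cv.split(X, y)])
# StratifiedKFold puts exactly 2 of the 10 positives in every fold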
K-fold on time series
Training on the future and validating on the past inflates results. Use TimeSeriesSplit (or rolling/expanding windows).
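A minimal sketch, assuming rows are already sorted by time:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # placeholder: 100 rows already sorted by time

for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    # every training index precedes every validation index: no future leakage
    print(f"train up to row {train_idx[-1]}, validate rows {val_idx[0]}-{val_idx[-1]}")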