ROC and AUC
Threshold-independent classifier ranking
ROC plots true vs false positive rates across all thresholds; AUC is the area under that curve. Together they evaluate a classifier's ranking ability without committing to a threshold.
The ROC curve plots True Positive Rate (TPR = recall) vs False Positive Rate (FPR) as the threshold varies. A curve hugging the top-left corner is good; the diagonal is random guessing. The area under the curve, AUC, is 0.5 for random and 1.0 for perfect.
AUC is valuable because it evaluates ranking without forcing a threshold. Specifically: it's the probability that a random positive scores higher than a random negative — making it a fair, objective metric to compare different models.
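A quick way to see that probabilistic reading, using synthetic labels and scores (everything below is made up purely for illustration): count, over all positive-negative pairs, how often the positive outscores the negative, and compare with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                 # synthetic 0/1 labels
scores = y * 0.8 + rng.normal(0, 0.5, size=200)  # synthetic scores that loosely track the labels

pos, neg = scores[y == 1], scores[y == 0]
# Fraction of (positive, negative) pairs where the positive outscores the negative
pairwise = (pos[:, None] > neg[None, :]).mean()

print(roc_auc_score(y, scores))  # matches the pairwise fraction (no ties here)
print(pairwise)
```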
Under heavy class imbalance, ROC-AUC can look optimistic. Precision-Recall AUC (Average Precision) is then more informative — it focuses on the positive class. Roughly: ROC-AUC for balanced problems, PR-AUC for highly imbalanced.
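A small illustration of that gap on a synthetic problem with roughly 1% positives (the exact numbers will vary; the point is only that ROC-AUC tends to read high while PR-AUC stays modest):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Highly imbalanced synthetic problem: ~1% positives
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = clf.predict_proba(X_te)[:, 1]

# ROC-AUC often looks strong here while PR-AUC stays modest
print("ROC-AUC:", roc_auc_score(y_te, p))
print("PR-AUC :", average_precision_score(y_te, p))
```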
Picture a teacher ranking 100 students by "how likely is this kid to finish first?". To test the ranking, you sample actual top finishers and check whether they sit near the top of the list. AUC quantifies exactly that: how often is a randomly chosen positive ranked above a randomly chosen negative? Always → AUC = 1.0; half the time → AUC = 0.5 (random).
An insurer compares logistic regression, random forest, and LightGBM. No imbalance issue, but no threshold yet either. Comparing by accuracy or F1 would lock in a threshold; AUC compares them threshold-free.
| Model | ROC-AUC | PR-AUC |
|-------|---------|--------|
| Logistic | 0.79 | 0.34 |
| Random Forest | 0.86 | 0.48 |
| LightGBM | 0.91 | 0.56 |
LightGBM wins on both general ranking and positive-focused ranking. Threshold selection happens later; the model winner is clear.
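A hedged sketch of how such a comparison might be run with scikit-learn's cross_val_score (the dataset below is synthetic, so the scores won't match the table; scoring="average_precision" gives the PR-AUC column):

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the insurer's data
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

models = {
    "Logistic": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "LightGBM": LGBMClassifier(random_state=0),
}

# Threshold-free, cross-validated comparison of ranking quality
for name, est in models.items():
    roc = cross_val_score(est, X, y, cv=5, scoring="roc_auc").mean()
    pr = cross_val_score(est, X, y, cv=5, scoring="average_precision").mean()
    print(f"{name:>13}  ROC-AUC {roc:.3f}  PR-AUC {pr:.3f}")
```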
from sklearn.metrics import roc_curve, auc, roc_auc_score, average_precision_score
import matplotlib.pyplot as plt
probs = model.predict_proba(X_test)[:, 1]  # positive-class scores from an already-fitted classifier
fpr, tpr, thr = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)
roc_auc_score_v = roc_auc_score(y_test, probs)
ap = average_precision_score(y_test, probs)
print(f"ROC-AUC: {roc_auc_score_v:.4f}")
print(f"PR-AUC (Average Precision): {ap:.4f}")
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.3f}")
plt.plot([0, 1], [0, 1], "k--", alpha=0.5, label="random")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.savefig("roc.png", dpi=150, bbox_inches="tight")- Threshold-independent overall performance
- Objectively comparing multiple models
- Model selection before threshold is finalized
- Ranking problems (recommendation, search)
When to be careful
- Heavy class imbalance — ROC-AUC is optimistic; prefer PR-AUC
- A single fixed-threshold decision — go straight to precision/recall
- Multi-class — be careful with macro/micro/OvR/OvO aggregation
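On the multi-class point, a minimal sketch of the aggregation options exposed by scikit-learn's roc_auc_score (synthetic 3-class data, purely to show the API):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 3-class synthetic problem
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# One-vs-rest vs one-vs-one, macro vs weighted averaging can give different numbers
print(roc_auc_score(y_te, proba, multi_class="ovr", average="macro"))
print(roc_auc_score(y_te, proba, multi_class="ovo", average="weighted"))
```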
Treating AUC as calibrated probability
AUC is a ranking metric, not a probability. High AUC doesn't guarantee well-calibrated probabilities — apply Platt scaling or isotonic regression if needed.
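A minimal sketch of post-hoc calibration with scikit-learn's CalibratedClassifierCV, on synthetic data just to demonstrate the API (method="isotonic" here; method="sigmoid" gives Platt scaling):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration
X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

raw = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=300, random_state=0),
    method="isotonic",  # method="sigmoid" gives Platt scaling
    cv=5,
).fit(X_tr, y_tr)

p_raw = raw.predict_proba(X_te)[:, 1]
p_cal = calibrated.predict_proba(X_te)[:, 1]

# Calibration mainly improves probability quality (Brier score, lower is better);
# the ranking, and hence AUC, changes little
print("ROC-AUC raw/cal:", roc_auc_score(y_te, p_raw), roc_auc_score(y_te, p_cal))
print("Brier   raw/cal:", brier_score_loss(y_te, p_raw), brier_score_loss(y_te, p_cal))
```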
Reporting only ROC-AUC under heavy imbalance
A ROC-AUC of 0.95 can hide a PR-AUC of 0.3. Always report both.
Ignoring statistical significance
Is 0.86 vs 0.87 real or noise? Use DeLong's test to compare two AUCs properly.
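DeLong's test isn't built into scikit-learn (third-party implementations exist); a paired bootstrap over the same test set is a common stand-in. A rough sketch, where y_test, probs_lgbm, and probs_rf are placeholder names for one shared test set and two models' scores on it:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y_true, scores_a, scores_b, n_boot=2000, seed=0):
    """Paired bootstrap confidence interval for AUC(A) - AUC(B) on one shared test set."""
    y_true, scores_a, scores_b = map(np.asarray, (y_true, scores_a, scores_b))
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample rows with replacement
        if len(np.unique(y_true[idx])) < 2:              # AUC needs both classes present
            continue
        diffs.append(roc_auc_score(y_true[idx], scores_a[idx])
                     - roc_auc_score(y_true[idx], scores_b[idx]))
    return np.percentile(diffs, [2.5, 97.5])  # if this interval excludes 0, the gap is likely real

# Example with placeholder names: lo, hi = bootstrap_auc_diff(y_test, probs_lgbm, probs_rf)
```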