ROC and AUC
Threshold-independent classifier ranking
ROC plots true vs false positive rates across all thresholds; AUC is the area under that curve. Together they evaluate a classifier's ranking ability without committing to a threshold.
The ROC curve plots True Positive Rate (TPR = recall) vs False Positive Rate (FPR) as the threshold varies. A curve hugging the top-left corner is good; the diagonal is random guessing. The area under the curve, AUC, is 0.5 for random and 1.0 for perfect.
AUC is valuable because it evaluates ranking without forcing a threshold. Specifically: it's the probability that a random positive scores higher than a random negative — making it a fair, objective metric to compare different models.
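A quick way to see that probabilistic reading, using synthetic labels and scores (everything below is made up purely for illustration): count, over all positive-negative pairs, how often the positive outscores the negative, and compare with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                 # synthetic 0/1 labels
scores = y * 0.8 + rng.normal(0, 0.5, size=200)  # synthetic scores that loosely track the labels

pos, neg = scores[y == 1], scores[y == 0]
# Fraction of (positive, negative) pairs where the positive outscores the negative
pairwise = (pos[:, None] > neg[None, :]).mean()

print(roc_auc_score(y, scores))  # matches the pairwise fraction (no ties here)
print(pairwise)
```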
Under heavy class imbalance, ROC-AUC can look optimistic. Precision-Recall AUC (Average Precision) is then more informative — it focuses on the positive class. Roughly: ROC-AUC for balanced problems, PR-AUC for highly imbalanced.
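A small illustration of that gap on a synthetic problem with roughly 1% positives (the exact numbers will vary; the point is only that ROC-AUC tends to read high while PR-AUC stays modest):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Highly imbalanced synthetic problem: ~1% positives
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = clf.predict_proba(X_te)[:, 1]

# ROC-AUC often looks strong here while PR-AUC stays modest
print("ROC-AUC:", roc_auc_score(y_te, p))
print("PR-AUC :", average_precision_score(y_te, p))
```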
Picture a teacher ranking 100 students by "how likely is this kid to finish first?". To test the ranking, you sample actual top finishers and check whether they sit near the top of the list. AUC quantifies exactly that: how often is a randomly chosen positive ranked above a randomly chosen negative? Always → AUC = 1.0; half the time → AUC = 0.5 (random).
An insurer compares logistic regression, random forest, and LightGBM. No imbalance issue, but no threshold yet either. Comparing by accuracy or F1 would lock in a threshold; AUC compares them threshold-free.
| Model | ROC-AUC | PR-AUC |
|-------|---------|--------|
| Logistic | 0.79 | 0.34 |
| Random Forest | 0.86 | 0.48 |
| LightGBM | 0.91 | 0.56 |
LightGBM wins on both general ranking and positive-focused ranking. Threshold selection happens later; the model winner is clear.
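A hedged sketch of how such a comparison might be run with scikit-learn's cross_val_score (the dataset below is synthetic, so the scores won't match the table; scoring="average_precision" gives the PR-AUC column):

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the insurer's data
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

models = {
    "Logistic": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "LightGBM": LGBMClassifier(random_state=0),
}

# Threshold-free, cross-validated comparison of ranking quality
for name, est in models.items():
    roc = cross_val_score(est, X, y, cv=5, scoring="roc_auc").mean()
    pr = cross_val_score(est, X, y, cv=5, scoring="average_precision").mean()
    print(f"{name:>13}  ROC-AUC {roc:.3f}  PR-AUC {pr:.3f}")
```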
from sklearn.metrics import roc_curve, auc, roc_auc_score, average_precision_score
import matplotlib.pyplot as plt
probs = model.predict_proba(X_test)[:, 1]  # positive-class scores from an already-fitted classifier
fpr, tpr, thr = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)
roc_auc_score_v = roc_auc_score(y_test, probs)
ap = average_precision_score(y_test, probs)
print(f"ROC-AUC: {roc_auc_score_v:.4f}")
print(f"PR-AUC (Average Precision): {ap:.4f}")
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.3f}")
plt.plot([0, 1], [0, 1], "k--", alpha=0.5, label="random")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.savefig("roc.png", dpi=150, bbox_inches="tight")- Threshold-independent overall performance
- Objectively comparing multiple models
- Model selection before threshold is finalized
- Ranking problems (recommendation, search)
When to be careful
- Heavy class imbalance — ROC-AUC is optimistic; prefer PR-AUC
- A single fixed-threshold decision — go straight to precision/recall
- Multi-class — be careful with macro/micro/OvR/OvO aggregation
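On the multi-class point, a minimal sketch of the aggregation options exposed by scikit-learn's roc_auc_score (synthetic 3-class data, purely to show the API):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 3-class synthetic problem
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# One-vs-rest vs one-vs-one, macro vs weighted averaging can give different numbers
print(roc_auc_score(y_te, proba, multi_class="ovr", average="macro"))
print(roc_auc_score(y_te, proba, multi_class="ovo", average="weighted"))
```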
Treating AUC as calibrated probability
AUC is a ranking metric, not a probability. High AUC doesn't guarantee well-calibrated probabilities — apply Platt scaling or isotonic regression if needed.
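A minimal sketch of post-hoc calibration with scikit-learn's CalibratedClassifierCV, on synthetic data just to demonstrate the API (method="isotonic" here; method="sigmoid" gives Platt scaling):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration
X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

raw = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=300, random_state=0),
    method="isotonic",  # method="sigmoid" gives Platt scaling
    cv=5,
).fit(X_tr, y_tr)

p_raw = raw.predict_proba(X_te)[:, 1]
p_cal = calibrated.predict_proba(X_te)[:, 1]

# Calibration mainly improves probability quality (Brier score, lower is better);
# the ranking, and hence AUC, changes little
print("ROC-AUC raw/cal:", roc_auc_score(y_te, p_raw), roc_auc_score(y_te, p_cal))
print("Brier   raw/cal:", brier_score_loss(y_te, p_raw), brier_score_loss(y_te, p_cal))
```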
Reporting only ROC-AUC under heavy imbalance
A ROC-AUC of 0.95 can hide a PR-AUC of 0.3. Always report both.
Ignoring statistical significance
Is 0.86 vs 0.87 real or noise? Use DeLong's test to compare two AUCs properly.
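DeLong's test isn't built into scikit-learn (third-party implementations exist); a paired bootstrap over the same test set is a common stand-in. A rough sketch, where y_test, probs_lgbm, and probs_rf are placeholder names for one shared test set and two models' scores on it:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y_true, scores_a, scores_b, n_boot=2000, seed=0):
    """Paired bootstrap confidence interval for AUC(A) - AUC(B) on one shared test set."""
    y_true, scores_a, scores_b = map(np.asarray, (y_true, scores_a, scores_b))
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample rows with replacement
        if len(np.unique(y_true[idx])) < 2:              # AUC needs both classes present
            continue
        diffs.append(roc_auc_score(y_true[idx], scores_a[idx])
                     - roc_auc_score(y_true[idx], scores_b[idx]))
    return np.percentile(diffs, [2.5, 97.5])  # if this interval excludes 0, the gap is likely real

# Example with placeholder names: lo, hi = bootstrap_auc_diff(y_test, probs_lgbm, probs_rf)
```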