Confusion Matrix
Where the model gets it right and wrong
A 2×2 (or N×N) table that contrasts predictions with true labels — the fastest way to see *what kind* of mistake your classifier makes.
For binary classification, the confusion matrix is a 2×2 table. Rows are true labels, columns are predictions. The four cells: TP (true positive — correctly called positive), TN (true negative — correctly called negative), FP (false positive — predicted positive, actually negative), FN (false negative — predicted negative, actually positive).
Every basic metric flows from these. Accuracy = (TP+TN)/total. Precision = TP/(TP+FP). Recall = TP/(TP+FN). Specificity = TN/(TN+FP). FPR = FP/(FP+TN). Reading the matrix is the fastest way to see which error type the model is biased toward.
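A minimal sketch (with made-up cell counts, not from any real model) shows how each of those metrics falls out of the four cells:

```python
# Hypothetical cell counts for a binary classifier on 1,000 test examples
tp, fn, fp, tn = 40, 10, 20, 930

total = tp + tn + fp + fn
accuracy = (tp + tn) / total       # 970/1000 = 0.97
precision = tp / (tp + fp)         # 40/60  ≈ 0.67: how many flagged positives are real
recall = tp / (tp + fn)            # 40/50  = 0.80: how many real positives were caught
specificity = tn / (tn + fp)       # 930/950 ≈ 0.98
fpr = fp / (fp + tn)               # 20/950 ≈ 0.02, i.e. 1 - specificity
print(accuracy, precision, recall, specificity, fpr)
```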
For multi-class, the matrix is N×N. Diagonal = correct, off-diagonal = confusions between classes. Insights like "12% of actual dogs were tagged as wolf" come straight from here and tell you which classes the model conflates.
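As a toy illustration (three classes, made-up counts), row-normalizing turns raw counts into exactly that kind of per-class rate:

```python
import numpy as np

# Toy 3-class confusion matrix: rows = true labels, columns = predictions
labels = ["cat", "dog", "wolf"]
cm = np.array([
    [90,  5,  5],   # true cat
    [ 4, 84, 12],   # true dog: 12% of actual dogs were tagged as wolf
    [ 2, 10, 88],   # true wolf
])

# Divide each row by its total so cells read "fraction of true X labeled Y"
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(np.round(cm_norm, 2))
```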
A medical test answering "is this patient sick?". Outcomes split four ways: actually sick + test positive (TP), actually healthy + test negative (TN), actually healthy + test positive (false alarm — FP), actually sick + test negative (missed case — FN). Which error costs more shapes how the test is calibrated. The confusion matrix is exactly the count of these four cases.
A bank evaluates a card-fraud model on a 100,000-transaction test set:
|               | Pred: Legit | Pred: Fraud |
|---------------|-------------|-------------|
| Actual: Legit | 99,580 (TN) | 320 (FP)    |
| Actual: Fraud | 24 (FN)     | 76 (TP)     |
Read it row by row: 76 of the 100 actual frauds were caught (recall 76%); of 396 fraud alarms, 76 were real (precision 19%). Roughly one in five alarms is genuine, so the ops team has to triage 320 false alarms. Raising the decision threshold lifts precision but lowers recall; the business has to pick that operating point based on the cost of each error.
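To see that trade-off concretely, here is a sketch (assuming a fitted classifier `model` that exposes `predict_proba`, plus `X_test` and `y_test` with 0/1 labels, none of which are defined here) that sweeps the decision threshold:

```python
from sklearn.metrics import precision_score, recall_score

# Assumption: `model`, `X_test`, `y_test` are the fraud model and test set above
scores = model.predict_proba(X_test)[:, 1]   # predicted fraud probability

for threshold in (0.3, 0.5, 0.7, 0.9):
    preds = (scores >= threshold).astype(int)
    p = precision_score(y_test, preds, zero_division=0)
    r = recall_score(y_test, preds)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```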
```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

# Assumes a fitted binary classifier `model` and a held-out X_test / y_test
preds = model.predict(X_test)

cm = confusion_matrix(y_test, preds, labels=[0, 1])
print(cm)
# [[TN FP]
#  [FN TP]]

# Plot the matrix as a heatmap and save it
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=["legit", "fraud"],
)
disp.plot(cmap="Blues", values_format="d")
plt.savefig("cm.png", dpi=150, bbox_inches="tight")

# Per-class precision, recall, and F1 in one call
print(classification_report(y_test, preds))

# Multi-class: normalized rows reveal class confusion
cm_norm = confusion_matrix(y_test, preds, normalize="true")
```

- Every classification evaluation: fundamental diagnostic
- Spotting which error type the model is weak on
- Multi-class: spotting which classes are confused
- Communicating model behavior to product/business
- Universal tool — always used
Reading multi-class without normalizing
When class sizes differ, raw counts mislead. Normalize by row to read 'X% of true X were labeled Y'.
Only looking at the diagonal
The diagonal sum divided by the total is just accuracy. The real insight is which off-diagonal cells are largest: what is getting confused with what.
Tiny test set
With few examples, cell counts swing randomly. Compute confidence intervals and interpret the matrix accordingly.
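For example, a self-contained sketch using a Wilson score interval (with the fraud example's 76-of-100 recall as illustrative input) to put error bars on a rate read off the matrix:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a proportion such as recall or precision."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# Recall estimated from only 100 actual frauds: 76 caught
lo, hi = wilson_interval(76, 100)
print(f"recall = 0.76, 95% CI ≈ [{lo:.2f}, {hi:.2f}]")   # roughly [0.67, 0.83]
```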