AI Atlas
Beginner · ~2 min read · #confusion-matrix #classification #evaluation

Confusion Matrix

Where the model gets it right and wrong

A 2×2 (or N×N) table that contrasts predictions with true labels — the fastest way to see *what kind* of mistake your classifier makes.

[Figure: 2×2 confusion matrix, Actual (rows) × Predicted (columns): TN 99,580, FP 320, FN 24, TP 76. Tells you at a glance which kind of error the model makes.]
Definition

For binary classification, the confusion matrix is a 2×2 table. Rows are true labels, columns are predictions. The four cells: TP (true positive — correctly called positive), TN (true negative — correctly called negative), FP (false positive — predicted positive, actually negative), FN (false negative — predicted negative, actually positive).

Every basic metric flows from these. Accuracy = (TP+TN)/total. Precision = TP/(TP+FP). Recall = TP/(TP+FN). Specificity = TN/(TN+FP). FPR = FP/(FP+TN). Reading the matrix is the fastest way to see which error type the model is biased toward.
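
As a minimal sketch (assuming scikit-learn and a toy pair of label lists, not data from this article), the four cells and the metrics above can be pulled out like this:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 0, 1]   # toy labels, 1 = positive
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]   # toy predictions

# With labels=[0, 1], ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)
fpr         = fp / (fp + tn)
print(accuracy, precision, recall, specificity, fpr)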

For multi-class, the matrix is N×N. Diagonal = correct, off-diagonal = confusions between classes. Insights like "12% of actual dogs were predicted as wolf" come straight from here and tell you which classes the model conflates.
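
As a small illustration (the 3-class counts below are invented for the example, not taken from a real model), row-normalizing turns counts into "fraction of true class i predicted as class j":

import numpy as np

labels = ["cat", "dog", "wolf"]
# Hypothetical counts: rows = true class, columns = predicted class
cm = np.array([
    [95,  3,  2],   # true cats
    [ 4, 84, 12],   # true dogs (12 of 100 predicted as wolf)
    [ 1,  9, 90],   # true wolves
])

cm_norm = cm / cm.sum(axis=1, keepdims=True)   # each row now sums to 1
print(f"{cm_norm[1, 2]:.0%} of true {labels[1]} were predicted as {labels[2]}")
# -> 12% of true dog were predicted as wolf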

Analogy

A medical test answering "is this patient sick?". Outcomes split four ways: actually sick + test positive (TP), actually healthy + test negative (TN), actually healthy + test positive (false alarm — FP), actually sick + test negative (missed case — FN). Which error costs more shapes how the test is calibrated. The confusion matrix is exactly the count of these four cases.

Real-world example

A bank evaluates a card-fraud model on a 100,000-transaction test set:

|               | Pred: Legit | Pred: Fraud |
|---------------|-------------|-------------|
| Actual: Legit | 99,580 (TN) | 320 (FP)    |
| Actual: Fraud | 24 (FN)     | 76 (TP)     |

Read: 76 of the 100 actual frauds were caught (recall 76%); 396 alarms were raised in total, 76 of them real (precision 19%). Roughly one alarm in five is genuine, so the ops team has to triage 320 false alarms. Raising the decision threshold lifts precision but lowers recall; the business has to pick that trade-off based on the cost of each error.
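
To sanity-check those figures, a quick plug of the table's cell counts into the formulas from the Definition section (plain Python; the numbers are the ones from the table):

tn, fp, fn, tp = 99_580, 320, 24, 76   # cell counts from the table above

recall    = tp / (tp + fn)    # 76 / 100  = 0.76
precision = tp / (tp + fp)    # 76 / 396  ≈ 0.19
alarms    = tp + fp           # 396 alerts for the ops team to triage
print(f"recall={recall:.2f}  precision={precision:.2f}  alarms={alarms}")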

Code examples
scikit-learn · matrix and plot · Python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

preds = model.predict(X_test)  # assumes a fitted classifier `model` and held-out X_test / y_test

cm = confusion_matrix(y_test, preds, labels=[0, 1])
print(cm)
# [[TN  FP]
#  [FN  TP]]

disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=["legit", "fraud"],
)
disp.plot(cmap="Blues", values_format="d")
plt.savefig("cm.png", dpi=150, bbox_inches="tight")

print(classification_report(y_test, preds))

# Multi-class: normalized rows reveal class confusion
cm_norm = confusion_matrix(y_test, preds, normalize="true")
When to use
  • Every classification evaluation — fundamental diagnostic
  • Spotting which error type the model is weak on
  • Multi-class: spotting which classes are confused
  • Communicating model behavior to product/business
When not to use
  • Practically never: it is a universal diagnostic that applies to any classification task
Common pitfalls

Reading multi-class without normalizing

When class sizes differ, raw counts mislead. Normalize by row to read 'X% of true X were labeled Y'.

Only looking at the diagonal

The diagonal sum divided by the total is accuracy. The real insight is which off-diagonal cells are largest: what's getting confused with what.
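
One way to surface the largest confusions programmatically, as a sketch assuming a NumPy matrix cm (rows = true, columns = predicted) and a matching labels list:

import numpy as np

def top_confusions(cm, labels, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    cm = np.asarray(cm, dtype=float).copy()
    np.fill_diagonal(cm, 0)                      # drop correct predictions
    flat = np.argsort(cm, axis=None)[::-1][:k]   # indices of the largest cells
    rows, cols = np.unravel_index(flat, cm.shape)
    return [(labels[r], labels[c], int(cm[r, c])) for r, c in zip(rows, cols)]

# e.g. with the 3-class matrix above:
# [('dog', 'wolf', 12), ('wolf', 'dog', 9), ('dog', 'cat', 4)]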

Tiny test set

With few examples, cell counts swing randomly. Compute confidence intervals and interpret the matrix accordingly.
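
One way to get such an interval, as a sketch (the percentile bootstrap here is an assumed choice, not something this section prescribes):

import numpy as np

def bootstrap_recall_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for recall on a small test set."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        t, p = y_true[idx], y_pred[idx]
        if (t == 1).sum() == 0:        # no positives drawn in this resample
            continue
        stats.append(((t == 1) & (p == 1)).sum() / (t == 1).sum())
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])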