Precision and Recall
Two complementary metrics
Two essential classification metrics: Precision = 'of the items I called positive, how many were correct?', Recall = 'of all the actual positives, how many did I catch?'.
Accuracy alone is often misleading, especially under class imbalance. Precision and recall answer two different questions.
Precision: of the predictions labeled positive, how many are truly positive? Precision = TP / (TP + FP). When false alarms are expensive, optimize for precision. A spam filter that sends an important email to spam makes an expensive mistake → high precision wins.
Recall (sensitivity): of all actual positives, how many did the model catch? Recall = TP / (TP + FN). When missing a positive is what hurts, optimize for recall. A cancer screening that misses a sick patient costs far more than a false alarm → high recall wins.
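A quick sketch of both formulas on made-up labels (chosen so that TP = 3, FP = 2, FN = 1):

```python
from sklearn.metrics import precision_score, recall_score

# illustrative labels only: 4 actual positives, 5 predicted positives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 5 = 0.60
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4 = 0.75
```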
There's a natural tension. Raise the threshold → only confident positives → precision up, recall down. Lower the threshold → the reverse. F1 score is the harmonic mean of the two — a single number when you need to summarize both. Macro-F1 weighs each class equally, useful for imbalance.
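A minimal sketch of the trade-off on hand-made scores (the numbers are illustrative, not real model output): the same probabilities evaluated at two thresholds, with F1 as the single-number summary.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.95, 0.80, 0.70, 0.40, 0.65, 0.55, 0.30, 0.20, 0.10, 0.05])

for threshold in (0.3, 0.6):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred)   # low threshold: 0.57, high threshold: 0.75
    r = recall_score(y_true, y_pred)      # low threshold: 1.00, high threshold: 0.75
    f = f1_score(y_true, y_pred)          # harmonic mean of the two
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```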
Casting a fishing net. Precision: what fraction of what you pulled in was actually salmon (the target)? If the net hauled old tires, plastic, crabs — precision is low. Recall: of all salmon in the sea, what fraction did you catch? If a school swam past — recall is low.
A tight net catches little, mostly salmon (high precision, low recall). A wide net catches everything, salmon mixed with junk (low precision, high recall). The right answer depends on the business.
A site builds a hate-speech detector for comments. Three thresholds are evaluated:
| Threshold | Precision | Recall | Reading |
|-----------|-----------|--------|---------|
| 0.3 | 62% | 94% | Catches almost all, but 38% of flags are wrong |
| 0.5 | 81% | 78% | Balanced |
| 0.7 | 93% | 52% | Few mistakes, but half slip through |
The product team decides: "false flags lose users; misses can be caught by human moderators" → threshold 0.7. The auto-filter handles 52% of cases and the rest goes to human moderators. Which metric you prioritize directly changes the business decision.
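A sketch of how a table like that could be produced; `model`, `X_val`, and `y_val` are placeholders for a fitted classifier and a labeled validation split:

```python
from sklearn.metrics import precision_score, recall_score

# `model`, X_val, y_val stand in for a fitted classifier and a labeled validation split
probs = model.predict_proba(X_val)[:, 1]

for threshold in (0.3, 0.5, 0.7):
    flags = (probs >= threshold).astype(int)
    p = precision_score(y_val, flags)
    r = recall_score(y_val, flags)
    print(f"threshold={threshold}: precision={p:.0%} recall={r:.0%}")
```

The fuller snippet below adds a classification report, the full precision-recall curve, and threshold selection.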
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (
    classification_report, precision_recall_curve,
    average_precision_score,
)

# `model` is an already-fitted binary classifier; X_test / y_test is a held-out split
probs = model.predict_proba(X_test)[:, 1]
preds = (probs > 0.5).astype(int)
print(classification_report(y_test, preds, target_names=["neg", "pos"]))
prec, rec, thr = precision_recall_curve(y_test, probs)
ap = average_precision_score(y_test, probs)
plt.plot(rec, prec, label=f"AP={ap:.3f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.savefig("pr_curve.png")
# Pick threshold based on business target (e.g., recall ≥ 80%)
target = 0.8
ok = rec[:-1] >= target  # drop the final PR point, which has no matching threshold
best = np.flatnonzero(ok)[prec[:-1][ok].argmax()]  # highest precision among points hitting the recall target
print(f"Best precision at recall ≥ {target}: {prec[best]:.3f}")
print(f"Threshold there: {thr[best]:.3f}")
```

When these metrics matter:
- Imbalanced classes — accuracy is misleading
- Asymmetric costs of false positives vs false negatives
- Justifying a threshold to product/business
- Information retrieval and recommendation systems
- These are universal classification metrics — always relevant
Only looking at F1
F1 is a single summary number. A model with precision 45% / recall 90% and one with precision 90% / recall 45% share the same F1 (60%) but behave very differently. Always show all three.
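A two-line check that F1 cannot tell those two models apart:

```python
def f1(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

print(f1(0.45, 0.90), f1(0.90, 0.45))  # both 0.6
```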
Confusing macro vs micro F1
Macro weighs each class equally — pick it when small classes matter. Micro weighs by sample count — the big class dominates.
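A quick illustration on a made-up imbalanced set, where the model misses the minority class entirely: micro-F1 barely notices, macro-F1 drops sharply.

```python
from sklearn.metrics import f1_score

# 8 samples of class 0, 2 of class 1; every prediction is class 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(f1_score(y_true, y_pred, average="micro", zero_division=0))  # 0.80, dominated by the big class
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.44, the missed class drags it down
```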
Picking the threshold without business input
0.5 is a default, not a decision. Compute the cost of false positives vs negatives and tune accordingly.
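One way to turn that into code, as a sketch: the 1 : 5 cost ratio below is invented, and `probs` / `y_val` stand in for held-out probabilities and labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FP, COST_FN = 1.0, 5.0  # hypothetical costs: a miss hurts 5x more than a false alarm

def expected_cost(y_true, scores, threshold):
    """Total cost of the errors made at a given decision threshold."""
    preds = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, preds).ravel()
    return fp * COST_FP + fn * COST_FN

# probs / y_val are placeholders for validation-set probabilities and labels
thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(y_val, probs, t) for t in thresholds]
print(f"Cost-minimizing threshold: {thresholds[int(np.argmin(costs))]:.2f}")
```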