Precision and Recall
Two complementary metrics
Two essential classification metrics: Precision = 'of the items I called positive, how many were correct?', Recall = 'of all the actual positives, how many did I catch?'.
Accuracy alone is often misleading, especially under class imbalance. Precision and recall answer two different questions.
Precision: of the predictions labeled positive, how many are truly positive? Precision = TP / (TP + FP). When false alarms are expensive, optimize for precision. A spam filter that sends an important email to spam makes an expensive mistake → high precision wins.
Recall (sensitivity): of all actual positives, how many did the model catch? Recall = TP / (TP + FN). When missing a positive is what hurts, optimize for recall. A cancer screening that misses a sick patient costs far more than a false alarm → high recall wins.
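A quick sketch of both formulas on made-up labels (chosen so that TP = 3, FP = 2, FN = 1):

```python
from sklearn.metrics import precision_score, recall_score

# illustrative labels only: 4 actual positives, 5 predicted positives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 5 = 0.60
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4 = 0.75
```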
There's a natural tension. Raise the threshold → only confident positives → precision up, recall down. Lower the threshold → the reverse. F1 score is the harmonic mean of the two — a single number when you need to summarize both. Macro-F1 weighs each class equally, useful for imbalance.
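A minimal sketch of the trade-off on hand-made scores (the numbers are illustrative, not real model output): the same probabilities evaluated at two thresholds, with F1 as the single-number summary.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.95, 0.80, 0.70, 0.40, 0.65, 0.55, 0.30, 0.20, 0.10, 0.05])

for threshold in (0.3, 0.6):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred)   # low threshold: 0.57, high threshold: 0.75
    r = recall_score(y_true, y_pred)      # low threshold: 1.00, high threshold: 0.75
    f = f1_score(y_true, y_pred)          # harmonic mean of the two
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```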
Casting a fishing net. Precision: what fraction of what you pulled in was actually salmon (the target)? If the net hauled old tires, plastic, crabs — precision is low. Recall: of all salmon in the sea, what fraction did you catch? If a school swam past — recall is low.
A tight net catches little, mostly salmon (high precision, low recall). A wide net catches everything, salmon mixed with junk (low precision, high recall). The right answer depends on the business.
A site builds a hate-speech detector for comments. Three thresholds are evaluated:
| Threshold | Precision | Recall | Reading |
|-----------|-----------|--------|---------|
| 0.3 | 62% | 94% | Catches almost all, but 38% of flags are wrong |
| 0.5 | 81% | 78% | Balanced |
| 0.7 | 93% | 52% | Few mistakes, but half slip through |
The product team decides: "false flags lose users; misses can be caught by human moderators" → threshold 0.7. The auto-filter handles 52% of cases and the rest goes to human moderators. Which metric you prioritize directly changes the business decision.
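A sketch of how a table like that could be produced; `model`, `X_val`, and `y_val` are placeholders for a fitted classifier and a labeled validation split:

```python
from sklearn.metrics import precision_score, recall_score

# `model`, X_val, y_val stand in for a fitted classifier and a labeled validation split
probs = model.predict_proba(X_val)[:, 1]

for threshold in (0.3, 0.5, 0.7):
    flags = (probs >= threshold).astype(int)
    p = precision_score(y_val, flags)
    r = recall_score(y_val, flags)
    print(f"threshold={threshold}: precision={p:.0%} recall={r:.0%}")
```

The fuller snippet below adds a classification report, the full precision-recall curve, and threshold selection.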
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (
    classification_report, precision_recall_curve,
    average_precision_score,
)

# `model` is an already-fitted binary classifier; X_test / y_test is a held-out split
probs = model.predict_proba(X_test)[:, 1]
preds = (probs > 0.5).astype(int)
print(classification_report(y_test, preds, target_names=["neg", "pos"]))
prec, rec, thr = precision_recall_curve(y_test, probs)
ap = average_precision_score(y_test, probs)
plt.plot(rec, prec, label=f"AP={ap:.3f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.savefig("pr_curve.png")
# Pick threshold based on business target (e.g., recall ≥ 80%)
target = 0.8
ok = rec[:-1] >= target  # drop the final PR point, which has no matching threshold
best = np.flatnonzero(ok)[prec[:-1][ok].argmax()]  # highest precision among points hitting the recall target
print(f"Best precision at recall ≥ {target}: {prec[best]:.3f}")
print(f"Threshold there: {thr[best]:.3f}")
```

When these metrics matter:
- Imbalanced classes — accuracy is misleading
- Asymmetric costs of false positives vs false negatives
- Justifying a threshold to product/business
- Information retrieval and recommendation systems
- These are universal classification metrics — always relevant
Only looking at F1
F1 is a single summary number. A model with precision 45% / recall 90% and one with precision 90% / recall 45% share the same F1 (60%) but behave very differently. Always show all three.
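A two-line check that F1 cannot tell those two models apart:

```python
def f1(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

print(f1(0.45, 0.90), f1(0.90, 0.45))  # both 0.6
```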
Confusing macro vs micro F1
Macro weighs each class equally — pick it when small classes matter. Micro weighs by sample count — the big class dominates.
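A quick illustration on a made-up imbalanced set, where the model misses the minority class entirely: micro-F1 barely notices, macro-F1 drops sharply.

```python
from sklearn.metrics import f1_score

# 8 samples of class 0, 2 of class 1; every prediction is class 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(f1_score(y_true, y_pred, average="micro", zero_division=0))  # 0.80, dominated by the big class
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.44, the missed class drags it down
```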
Picking the threshold without business input
0.5 is a default, not a decision. Compute the cost of false positives vs negatives and tune accordingly.
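One way to turn that into code, as a sketch: the 1 : 5 cost ratio below is invented, and `probs` / `y_val` stand in for held-out probabilities and labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FP, COST_FN = 1.0, 5.0  # hypothetical costs: a miss hurts 5x more than a false alarm

def expected_cost(y_true, scores, threshold):
    """Total cost of the errors made at a given decision threshold."""
    preds = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, preds).ravel()
    return fp * COST_FP + fn * COST_FN

# probs / y_val are placeholders for validation-set probabilities and labels
thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(y_val, probs, t) for t in thresholds]
print(f"Cost-minimizing threshold: {thresholds[int(np.argmin(costs))]:.2f}")
```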