Classification
Assigning a label from a fixed set
A supervised learning problem where the model picks one label from a finite set — for example 'spam vs. not spam'.
Classification is one of the two main families of supervised learning (the other being regression). The model is asked to pick one label from a predefined set for every input it sees. There are three flavors, depending on how many labels exist and how many apply per input: binary classification (spam/ham, malignant/benign), multi-class (dog/cat/bird/fish, exactly one per input), and multi-label (several at once: a movie can be both action and comedy).
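A minimal sketch of how the three flavors differ in the shape of their targets (toy arrays, purely illustrative):

import numpy as np

y_binary = np.array([0, 1, 1, 0])      # binary: one of two labels per input
y_multiclass = np.array([2, 0, 3, 1])  # multi-class: one of k labels per input
y_multilabel = np.array([[1, 0],       # multi-label: one indicator per label;
                         [1, 1],       # a row can have several 1s at once
                         [0, 1]])      # (e.g. both action and comedy)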
The output is rarely a hard label; it's usually a probability distribution over classes. The model says "87% spam, 13% ham" and you turn that into a decision with a threshold. The threshold moves with the cost of mistakes: in cancer screening or fraud detection you lower it, because a missed positive is the expensive error; in spam filtering you raise it, because blocking legitimate mail costs more than letting the odd spam through.
Classification is an old problem; logistic regression, decision trees, SVMs, k-NN, and neural networks all solve it. Choosing among them depends on dataset size, explainability needs, class imbalance, and so on.
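As a quick illustration, cross-validating a couple of candidates on the same folds is a common starting point. A sketch, assuming a feature matrix X and labels y are already loaded; the two models and the F1 metric are just examples:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Score both candidates on the same 5 folds; compare mean F1, not accuracy.
for candidate in [LogisticRegression(max_iter=1000), DecisionTreeClassifier()]:
    scores = cross_val_score(candidate, X, y, cv=5, scoring="f1")
    print(type(candidate).__name__, round(scores.mean(), 3))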
Picture a postal sorter in a mail room. They glance at every envelope and toss it into the right slot: "Marketing", "Bills", "Important", "Junk". With the patterns they learned years ago, they sort dozens a minute without thinking. Sometimes they hesitate: they're weighing probabilities, and that's exactly what your model produces. The threshold is the sorter's caution: when uncertain, send it to "Important", because the cost of misfiling that one is highest.
A news site wants to auto-flag hate speech in comments. Editors have hand-labeled 50,000 comments as "clean", "profanity", or "hate speech": a three-class classification problem.
The trained model outputs three probabilities per new comment. The product lead picks thresholds: above 40% "profanity" → send to moderator queue; above 25% "hate speech" → auto-hide. When a mistake surfaces, the lead tightens or relaxes a threshold. There is no single "safety band"; each class is tuned by its business cost.
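A minimal sketch of how those per-class thresholds might be applied; the class order and the probability matrix are made up for illustration:

import numpy as np

# Columns assumed to be ["clean", "profanity", "hate speech"],
# i.e. the shape of probs = model.predict_proba(new_comments)
probs = np.array([[0.90, 0.08, 0.02],
                  [0.35, 0.50, 0.15],
                  [0.45, 0.25, 0.30]])

to_moderator = probs[:, 1] > 0.40  # profanity above 40%: moderator queue
auto_hide = probs[:, 2] > 0.25     # hate speech above 25%: auto-hide
print(to_moderator)  # [False  True False]
print(auto_hide)     # [False False  True]

An end-to-end pipeline for the simpler binary spam case looks like this: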
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
# X: features, y: 0 (ham) or 1 (spam)
# stratify keeps the spam/ham ratio identical in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
# Probability predictions
probs = model.predict_proba(X_test)[:, 1]
# Pick threshold by cost (default 0.5)
threshold = 0.4
preds = (probs > threshold).astype(int)
# Always look at precision/recall/F1, not just accuracy
print(classification_report(y_test, preds, target_names=["ham", "spam"]))

When to use
- Output falls into a finite, discrete set of labels
- You have historical data with the correct answers
- A decision needs to be automated: archive, block, escalate
- Knowing per-class probabilities matters (for threshold tuning)
When not to use
- The output is a continuous number (use regression instead)
- Thousands of classes with only a few examples each
- Labels are too expensive to collect — try unsupervised or semi-supervised
Common pitfalls
Only looking at accuracy
If 99 of 100 emails are 'ham', a model that always says 'ham' scores 99% accuracy while finding zero spam. Precision, recall, F1, and the confusion matrix are non-negotiable.
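A quick demonstration with synthetic labels (the 99:1 split is made up):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 99 + [1])   # 99 ham, 1 spam
y_pred = np.zeros(100, dtype=int)   # a "model" that always predicts ham

print(accuracy_score(y_true, y_pred))  # 0.99, looks impressive
print(recall_score(y_true, y_pred))    # 0.0, catches zero spam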
Forgetting the threshold
Most libraries default to 0.5. In health, security, or fraud applications, that default is rarely right. Compute the cost of false positives versus false negatives and tune accordingly.
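One way to tune it, sketched below: sweep candidate thresholds and keep the cheapest. This reuses y_test and probs from the code above; the 1:10 cost ratio is an assumption for illustration:

import numpy as np
from sklearn.metrics import confusion_matrix

FP_COST, FN_COST = 1, 10  # assumed: a missed positive hurts 10x more

def total_cost(threshold):
    preds = (probs > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds, labels=[0, 1]).ravel()
    return fp * FP_COST + fn * FN_COST

best = min(np.linspace(0.05, 0.95, 19), key=total_cost)
print(best)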
Ignoring class imbalance
If the positive class is under 1%, a model can look 'good' by always predicting the negative class. Use SMOTE, class weights, or balanced sampling to correct the imbalance.
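A minimal sketch of the lightest-weight of those fixes, class weights, reusing X_train and y_train from the code above (SMOTE lives in the separate imbalanced-learn package):

from sklearn.linear_model import LogisticRegression

# "balanced" weights errors inversely to class frequency, so the rare
# positive class is not drowned out by the majority.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)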