Loss Function
What the model optimizes
A function that quantifies how far the model's prediction is from the truth — training searches for parameters that minimize it.
During training the loss function expresses "how good is the model right now?" as a single number. The model predicts, the prediction is compared to the truth, and the gap is a number. The training algorithm (typically gradient descent) updates parameters in the direction that reduces this number.
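This loop can be sketched in a few lines. The model, data, and learning rate below are illustrative assumptions, not taken from the text: a one-parameter model y = w·x, trained by gradient descent on MSE.

```python
import numpy as np

# Minimal sketch: gradient descent on MSE for a one-parameter model y = w * x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # generated with true slope 2

w = 0.0                          # initial guess
lr = 0.05                        # learning rate
for _ in range(200):
    pred = w * x                              # model predicts
    grad = 2 * np.mean((pred - y) * x)        # d/dw of mean((w*x - y)^2)
    w -= lr * grad                            # step in the direction that reduces the loss

print(round(w, 3))  # converges near the true slope, 2.0
```

Each iteration is exactly the cycle described above: predict, compare to the truth, reduce the resulting number.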
Loss is chosen by problem type and shapes how the model learns. Some standard choices:
- MSE (Mean Squared Error): classic regression loss. Squares the error, so large mistakes are heavily penalized. Sensitive to outliers.
- MAE (Mean Absolute Error): regression. More tolerant of outliers, but provides less informative gradients.
- Huber loss: a bridge between MSE and MAE. Acts like MSE for small errors, like MAE for large ones.
- Cross-Entropy: the classification workhorse. Measures how far a probability prediction is from the true class, turning learning into a probabilistic objective.
- Hinge loss: the SVM staple. Zero loss if the prediction is on the correct side with sufficient margin; grows linearly with the margin violation otherwise.
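To make the differences concrete, here is a small NumPy sketch (the error values are illustrative, chosen so one of them plays the outlier) computing the per-sample value of three of these losses on the same errors:

```python
import numpy as np

# Per-sample losses on the same set of errors; the last error is an "outlier".
err = np.array([0.5, -0.5, 2.0, 10.0])

mse = err ** 2
mae = np.abs(err)
delta = 1.0  # Huber threshold (a common default)
huber = np.where(np.abs(err) <= delta,
                 0.5 * err ** 2,                      # quadratic near zero
                 delta * (np.abs(err) - 0.5 * delta))  # linear beyond delta

print(mse)    # [  0.25   0.25   4.   100.  ]  -> the outlier dominates
print(mae)    # [ 0.5  0.5  2.  10. ]          -> grows only linearly
print(huber)  # [0.125 0.125 1.5   9.5 ]       -> MSE-like small, MAE-like large
```

The outlier contributes 100 to MSE but only 10 to MAE, which is exactly why MSE-trained models get pulled toward outliers.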
Picking the loss changes model behavior. Same data and architecture with a different loss can produce a meaningfully different model — choosing the right one is core engineering.
Like an archery scoring scheme. It turns "how far from center?" into a number. MSE demands the bullseye and punishes outer rings hard. MAE penalizes every deviation equally. Hinge rewards anything close enough. The scheme you use shapes how the archer trains. The model is the same — the loss defines what it optimizes for.
Training profit forecasters on the same data with two losses: one MSE, one MAE.
- MSE version: knows the average well, but a fluke 2M outlier in one season warps the model. The next season's ordinary forecasts come back consistently inflated.
- MAE version: less affected by the outlier, similar accuracy in the normal range, more consistent average error.
The business says "we need consistent forecasts; we'll handle the outlier separately" → ship MAE. Same data, same algorithm — only the loss differs, yet behavior changes meaningfully.
```python
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.metrics import mean_absolute_error
import numpy as np

models = {
    "Linear (MSE)": LinearRegression(),
    "Huber (Huber)": HuberRegressor(epsilon=1.35),
}

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 1))
y = 3 * X.squeeze() + rng.normal(scale=0.5, size=n)
y[:10] += 50  # 10 outliers

for name, m in models.items():
    m.fit(X, y)
    preds = m.predict(X)
    mae = mean_absolute_error(y, preds)
    print(f"{name:15s} coef={m.coef_[0]:+.3f} MAE={mae:.3f}")
# The MSE-based fit is pulled by the outliers;
# Huber stays closer to the true slope.
```

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])  # 2 samples, 3 classes
targets = torch.tensor([0, 1])
loss = criterion(logits, targets)
print(f"Cross-entropy: {loss.item():.4f}")
```
Wrong loss for the problem
Using MSE where you need probabilities, sticking with MSE on outlier-heavy data, using MAE for classification — all produce poor results.
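A toy illustration of the first failure (the numbers are my own, not from the text): for a confidently wrong probability, squared error stays bounded, while cross-entropy grows without bound, which is what makes it the right objective for probabilities.

```python
import numpy as np

# True class = 1; the model confidently predicts p = 0.01 for it.
p = 0.01
mse = (1 - p) ** 2           # ~0.98: bounded, mild penalty no matter how wrong
cross_entropy = -np.log(p)   # ~4.61: blows up as p -> 0, strongly punishing confidence

print(round(mse, 2), round(cross_entropy, 2))
```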
Confusing loss with evaluation metric
What you optimize during training and what you report at the end are usually different. Optimize cross-entropy, report F1. Different roles.
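A minimal sketch of that split, with an illustrative scikit-learn setup (the dataset and model are my own assumptions): LogisticRegression internally minimizes cross-entropy (log loss), while the number you report can be F1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Training objective: cross-entropy. Reported metric: F1. Different roles.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print(f"optimized (log loss): {log_loss(y_tr, clf.predict_proba(X_tr)):.3f}")
print(f"reported  (F1):       {f1_score(y_te, clf.predict(X_te)):.3f}")
```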
Forgetting the regularization term
In practice, loss = data loss + λ × regularizer. Without tuning λ you end up with an over- or underfit model.
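One way to see λ at work, sketched with scikit-learn's Ridge on synthetic data of my own choosing (Ridge's `alpha` plays the role of λ in loss = data loss + λ‖w‖²):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))                  # few samples, many features: easy to overfit
y = X[:, 0] + rng.normal(scale=0.1, size=50)   # only the first feature matters

for alpha in (0.01, 1.0, 1000.0):              # alpha ~ lambda
    w = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"lambda={alpha:7.2f}  ||w||={np.linalg.norm(w):.3f}")
# Too small a lambda leaves weights large (overfit risk);
# too large a lambda shrinks them toward zero (underfit).
```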