Loss Function
What the model optimizes
A function that quantifies how far the model's prediction is from the truth — training searches for parameters that minimize it.
During training the loss function expresses "how good is the model right now?" as a single number. The model predicts, the prediction is compared to the truth, and the gap is a number. The training algorithm (typically gradient descent) updates parameters in the direction that reduces this number.
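This loop can be sketched in a few lines. The model, data, and learning rate below are illustrative assumptions, not taken from the text: a one-parameter model y = w·x, trained by gradient descent on MSE.

```python
import numpy as np

# Minimal sketch: gradient descent on MSE for a one-parameter model y = w * x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # generated with true slope 2

w = 0.0                          # initial guess
lr = 0.05                        # learning rate
for _ in range(200):
    pred = w * x                              # model predicts
    grad = 2 * np.mean((pred - y) * x)        # d/dw of mean((w*x - y)^2)
    w -= lr * grad                            # step in the direction that reduces the loss

print(round(w, 3))  # converges near the true slope, 2.0
```

Each iteration is exactly the cycle described above: predict, compare to the truth, reduce the resulting number.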
Loss is chosen by problem type and shapes how the model learns. Some standard choices:
- MSE (Mean Squared Error): classic regression loss. Squares the error, so large mistakes are heavily penalized. Sensitive to outliers.
- MAE (Mean Absolute Error): regression. More tolerant of outliers, but provides less informative gradients.
- Huber loss: a bridge between MSE and MAE. Acts like MSE for small errors, like MAE for large ones.
- Cross-Entropy: the classification workhorse. Measures how far a probability prediction is from the true class, turning learning into a probabilistic objective.
- Hinge loss: the SVM staple. Zero loss if the prediction is on the correct side with sufficient margin; grows linearly with the margin violation otherwise.
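To make the differences concrete, here is a small NumPy sketch (the error values are illustrative, chosen so one of them plays the outlier) computing the per-sample value of three of these losses on the same errors:

```python
import numpy as np

# Per-sample losses on the same set of errors; the last error is an "outlier".
err = np.array([0.5, -0.5, 2.0, 10.0])

mse = err ** 2
mae = np.abs(err)
delta = 1.0  # Huber threshold (a common default)
huber = np.where(np.abs(err) <= delta,
                 0.5 * err ** 2,                      # quadratic near zero
                 delta * (np.abs(err) - 0.5 * delta))  # linear beyond delta

print(mse)    # [  0.25   0.25   4.   100.  ]  -> the outlier dominates
print(mae)    # [ 0.5  0.5  2.  10. ]          -> grows only linearly
print(huber)  # [0.125 0.125 1.5   9.5 ]       -> MSE-like small, MAE-like large
```

The outlier contributes 100 to MSE but only 10 to MAE, which is exactly why MSE-trained models get pulled toward outliers.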
Picking the loss changes model behavior. Same data and architecture with a different loss can produce a meaningfully different model — choosing the right one is core engineering.
Like an archery scoring scheme. It turns "how far from center?" into a number. MSE demands the bullseye and punishes outer rings hard. MAE penalizes every deviation equally. Hinge rewards anything close enough. The scheme you use shapes how the archer trains. The model is the same — the loss defines what it optimizes for.
Training profit forecasters on the same data with two losses: one MSE, one MAE.
- MSE version: knows the average well, but a fluke 2M outlier in one season warps the model. The next season's ordinary forecasts come back consistently inflated.
- MAE version: less affected by the outlier, similar accuracy in the normal range, more consistent average error.
The business says "we need consistent forecasts; we'll handle the outlier separately" → ship MAE. Same data, same algorithm — only the loss differs, yet behavior changes meaningfully.
```python
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.metrics import mean_absolute_error
import numpy as np

models = {
    "Linear (MSE)": LinearRegression(),
    "Huber (Huber)": HuberRegressor(epsilon=1.35),
}

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 1))
y = 3 * X.squeeze() + rng.normal(scale=0.5, size=n)
y[:10] += 50  # 10 outliers

for name, m in models.items():
    m.fit(X, y)
    preds = m.predict(X)
    mae = mean_absolute_error(y, preds)
    print(f"{name:15s} coef={m.coef_[0]:+.3f} MAE={mae:.3f}")
# The MSE-based fit is pulled by the outliers;
# Huber stays closer to the true slope.
```

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])  # 2 samples, 3 classes
targets = torch.tensor([0, 1])
loss = criterion(logits, targets)
print(f"Cross-entropy: {loss.item():.4f}")
```
Wrong loss for the problem
Using MSE where you need probabilities, sticking with MSE on outlier-heavy data, using MAE for classification — all produce poor results.
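A toy illustration of the first failure (the numbers are my own, not from the text): for a confidently wrong probability, squared error stays bounded, while cross-entropy grows without bound, which is what makes it the right objective for probabilities.

```python
import numpy as np

# True class = 1; the model confidently predicts p = 0.01 for it.
p = 0.01
mse = (1 - p) ** 2           # ~0.98: bounded, mild penalty no matter how wrong
cross_entropy = -np.log(p)   # ~4.61: blows up as p -> 0, strongly punishing confidence

print(round(mse, 2), round(cross_entropy, 2))
```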
Confusing loss with evaluation metric
What you optimize during training and what you report at the end are usually different. Optimize cross-entropy, report F1. Different roles.
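A minimal sketch of that split, with an illustrative scikit-learn setup (the dataset and model are my own assumptions): LogisticRegression internally minimizes cross-entropy (log loss), while the number you report can be F1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Training objective: cross-entropy. Reported metric: F1. Different roles.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print(f"optimized (log loss): {log_loss(y_tr, clf.predict_proba(X_tr)):.3f}")
print(f"reported  (F1):       {f1_score(y_te, clf.predict(X_te)):.3f}")
```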
Forgetting the regularization term
In practice, loss = data loss + λ × regularizer. Without tuning λ you end up with an over- or underfit model.
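One way to see λ at work, sketched with scikit-learn's Ridge on synthetic data of my own choosing (Ridge's `alpha` plays the role of λ in loss = data loss + λ‖w‖²):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))                  # few samples, many features: easy to overfit
y = X[:, 0] + rng.normal(scale=0.1, size=50)   # only the first feature matters

for alpha in (0.01, 1.0, 1000.0):              # alpha ~ lambda
    w = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"lambda={alpha:7.2f}  ||w||={np.linalg.norm(w):.3f}")
# Too small a lambda leaves weights large (overfit risk);
# too large a lambda shrinks them toward zero (underfit).
```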