AI Atlas
Beginner · ~2 min read · #regression #supervised-learning #continuous

Regression

Predicting a continuous number

A supervised learning problem where the model predicts a continuous numeric output — house price, tomorrow's temperature, sales volume.

[Figure: a fitted line ŷ = β₀ + β₁x on an x–y scatter — regression predicts a continuous numeric target with a fitted curve.]
Definition

Regression is the sibling of classification in supervised learning. Where classification outputs a discrete label, regression outputs a continuous numeric value. "What is this house worth?" can be 1,250,000 or 1,250,001; it isn't a category, it's a real number.

Regression problems are usually solved by minimizing a loss function. The classic Mean Squared Error (MSE) penalizes the squared distance between prediction and truth, so large misses hurt disproportionately. The more outlier-robust Mean Absolute Error (MAE) penalizes errors linearly, limiting the impact of extreme values. Picking one over the other is a question of how dangerous outliers are in your domain.
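
The difference shows up immediately with a single outlier. A minimal sketch with made-up numbers, showing how squaring lets one large miss dominate:

```python
import numpy as np

# Made-up predictions vs. ground truth; the last point is a 370-unit miss
y_true = np.array([100.0, 120.0, 95.0, 110.0, 500.0])
y_pred = np.array([105.0, 118.0, 90.0, 112.0, 130.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)     # squaring lets the single outlier dominate
mae = np.mean(np.abs(errors))  # each error contributes linearly

print(f"MSE: {mse:.1f}")  # 27391.6 — almost entirely the one outlier
print(f"MAE: {mae:.1f}")  # 76.8
```

Four of the five predictions are within 5 units, yet MSE is four orders of magnitude above MAE — that asymmetry is exactly what the choice of loss controls.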

Models scale up: linear regression is the simplest, polynomial regression adds flexibility, and decision trees and gradient boosting (XGBoost, LightGBM, CatBoost) routinely top the charts on tabular data. Neural networks can learn high-dimensional, nonlinear relationships, but on small tabular datasets gradient boosting often still wins.
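
That gap is easy to see on deliberately nonlinear data. A sketch on an invented synthetic function (illustrative only, not a benchmark):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Invented nonlinear tabular data: y depends on an interaction of features
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 2))
y = np.sin(X[:, 0]) * X[:, 1] ** 2 + rng.normal(0, 0.1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for name, model in [("linear  ", LinearRegression()),
                    ("boosting", GradientBoostingRegressor(random_state=0))]:
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{name} MAE: {mae:.3f}")  # boosting lands far below linear here
```

On this data the linear model can only capture a crude trend, while the tree ensemble recovers the interaction — the usual story on nonlinear tabular problems.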

Analogy

Think of an experienced realtor. They walk into a house and within 30 seconds say "around 1.4 million". Where did that number come from? Location, square meters, floor, building age, bathrooms, view, recent comps — they fold all that into a mental formula they refined over years. A regression model does exactly that: map many features to a single number, learned from past sales.

Real-world example

A taxi service wants to show riders an estimated fare before they book. It has 10M past trips: pickup, dropoff, distance, time of day, weather, traffic level, driver tenure. Output = the fare actually paid (continuous).

A trained gradient-boosting model, given "Levent → Üsküdar, Tue 17:30, rainy, 14.2 km", predicts 168 TRY. The actual was 174; a 3.5% error. If the test-set Mean Absolute Error stays under 4 TRY, the model ships. The prediction drives both transparency for riders and operational planning for the platform.

Code examples
scikit-learn · regression and error metrics (Python)
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# X: features, y: continuous numeric target (e.g. price)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(n_estimators=300, max_depth=4)
model.fit(X_train, y_train)

preds = model.predict(X_test)

mae = mean_absolute_error(y_test, preds)
r2 = r2_score(y_test, preds)

print(f"Mean absolute error: {mae:.2f}")
print(f"R²: {r2:.3f}  (1.0 perfect, 0 = no better than mean)")
When to use
  • Predicting a continuous numeric value
  • Historical data has real values as labels
  • Measuring how far off you are (deviation) carries business value
  • You need a number, not a probability
When not to use
  • Output is categorical — pick classification
  • Prediction interval matters — use quantile regression or Bayesian models
  • You need to extrapolate well outside historical ranges — consider another approach
Common pitfalls

Not transforming the target

When the target spans 100K–10M, linear models fit it poorly or learn slowly. A log or Box-Cox transform of the target usually drops MAE substantially.
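
A convenient way to do this in scikit-learn is TransformedTargetRegressor, which fits on the transformed target and inverts predictions automatically. A sketch on invented multiplicative-noise data (the data-generating process and the Ridge choice are illustrative):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Invented skewed price target: log-linear in the features, so it spans
# several multiples of its median — a bad fit for a plain linear model
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = np.exp(12 + X @ np.array([0.8, 0.5, 0.3]) + rng.normal(0, 0.2, 1000))

# Fit on log1p(price); predictions are mapped back to the price scale
model = TransformedTargetRegressor(regressor=Ridge(),
                                   func=np.log1p, inverse_func=np.expm1)
model.fit(X, y)
print(model.predict(X[:3]))  # already back on the original price scale
```

Because the transform and its inverse live inside the estimator, downstream code (cross-validation, pipelines) never sees log-space values.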

Ignoring outliers

A handful of extreme values (a typo'd 999,999,999 price) can wreck MSE-based models. Use data cleaning, robust losses (Huber, MAE), or winsorization.
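
A sketch of the Huber option with scikit-learn's HuberRegressor, on invented linear data with one corrupted row:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Invented clean linear data (true slope 3) plus one typo'd extreme target
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 200)
X[0, 0] = 1.0
y[0] = 999_999_999  # the corrupted row

ols = LinearRegression().fit(X, y)              # squared loss chases the outlier
huber = HuberRegressor(max_iter=1000).fit(X, y) # Huber loss down-weights it

print("OLS slope:  ", ols.coef_[0])    # pulled absurdly far from 3
print("Huber slope:", huber.coef_[0])  # stays near the true slope of 3
```

Beyond the HuberRegressor estimator, gradient-boosting libraries expose the same idea as a loss option (e.g. loss="huber" or "absolute_error" in scikit-learn's GradientBoostingRegressor).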

Only looking at R²

R² can be high while the model still blows up on subranges. Plot residuals; report MAE/RMSE per slice.
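
A sketch of that per-slice check, on invented predictions that are accurate for cheap houses and noisy for expensive ones:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score

# Invented scenario: small errors below 1.5M, large errors above it
rng = np.random.default_rng(3)
y_true = rng.uniform(100_000, 2_000_000, 5000)
noise = np.where(y_true > 1_500_000,
                 rng.normal(0, 300_000, 5000),   # broken expensive slice
                 rng.normal(0, 20_000, 5000))    # healthy cheap slices
y_pred = y_true + noise

print(f"overall R²: {r2_score(y_true, y_pred):.3f}")  # high despite the broken slice

# MAE per price band exposes the weak slice
df = pd.DataFrame({"y": y_true, "abs_err": np.abs(y_pred - y_true)})
df["band"] = pd.cut(df["y"], bins=[0, 500_000, 1_000_000, 1_500_000, 2_000_000])
print(df.groupby("band", observed=True)["abs_err"].mean())
```

The global R² looks fine because the variance of the target dwarfs the errors, while the top band's MAE is an order of magnitude worse than the rest — exactly the failure a single headline metric hides.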