AI Atlas
Intermediate · ~2 min read · #random-forest #ensemble #bagging

Random Forest

An ensemble of decision trees

Trains hundreds of decision trees with injected randomness, then averages their votes — far more stable and accurate than a single tree.

Diagram: many trees each cast a vote; the forest takes the majority (or the average), which is far more stable than any single tree.
Definition

A single decision tree is sensitive to its training data: a few different rows and the structure can flip. Random forest fixes that with two doses of randomness. Bagging trains each tree on a bootstrap (random with replacement) sample of the data; every tree sees a slightly different slice. On top of that, each split considers only a random subset of features, so the trees don't end up too similar.
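A minimal from-scratch sketch of those two doses of randomness (assuming NumPy arrays X and y; the tree count is illustrative, and scikit-learn's RandomForestClassifier shown further down does all of this for you):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def fit_bagged_trees(X, y, n_trees=25):
    """Bagging: each tree trains on a bootstrap sample and subsamples features at every split."""
    trees = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                    # bootstrap: draw n rows with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset at each split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees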

At inference time each tree votes. Classification picks the majority class; regression averages predictions. The "average" of hundreds of trees has much lower variance than any single one and generalizes much better. This simple idea was the de facto best choice for tabular data for over a decade.
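Continuing the sketch above, inference is just a vote over the list of fitted trees (assuming non-negative integer class labels in the classification case):

def predict_majority(trees, X):
    """Classification: each tree votes; the most common class wins."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # shape (n_trees, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

def predict_average(trees, X):
    """Regression: average the per-tree predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)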

Random forest also gives you a near-free feature-importance measure: shuffle one feature's values at random and see how much accuracy drops (permutation importance). It's a quick, clean answer to "which inputs actually drive decisions?".
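The idea is simple enough to write down directly. A back-of-the-envelope version for a fitted classifier and a NumPy feature matrix (the scikit-learn helper in the code example below is the more convenient route):

import numpy as np
from sklearn.metrics import accuracy_score

def manual_permutation_importance(model, X, y, seed=0):
    """Shuffle one column at a time and measure how far accuracy falls."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    drops = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between feature j and the target
        drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
    return np.array(drops)                             # big drop = important feature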

Analogy

Asking a panel of experts instead of one. Each expert has trained on slightly different sources, holds different blind spots, and is allowed to see slightly different evidence per question. They each cast a vote and you take the majority. Any single expert might be wrong; ten being wrong simultaneously is rare.

Real-world example

An insurer wants to predict claim cost for new applicants. Five years of history on two million policies is available, with dozens of features: age, vehicle, driving history, location, usage, credit score, and more.

A 500-tree random forest beats a single tuned decision tree by 14% on test-set MAE. Permutation feature importance surfaces "where the car is parked overnight" as twice as important as "vehicle make", something the actuarial team hadn't seen coming. Both accuracy and insight were gained.
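This is a regression task, so a sketch of that kind of setup would use RandomForestRegressor and MAE (X and y stand in for the policy features and claim costs; the hyperparameters are illustrative, not the insurer's actual pipeline):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=500, min_samples_leaf=5, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print("test MAE:", mean_absolute_error(y_test, rf.predict(X_test)))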

Code examples
scikit-learn · random forest (Python)
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import numpy as np

rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=None,           # let each tree grow fully
    min_samples_leaf=2,
    max_features="sqrt",      # √n features per split
    n_jobs=-1,                # parallel
    random_state=42,
)

rf.fit(X_train, y_train)

# Permutation importance is more reliable than impurity-based
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1][:10]:
    print(f"{feature_names[i]:30s}  {result.importances_mean[i]:+.4f}")
When to use
  • Strong baseline on tabular data
  • Some explainability needed (feature importance comes free)
  • Outliers and noisy features are present; tree splits handle them well (missing values may still need imputation, depending on the library)
  • Limited time for hyperparameter tuning, still want good results
When not to use
  • Speed matters at scale — gradient boosting (XGBoost/LightGBM) usually wins
  • Linear relationships dominate — a linear model is enough
  • Latency-critical inference — 500-tree summation can be slow
Common pitfalls

Too many trees

Past ~500 estimators accuracy plateaus while inference cost grows. 200–500 is usually the sweet spot.
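One cheap way to see the plateau on your own data (a sketch assuming a training set X_train, y_train): grow the same forest in stages with warm_start and watch the out-of-bag score level off.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(warm_start=True, oob_score=True, n_jobs=-1, random_state=42)
for n in (100, 200, 500, 1000):
    rf.set_params(n_estimators=n)   # warm_start keeps the trees already grown and only adds new ones
    rf.fit(X_train, y_train)
    print(n, "trees -> OOB accuracy:", round(rf.oob_score_, 4))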

Biased feature importance

Impurity-based importance favors high-cardinality features. Prefer permutation importance for honest comparisons.

Thinking it can't overfit

Random forest is robust but not immune. On small datasets training scores can look near-perfect while test scores sag; verify with cross-validation.
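A quick sanity check along those lines, assuming a training set X_train, y_train: compare training accuracy with cross-validated accuracy and trust the latter.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print("train accuracy:", rf.score(X_train, y_train))                              # often near 1.0 with fully grown trees
print("5-fold CV accuracy:", cross_val_score(rf, X_train, y_train, cv=5).mean())  # the honest number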