Gradient Boosting
Sequentially correcting trees
Trains hundreds of small trees in sequence, each one focusing on the errors of those before it — typically the first choice for tabular data.
Gradient boosting is the boosting counterpart to random forest's bagging. In bagging, trees are trained in parallel and independently; in boosting they're trained sequentially: each new tree fits the negative gradient of the loss evaluated at the cumulative predictions of the trees before it, so the loss decreases step by step.
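The sequential loop can be sketched in a few lines. With squared error, the negative gradient is just the residual, so each tree simply fits what's left over. A minimal illustration using sklearn's DecisionTreeRegressor on synthetic data (all names and values here are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 500)

learning_rate = 0.1
pred = np.full_like(y, y.mean())  # stage 0: a constant prediction
trees = []
for _ in range(50):
    residual = y - pred  # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * tree.predict(X)  # shrink each tree's contribution
    trees.append(tree)

mse = np.mean((y - pred) ** 2)
```

The learning rate shrinks each tree's correction, which is why boosting needs many trees but generalizes better than taking each tree's fit at full strength.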
Different loss functions are supported (log-loss for classification, MSE for regression, custom objectives), and the boosting framework adapts to each: the trees fit whatever gradient the loss supplies. The key advantage: the model directly minimizes the loss you care about, which is why gradient boosting is rarely outside the top of any tabular ML competition.
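For binary log-loss, the per-sample gradient each tree ends up fitting is simply p − y (predicted probability minus label). A quick numerical check with pure NumPy, comparing the analytic gradient against a finite difference (values are illustrative):

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def logloss(f, y):
    # binary cross-entropy as a function of the raw score f
    p = sigmoid(f)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

f, y = 0.7, 1.0
p = sigmoid(f)
analytic_grad = p - y  # what the next tree is fit against (negated)

eps = 1e-6
numeric_grad = (logloss(f + eps, y) - logloss(f - eps, y)) / (2 * eps)
```

Swapping in a different loss just changes this gradient (and the hessian, for second-order methods like XGBoost); the tree-fitting machinery stays the same.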
Popular libraries: XGBoost (mature, well-documented), LightGBM (extremely fast, scales to large datasets), CatBoost (handles categorical features natively). All implement the same idea with different optimizations.
For a sales forecast you first hire a junior analyst. They get close, but miss by 50K monthly. You hire a second analyst and tell them to focus on the junior's mistakes. Then you add a third: "close the gap left by the first two." The three together far outperform any one of them, because each targets the shortfalls of the previous. That's gradient boosting.
A credit card company wants a per-transaction fraud score. Data is highly imbalanced (0.1% fraud), high-dimensional (200+ features), with strongly nonlinear interactions (location + time + amount jointly matter). Logistic regression hits AUC 0.78, random forest 0.86, LightGBM 0.93.
Training proceeds in stages: the first 100 trees learn the coarse patterns (high amount + foreign country = risky); the next 100 capture subtler combinations (night + small amount + new card = a different fraud archetype). By tree 1,500 each tree is shaving off only a last bit of residual error. The end result is fast (50K transactions/sec at inference) and accurate.
```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# X, y: feature matrix and binary fraud labels
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

params = {
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.05,
    "num_leaves": 63,
    "feature_fraction": 0.8,  # column subsampling per tree
    "bagging_fraction": 0.8,  # row subsampling
    "bagging_freq": 5,        # resample rows every 5 iterations
    "is_unbalance": True,     # reweight the rare fraud class
    "verbose": -1,
}

model = lgb.train(
    params,
    train_set,
    num_boost_round=2000,
    valid_sets=[val_set],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
)

preds = model.predict(X_val)
print(f"AUC: {roc_auc_score(y_val, preds):.4f}")
```

Use it when:
- Tabular data where accuracy is the priority; usually the first pick
- Complex nonlinear relationships and feature interactions
- Both classification and regression in one stack
- Less time spent on feature engineering — boosting is forgiving
Avoid it when:
- Tiny datasets (<1K rows): overfit risk; try linear models first
- Strict explainability without SHAP — boosting is mostly a black box
- Unstructured data: images, audio, text — deep learning dominates
Common pitfalls
No early stopping
Train for too many rounds and you overfit. Use early stopping against a held-out validation set (e.g. LightGBM's early_stopping callback or XGBoost's early_stopping_rounds).
Wrong overfit knobs
max_depth, num_leaves, min_child_samples, and L1/L2 regularization all interact. Don't tune them one at a time; search them jointly with Optuna or similar.
Mishandling categoricals
XGBoost and LightGBM expect categoricals to be encoded first (LightGBM can split on integer-coded categories via categorical_feature; one-hot is the safe default). CatBoost consumes raw string categories natively; use it if you have many.