ml 14 min read 22 January 2024

Decision Trees and Ensembles: Intuition First

How decision trees work, why they overfit, and how ensemble methods — bagging, boosting, and stacking — transform weak learners into the models that dominate tabular ML competitions.

Decision trees are one of the most intuitive model families in machine learning. They’re also the foundation for the algorithms that win most tabular ML competitions — random forests, XGBoost, and LightGBM. Understanding trees deeply means understanding why ensembles work.

How Decision Trees Work

Two Ways to Think About Them

Programmatically: a decision tree is a nested sequence of if-else rules. At each node, it asks a question about one feature (Is age > 35? Is revenue < 10000?), splits the data into two groups, and recurses.

Geometrically: a decision tree partitions the feature space into axis-aligned hyperrectangles (rectangles in 2D, boxes in higher dimensions). Each leaf corresponds to one region. The prediction for any new point is determined by which region it falls into.

Both views are correct. The geometric view explains why trees can approximate nonlinear boundaries — they can chop space into as many small boxes as needed — and why they generalize poorly without constraints: enough splits and the tree just memorizes the training set.
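The if-else view can be made concrete with scikit-learn: `export_text` prints a fitted tree as literal nested rules. This is a minimal sketch; the data and feature names are invented for illustration.

```python
# Sketch: the "nested if-else" view of a fitted tree, using scikit-learn.
# Data and feature names are made up for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 4000], [45, 12000], [35, 8000], [52, 30000],
     [23, 3000], [40, 20000], [30, 6000], [60, 45000]]
y = [0, 1, 0, 1, 0, 1, 0, 1]  # e.g. 1 = high-value customer

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The printout is literally a set of nested if-else rules,
# each testing a single feature against a threshold.
print(export_text(tree, feature_names=["age", "revenue"]))
```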

Splitting Criterion

At each node, the algorithm considers every feature and every possible split value, and picks the split that best separates the classes (for classification) or reduces variance (for regression).

Information Gain (classification):

$$IG(Y, \text{var}) = H(Y) - \sum_i \frac{|D_i|}{|D|} H_{D_i}(Y)$$

where $H$ is entropy. We want high information gain — the split should make each child node more “pure” than the parent.

Gini Impurity (classification, preferred in practice):

$$I_G(Y) = 1 - \sum_i (P(Y = y_i))^2$$

Gini impurity achieves the same goal as information gain but avoids the logarithm, which is computationally cheaper. Most implementations use Gini.
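Both criteria are a few lines of code. The sketch below, with invented labels and an invented split, computes entropy, Gini impurity, and the information gain of one candidate split:

```python
# Sketch: entropy, Gini impurity, and information gain for one candidate split.
# The labels and the split are invented for illustration.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]   # children after some split

# IG = H(parent) - weighted sum of child entropies
ig = entropy(parent) - sum(len(d) / len(parent) * entropy(d) for d in (left, right))
print(f"information gain: {ig:.3f}")        # > 0: the split adds purity
print(f"gini parent: {gini(parent):.3f}")   # 0.5 for a balanced binary node
```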

MSE Reduction (regression):

$$\text{MSE Reduction} = MSE_{\text{parent}} - \sum_{\text{children } i} \frac{|D_i|}{|D|} MSE_{D_i}$$

The tree predicts the mean of training targets in each leaf.

Why Trees Overfit

As tree depth increases, leaves contain fewer and fewer examples. A deep enough tree reaches a depth where each leaf contains a single training point — perfect training accuracy, random generalization. The fix: limit depth, require a minimum number of samples per leaf, or prune/regularize during training.
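This is easy to demonstrate. A sketch on synthetic, noisy data (exact scores depend on the random seed): an unconstrained tree versus one with a depth cap and a leaf-size floor.

```python
# Sketch: deep trees memorize noisy training data; a depth limit and a
# minimum leaf size restore generalization. Synthetic data; scores vary by seed.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2,
                           random_state=0)  # flip_y: 20% label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
capped = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                random_state=0).fit(X_tr, y_tr)

print(f"deep:   train={deep.score(X_tr, y_tr):.2f}  test={deep.score(X_te, y_te):.2f}")
print(f"capped: train={capped.score(X_tr, y_tr):.2f}  test={capped.score(X_te, y_te):.2f}")
# Typically: the deep tree hits ~1.00 on train but drops on test;
# the capped tree gives up training accuracy and gains test accuracy.
```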

Key Properties

  - No feature scaling needed: each split compares a single feature to a threshold, so monotonic transforms of a feature leave the tree unchanged.
  - Nonlinear: with enough depth, trees can fit arbitrary axis-aligned boundaries.
  - Interpretable: a shallow tree is a readable set of rules.
  - High variance: small perturbations of the training data can produce a very different tree — the instability that motivates the ensemble methods below.

Bootstrap Aggregation — Bagging

The idea: high-variance, low-bias models (deep trees) fail because they’re too sensitive to the specific training set. What if we trained many of them on different data samples and averaged?

  1. Bootstrap sampling: Sample k datasets of size m from training data, with replacement
  2. Train independently: build a separate model on each sample
  3. Aggregate: majority vote (classification) or mean (regression)

The noise in training data affects each sample differently. When you average many noisy predictions, the noise cancels out. The result: reduced variance, nearly unchanged bias.

Bagging works to the extent that the individual models are decorrelated — each sees a different resample of the data. The less correlated the models' errors, the more the averaging reduces variance.
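The three steps above can be sketched directly, using shallow scikit-learn trees as base models on synthetic data (`k` and the majority-vote rule follow the recipe; the dataset is invented):

```python
# Sketch of the three bagging steps, with sklearn trees as base models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)
k, m = 25, len(X)

models = []
for _ in range(k):
    idx = rng.integers(0, m, size=m)   # 1. bootstrap: m rows, with replacement
    models.append(DecisionTreeClassifier(random_state=0)
                  .fit(X[idx], y[idx]))  # 2. train independently

votes = np.stack([t.predict(X) for t in models])         # shape (k, n)
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)   # 3. majority vote
print("ensemble training accuracy:", (ensemble_pred == y).mean())
```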

Random Forest

Random Forest adds one twist to bagging: at each split, only a random subset of features is considered (typically √p for classification, p/3 for regression). This forces the trees to be more different from each other, which makes averaging them more effective.

Out-of-bag error: each training point is absent from roughly 37% ($\approx 1/e$) of the bootstrap samples. Those trees can score it as if it were unseen data, giving a built-in validation estimate — no separate holdout or cross-validation needed.

Feature importance: computed as the total reduction in impurity across all splits that use a given feature, averaged across all trees. Gives a useful signal about which features matter, though it’s known to be biased toward high-cardinality features.
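Both the OOB score and impurity-based importances come for free in scikit-learn's `RandomForestClassifier`. A minimal sketch on synthetic data:

```python
# Sketch: out-of-bag accuracy and impurity-based feature importances
# from scikit-learn's RandomForestClassifier. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)

print("OOB accuracy:", round(rf.oob_score_, 3))          # validation-like, no holdout
print("importances:", rf.feature_importances_.round(2))  # sum to 1 across features
```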

Hyperparameters that matter:

  - n_estimators: more trees rarely hurt accuracy, only training time; a few hundred is typical.
  - max_features: the size of the random feature subset per split — the main knob controlling tree decorrelation.
  - max_depth / min_samples_leaf: usually left loose, since the forest averages away individual-tree overfitting.

Random Forest parallelizes perfectly — trees are independent. Training time scales linearly with number of trees.

Boosting

Bagging makes models uncorrelated by training them independently. Boosting does the opposite: it builds models sequentially, where each model tries to correct the errors of the previous one.

The algorithm:

  1. Fit a weak model $h_0$ on the original targets, giving $F_0(x) = \alpha_0 h_0(x)$. Compute residuals $e_i = y_i - F_0(x_i)$.
  2. Fit $h_1$ on the residuals. Now $F_1(x) = \alpha_0 h_0(x) + \alpha_1 h_1(x)$.
  3. Compute new residuals, fit $h_2$ on those.
  4. Repeat for $M$ stages: $F_M(x) = \sum_{m=0}^{M} \alpha_m h_m(x)$.

Each model corrects what the previous ensemble got wrong. Start with a biased, high-error model; add models until bias is eliminated.
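The residual-fitting loop is short enough to write by hand. A sketch with depth-2 regression trees as the weak models, a fixed learning rate, and invented 1-D data:

```python
# Sketch of the boosting loop: repeatedly fit a small tree to the
# current residuals and add a scaled correction. Synthetic 1-D data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

F = np.full_like(y, y.mean())   # stage 0: constant prediction
alpha, stages = 0.3, 50
for _ in range(stages):
    resid = y - F                                   # current residuals
    h = DecisionTreeRegressor(max_depth=2).fit(X, resid)
    F += alpha * h.predict(X)                       # add the correction

print("final train MSE:", round(float(np.mean((y - F) ** 2)), 4))
# MSE shrinks toward the noise floor as stages accumulate.
```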

Why Gradient Boosting?

The name comes from viewing boosting as gradient descent in function space. For any differentiable loss function, the negative gradient of the loss with respect to the current predictions tells you which direction to push the model.

For MSE loss, the negative gradient is exactly the residual. For other losses (log loss, quantile), you get different “pseudo-residuals” that the next model targets.

This unification means you can plug in any loss function and get a boosted model optimized for it. That’s why gradient boosting handles regression, classification, and ranking with the same core algorithm.
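To make the "negative gradient" claim concrete, here is a small check with invented predictions: for squared error the negative gradient is exactly the residual, while for log loss on logits it is $y - \sigma(F)$, the pseudo-residual the next tree targets.

```python
# Sketch: negative gradients of two losses w.r.t. current predictions F.
import numpy as np

y = np.array([1.0, 0.0, 1.0])
F = np.array([0.2, -0.1, 2.0])       # current ensemble scores (made up)

# L = 1/2 (y - F)^2   ->   -dL/dF = y - F : the plain residual
mse_pseudo = y - F

# log loss on logits, p = sigmoid(F)   ->   -dL/dF = y - p
logloss_pseudo = y - 1 / (1 + np.exp(-F))

print(mse_pseudo)       # plain residuals
print(logloss_pseudo)   # bounded in (-1, 1)
```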

XGBoost and LightGBM

Plain gradient boosting is slow (scikit-learn’s GradientBoostingClassifier is not production-grade). XGBoost and LightGBM are engineering triumphs:

XGBoost adds:

  - A regularized objective: L1/L2 penalties on leaf weights, built into the split criterion.
  - Second-order optimization: uses both gradients and Hessians of the loss.
  - Sparsity-aware splits: a learned default direction for missing values.
  - Column subsampling, parallelized split finding, and cache-aware data layout.

LightGBM further adds:

  - Histogram-based split finding: bins continuous features, cutting both memory and compute.
  - Leaf-wise (best-first) tree growth instead of level-wise, reaching lower loss for the same leaf budget.
  - Gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB) for additional speed.
  - Native categorical feature support.

In practice: LightGBM is faster and often wins on large datasets. XGBoost is more stable on small-medium datasets. Both beat random forests on most tabular tasks.

Key hyperparameters (same across both):

  - learning_rate: shrinkage per stage; lower generalizes better, balanced against more trees.
  - n_estimators: number of boosting stages; tune jointly with the learning rate, ideally via early stopping.
  - max_depth / num_leaves: per-tree capacity.
  - subsample / colsample_bytree: row and column subsampling to control variance.

AdaBoost

The historical precursor to gradient boosting. Instead of fitting residuals, AdaBoost assigns exponentially higher weights to misclassified examples after each round, forcing subsequent models to focus on hard cases.

Used most successfully in computer vision applications, particularly face detection (the Viola-Jones algorithm). Largely superseded by gradient boosting for general tabular tasks.

Stacking

Stacking is meta-learning: train several base models on the training data, then train a “meta-model” on their predictions.

The meta-model learns how to optimally combine the base models. If one base model is better on a particular region of input space, the meta-model learns to weight it more in that region.

Key rule: base models must make predictions on held-out data (cross-validation predictions) before feeding to the meta-model. Otherwise, the meta-model just learns to trust the most overfit base model.

The more different the base models, the better stacking works. Combining XGBoost, LightGBM, a linear model, and a neural network typically outperforms any single one.
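Scikit-learn's `StackingClassifier` enforces the held-out rule automatically: the meta-model is trained on cross-validated base-model predictions. A minimal sketch with two base models on synthetic data:

```python
# Sketch: stacking with scikit-learn. The meta-model (final_estimator) is
# trained on cross-validated base predictions, per the held-out rule above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(),   # meta-model over base predictions
    cv=5,                                   # base predictions come from 5-fold CV
).fit(X, y)
print("training accuracy:", round(stack.score(X, y), 3))
```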

Cascading

A practical pattern for high-stakes decisions: chain models where early-stage simple models filter easy cases, and later-stage complex (or human) models handle ambiguous ones.

Example: fraud detection pipeline

  1. Simple rule: flag transactions over $10,000. Route to human.
  2. Logistic regression: flag >90% fraud probability. Block.
  3. LightGBM: score remaining; flag >70% fraud. Review queue.
  4. Analyst: manual review of residual high-risk transactions.

Cascading trades off latency for accuracy on hard cases. The simple models handle 90%+ of volume cheaply; expensive models apply only where they add value.
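The pipeline above can be sketched as a chain of stages, each of which either decides or passes the transaction onward. The model stand-ins and thresholds below are invented for illustration; real stages would wrap fitted models.

```python
# Sketch of the fraud-detection cascade, with made-up model stand-ins.
def rule_stage(txn):
    return "human" if txn["amount"] > 10_000 else None    # 1. simple rule

def lr_stage(txn):
    return "block" if txn["lr_score"] > 0.90 else None    # 2. logistic regression

def gbm_stage(txn):
    return "review" if txn["gbm_score"] > 0.70 else None  # 3. LightGBM score

def cascade(txn):
    # First stage to return a decision wins; later (costlier) stages
    # only run on what earlier stages could not settle.
    for stage in (rule_stage, lr_stage, gbm_stage):
        decision = stage(txn)
        if decision:
            return decision
    return "approve"   # cheap default for the easy majority

print(cascade({"amount": 50, "lr_score": 0.2, "gbm_score": 0.1}))       # approve
print(cascade({"amount": 50_000, "lr_score": 0.99, "gbm_score": 0.9}))  # human
```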

Choosing Between Tree Methods

| Situation | Use |
| --- | --- |
| First baseline on tabular data | Random Forest (robust, minimal tuning) |
| Maximum accuracy on tabular data | XGBoost or LightGBM with tuning |
| Large dataset (>1M rows) | LightGBM (fastest) |
| Interpretability required | Single decision tree or logistic regression |
| Many categorical features | LightGBM (native categorical support) |
| Imbalanced classes | Gradient boosting with scale_pos_weight |

The reason XGBoost and LightGBM dominate tabular ML is not mysterious: boosting efficiently reduces bias by targeting residuals, subsampling and regularization control variance, and the result generalizes better than any individual model. That’s the whole story.

decision-trees random-forest xgboost gradient-boosting ensembles machine-learning
