Decision trees are one of the most intuitive model families in machine learning. They’re also the foundation for the algorithms that win most tabular ML competitions — random forests, XGBoost, and LightGBM. Understanding trees deeply means understanding why ensembles work.
How Decision Trees Work
Two Ways to Think About Them
Programmatically: a decision tree is a nested sequence of if-else rules. At each node, it asks a question about one feature (Is age > 35? Is revenue < 10000?), splits the data into two groups, and recurses.
Geometrically: a decision tree partitions the feature space into axis-aligned regions (rectangles in 2D, hyperrectangles in higher dimensions). Each leaf corresponds to one region. The prediction for any new point is determined by which region it falls into.
Both views are correct. The geometric view explains why trees can approximate nonlinear boundaries — they can chop space into as many small boxes as needed — and why they generalize poorly without constraints: enough splits and the tree just memorizes the training set.
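To make the programmatic view concrete, here is roughly what a fitted depth-2 tree reduces to; the feature names and thresholds are invented for illustration:

```python
# A depth-2 decision tree is literally nested if-else on single features.
# Feature names and thresholds here are hypothetical, not from a real model.
def predict_churn(age: float, monthly_revenue: float) -> str:
    if age > 35:
        if monthly_revenue < 10_000:
            return "churn"   # leaf: region {age > 35, revenue < 10k}
        return "stay"        # leaf: region {age > 35, revenue >= 10k}
    return "stay"            # leaf: region {age <= 35}
```

Each return statement is a leaf, and each leaf is one axis-aligned box in (age, revenue) space.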
Splitting Criterion
At each node, the algorithm considers every feature and every possible split value, and picks the split that best separates the classes (for classification) or reduces variance (for regression).
Information Gain (classification):
$$IG(Y, \text{var}) = H(Y) - \sum_i \frac{|D_i|}{|D|} H_{D_i}(Y)$$
where $H$ is entropy. We want high information gain — the split should make each child node more “pure” than the parent.
Gini Impurity (classification, preferred in practice):
$$I_G(Y) = 1 - \sum_i (P(Y = y_i))^2$$
Gini impurity achieves the same goal as information gain but avoids the logarithm, which is computationally cheaper. Most implementations use Gini.
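A minimal NumPy sketch of both criteria for a binary split (the partition into left and right children is assumed to be given):

```python
import numpy as np

def entropy(labels):
    # H(Y) = -sum_i p_i log2(p_i) over class proportions in this node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # I_G(Y) = 1 - sum_i p_i^2 -- same goal, no logarithm
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # Weighted decrease in entropy from splitting parent into (left, right)
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# A split that perfectly separates the two classes has maximal gain (1 bit here)
y = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(y, y[:3], y[3:]))  # 1.0
```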
MSE Reduction (regression):
$$\text{MSE Reduction} = MSE_{\text{parent}} - \sum_{\text{children } c} \frac{|D_c|}{|D|} MSE_{c}$$
The tree predicts the mean of training targets in each leaf.
Why Trees Overfit
As tree depth increases, leaves contain fewer and fewer examples. A deep enough tree reaches a depth where each leaf contains a single training point — perfect training accuracy, random generalization. The fix: limit the maximum depth, require a minimum number of samples per leaf, or prune the tree after training.
Key Properties
- No feature scaling required. Splits are based on rank, not magnitude. Standardizing features doesn’t change the tree.
- Handles mixed data types. Categorical and numerical features in the same model without special treatment.
- High-cardinality categoricals are a problem. One-hot encoding creates many near-empty leaves. Better: target encode — replace each category with the conditional probability vector $P(y = i | \text{category} = j)$ (a minimal sketch follows this list).
- Fast inference. Prediction cost is O(depth), which is O(log n) for balanced trees. Suitable for low-latency applications.
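A minimal target-encoding sketch for a binary target, using pandas; the column names and values are hypothetical, and the encoding is computed on the full frame only for brevity — in practice it should be fit on training folds only to avoid target leakage:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["paris", "paris", "oslo", "oslo", "oslo", "lima"],  # hypothetical high-cardinality feature
    "churned": [1, 0, 0, 0, 1, 1],                               # binary target
})

# Replace each category with P(y = 1 | category), estimated from the data.
encoding = df.groupby("city")["churned"].mean()
df["city_encoded"] = df["city"].map(encoding)
print(df[["city", "city_encoded"]])
```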
Bootstrap Aggregation — Bagging
The idea: high-variance, low-bias models (deep trees) fail because they’re too sensitive to the specific training set. What if we trained many of them on different data samples and averaged?
- Bootstrap sampling: Sample k datasets of size m from training data, with replacement
- Train independently: build a separate model on each sample
- Aggregate: majority vote (classification) or mean (regression)
The noise in training data affects each sample differently. When you average many noisy predictions, the noise cancels out. The result: reduced variance, nearly unchanged bias.
Bagging works because the individual models are only weakly correlated — each sees a different bootstrap resample of the data, so their errors partially cancel when averaged.
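A minimal bagging sketch, assuming X and y are NumPy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged(X, y, k=50, seed=0):
    """Train k deep trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # sample n rows with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def predict_bagged(models, X):
    # Aggregate: mean for regression (use majority vote for classification)
    return np.mean([m.predict(X) for m in models], axis=0)
```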
Random Forest
Random Forest adds one twist to bagging: at each split, only a random subset of features is considered (typically √p for classification, p/3 for regression). This forces the trees to be more different from each other, which makes averaging them more effective.
Out-of-bag error: each training point is excluded from roughly 37% of bootstrap samples. These excluded examples can be used as a validation set without holding out separate data — no cross-validation needed.
Feature importance: computed as the total reduction in impurity across all splits that use a given feature, averaged across all trees. Gives a useful signal about which features matter, though it’s known to be biased toward high-cardinality features.
Hyperparameters that matter:
- Number of trees: more trees rarely hurts accuracy, only compute, and returns diminish quickly; ~500 is a reasonable default
- Max features per split: tune this, it controls correlation between trees
- Max depth and min samples per leaf: controls individual tree complexity
Random Forest parallelizes perfectly — trees are independent. Training time scales linearly with number of trees.
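A minimal scikit-learn sketch putting these pieces together; the dataset is synthetic and the hyperparameter values follow the defaults suggested above rather than any tuning:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,      # default starting point from the list above
    max_features="sqrt",   # ~sqrt(p) features considered at each split
    min_samples_leaf=2,    # individual-tree complexity control
    oob_score=True,        # out-of-bag samples act as a free validation set
    n_jobs=-1,             # trees are independent, so they train in parallel
    random_state=0,
)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)
print("Feature importances:", rf.feature_importances_[:5])
```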
Boosting
Bagging makes models uncorrelated by training them independently. Boosting does the opposite: it builds models sequentially, where each model tries to correct the errors of the previous one.
The algorithm:
- Fit a weak model $h_0$ on the original targets, giving $F_0(x) = \alpha_0 h_0(x)$. Compute residuals $e_i = y_i - F_0(x_i)$.
- Fit $h_1$ on the residuals. Now $F_1(x) = \alpha_0 h_0(x) + \alpha_1 h_1(x)$.
- Compute new residuals, fit $h_2$ on those.
- Repeat for $M$ stages.
Each model corrects what the previous ensemble got wrong. Start with a biased, high-error model; add models until bias is eliminated.
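A minimal sketch of this loop for squared-error loss, using shallow scikit-learn trees as the weak models and assuming NumPy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    """Squared-error boosting: every shallow tree fits the current residuals."""
    f0 = y.mean()                    # F_0: a constant, maximally biased model
    f = np.full(len(y), f0)
    trees = []
    for _ in range(n_stages):
        residuals = y - f            # what the current ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        f = f + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_boosted(f0, trees, X, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```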
Why Gradient Boosting?
The name comes from viewing boosting as gradient descent in function space. For any differentiable loss function, the negative gradient of the loss with respect to the current predictions tells you which direction to push the model.
For MSE loss, the negative gradient is exactly the residual. For other losses (log loss, quantile), you get different “pseudo-residuals” that the next model targets.
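To see why, take squared-error loss on a single example:

$$L(y, F) = \tfrac{1}{2}(y - F)^2 \quad\Rightarrow\quad -\frac{\partial L}{\partial F} = y - F$$

so fitting the next model to the negative gradient is exactly fitting it to the residual. With log loss, where $F$ is the log-odds and $p = \sigma(F)$, the same computation gives $y - p$, the gap between the label and the predicted probability.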
This unification means you can plug in any loss function and get a boosted model optimized for it. That’s why gradient boosting handles regression, classification, and ranking with the same core algorithm.
XGBoost and LightGBM
Plain gradient boosting is slow (scikit-learn’s GradientBoostingClassifier is not production-grade). XGBoost and LightGBM are engineering triumphs:
XGBoost adds:
- Row subsampling (bagging-like, per tree)
- Column subsampling (per tree and per level)
- L1 (reg_alpha) and L2 (reg_lambda) regularization on leaf weights
- A second-order (gradient and Hessian) approximation of the loss for better split optimization
- Approximate split finding for large datasets
LightGBM further adds:
- Leaf-wise tree growth (vs. level-wise in XGBoost): grows the most impactful leaf at each step, not the full next level. Faster, but needs a num_leaves constraint to prevent overfitting.
- Gradient-based one-side sampling (GOSS): uses more high-gradient points, discards low-gradient ones
- Exclusive feature bundling: compresses sparse features
In practice: LightGBM is faster and often wins on large datasets. XGBoost is more stable on small-medium datasets. Both beat random forests on most tabular tasks.
Key hyperparameters (same across both):
- n_estimators (number of trees): higher is generally better; use early stopping to find the sweet spot
- learning_rate: shrinks each tree’s contribution. Lower = needs more trees but generalizes better. Set to 0.05–0.1 and tune n_estimators.
- max_depth / num_leaves: controls tree complexity
- subsample and colsample_bytree: randomness that prevents overfitting
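A minimal LightGBM sketch wiring these together on a synthetic dataset; the callback-based early stopping reflects recent LightGBM versions (older versions took early_stopping_rounds in fit), and none of the values are tuned:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=30, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = lgb.LGBMClassifier(
    n_estimators=2000,     # upper bound; early stopping picks the real count
    learning_rate=0.05,
    num_leaves=31,         # tree-complexity control for leaf-wise growth
    subsample=0.8,         # row subsampling per tree
    subsample_freq=1,      # required for subsample to take effect in LightGBM
    colsample_bytree=0.8,  # column subsampling per tree
)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print("best iteration:", model.best_iteration_)
```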
AdaBoost
The historical precursor to gradient boosting. Instead of fitting residuals, AdaBoost assigns exponentially higher weights to misclassified examples after each round, forcing subsequent models to focus on hard cases.
Used most successfully in computer vision applications, particularly face detection (the Viola-Jones algorithm). Largely superseded by gradient boosting for general tabular tasks.
Stacking
Stacking is meta-learning: train several base models on the training data, then train a “meta-model” on their predictions.
The meta-model learns how to optimally combine the base models. If one base model is better on a particular region of input space, the meta-model learns to weight it more in that region.
Key rule: base models must make predictions on held-out data (cross-validation predictions) before feeding to the meta-model. Otherwise, the meta-model just learns to trust the most overfit base model.
The more different the base models, the better stacking works. Combining XGBoost, LightGBM, a linear model, and a neural network typically outperforms any single one.
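A minimal sketch of the key rule, using scikit-learn models as stand-ins for the base learners and cross_val_predict to produce the out-of-fold meta-features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

base_models = [
    RandomForestClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
]

# Key rule: meta-features are out-of-fold predictions, never in-sample predictions.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

meta_model = LogisticRegression().fit(meta_features, y)

# At inference time, refit the base models on all training data;
# their predictions then feed the meta-model.
for m in base_models:
    m.fit(X, y)
```

scikit-learn’s StackingClassifier automates this cross-validated wiring if you prefer not to hand-roll it.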
Cascading
A practical pattern for high-stakes decisions: chain models where early-stage simple models filter easy cases, and later-stage complex (or human) models handle ambiguous ones.
Example: fraud detection pipeline
- Simple rule: flag transactions over $10,000. Route to human.
- Logistic regression: flag >90% fraud probability. Block.
- LightGBM: score remaining; flag >70% fraud. Review queue.
- Analyst: manual review of residual high-risk transactions.
Cascading accepts extra latency on hard cases in exchange for accuracy. The simple stages handle 90%+ of volume cheaply; the expensive models apply only where they add value.
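A sketch of the routing logic for the pipeline above; the field names, feature vector, and thresholds are hypothetical, and the two models are assumed to be already fitted:

```python
def route_transaction(tx, logreg, lgbm):
    """Hypothetical cascade mirroring the fraud pipeline above.
    tx is a feature dict; logreg and lgbm expose predict_proba."""
    if tx["amount"] > 10_000:                        # stage 1: hard rule -> human
        return "human_review"
    features = [[tx["amount"], tx["n_prior_tx"]]]    # hypothetical feature vector
    if logreg.predict_proba(features)[0, 1] > 0.90:  # stage 2: cheap model
        return "block"
    if lgbm.predict_proba(features)[0, 1] > 0.70:    # stage 3: expensive model
        return "review_queue"
    return "approve"
```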
Choosing Between Tree Methods
| Situation | Use |
|---|---|
| First baseline on tabular data | Random Forest (robust, minimal tuning) |
| Maximum accuracy on tabular data | XGBoost or LightGBM with tuning |
| Large dataset (>1M rows) | LightGBM (fastest) |
| Interpretability required | Single decision tree or logistic regression |
| Many categorical features | LightGBM (native categorical support) |
| Imbalanced classes | Gradient boosting with scale_pos_weight |
The reason XGBoost and LightGBM dominate tabular ML is not mysterious: boosting efficiently reduces bias by targeting residuals, subsampling and regularization control variance, and the result generalizes better than any individual model. That’s the whole story.