Decision trees are one of the most intuitive model families in machine learning. They’re also the foundation for the algorithms that win most tabular ML competitions — random forests, XGBoost, and LightGBM. Understanding trees deeply means understanding why ensembles work.
How Decision Trees Work
Two Ways to Think About Them
Programmatically: a decision tree is a nested sequence of if-else rules. At each node, it asks a question about one feature (Is age > 35? Is revenue < 10000?), splits the data into two groups, and recurses.
Geometrically: a decision tree partitions the feature space into axis-aligned regions (rectangles in 2D, hyperrectangles in higher dimensions). Each leaf corresponds to one region. The prediction for any new point is determined by which region it falls into.
Both views are correct. The geometric view explains why trees can approximate nonlinear boundaries — they can chop space into as many small boxes as needed — and why they generalize poorly without constraints: enough splits and the tree just memorizes the training set.
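To make the programmatic view concrete, here is roughly what a fitted depth-2 tree reduces to; the feature names and thresholds are invented for illustration:

```python
# A depth-2 decision tree is literally nested if-else on single features.
# Feature names and thresholds here are hypothetical, not from a real model.
def predict_churn(age: float, monthly_revenue: float) -> str:
    if age > 35:
        if monthly_revenue < 10_000:
            return "churn"   # leaf: region {age > 35, revenue < 10k}
        return "stay"        # leaf: region {age > 35, revenue >= 10k}
    return "stay"            # leaf: region {age <= 35}
```

Each return statement is a leaf, and each leaf is one axis-aligned box in (age, revenue) space.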
Splitting Criterion
At each node, the algorithm considers every feature and every possible split value, and picks the split that best separates the classes (for classification) or reduces variance (for regression).
Information Gain (classification):
$$IG(Y, \text{var}) = H(Y) - \sum_i \frac{|D_i|}{|D|} H_{D_i}(Y)$$
where $H$ is entropy. We want high information gain — the split should make each child node more “pure” than the parent.
Gini Impurity (classification, preferred in practice):
$$I_G(Y) = 1 - \sum_i (P(Y = y_i))^2$$
Gini impurity achieves the same goal as information gain but avoids the logarithm, which is computationally cheaper. Most implementations use Gini.
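A minimal NumPy sketch of both criteria for a binary split (the partition into left and right children is assumed to be given):

```python
import numpy as np

def entropy(labels):
    # H(Y) = -sum_i p_i log2(p_i) over class proportions in this node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # I_G(Y) = 1 - sum_i p_i^2 -- same goal, no logarithm
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # Weighted decrease in entropy from splitting parent into (left, right)
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# A split that perfectly separates the two classes has maximal gain (1 bit here)
y = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(y, y[:3], y[3:]))  # 1.0
```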
MSE Reduction (regression):
$$\text{MSE Reduction} = MSE_{\text{parent}} - \sum_{\text{children } c} \frac{|D_c|}{|D|} MSE_{c}$$
The tree predicts the mean of training targets in each leaf.
Why Trees Overfit
As tree depth increases, leaves contain fewer and fewer examples. A deep enough tree reaches a depth where each leaf contains a single training point — perfect training accuracy, random generalization. The fix: limit the maximum depth, require a minimum number of samples per leaf, or prune the tree after training.
Key Properties
- No feature scaling required. Splits are based on rank, not magnitude. Standardizing features doesn’t change the tree.
- Handles mixed data types. Categorical and numerical features in the same model without special treatment.
- High-cardinality categoricals are a problem. One-hot encoding creates many near-empty leaves. Better: target encode — replace each category with the conditional probability vector $P(y = i | \text{category} = j)$ (a minimal sketch follows this list).
- Fast inference. Prediction cost is O(depth), which is O(log n) for balanced trees. Suitable for low-latency applications.
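A minimal target-encoding sketch for a binary target, using pandas; the column names and values are hypothetical, and the encoding is computed on the full frame only for brevity — in practice it should be fit on training folds only to avoid target leakage:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["paris", "paris", "oslo", "oslo", "oslo", "lima"],  # hypothetical high-cardinality feature
    "churned": [1, 0, 0, 0, 1, 1],                               # binary target
})

# Replace each category with P(y = 1 | category), estimated from the data.
encoding = df.groupby("city")["churned"].mean()
df["city_encoded"] = df["city"].map(encoding)
print(df[["city", "city_encoded"]])
```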
Bootstrap Aggregation — Bagging
The idea: high-variance, low-bias models (deep trees) fail because they’re too sensitive to the specific training set. What if we trained many of them on different data samples and averaged?
- Bootstrap sampling: Sample k datasets of size m from training data, with replacement
- Train independently: build a separate model on each sample
- Aggregate: majority vote (classification) or mean (regression)
The noise in training data affects each sample differently. When you average many noisy predictions, the noise cancels out. The result: reduced variance, nearly unchanged bias.
Bagging works because the individual models are only weakly correlated — each sees a different bootstrap resample of the data, so their errors partially cancel when averaged.
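A minimal bagging sketch, assuming X and y are NumPy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged(X, y, k=50, seed=0):
    """Train k deep trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # sample n rows with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def predict_bagged(models, X):
    # Aggregate: mean for regression (use majority vote for classification)
    return np.mean([m.predict(X) for m in models], axis=0)
```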
Random Forest
Random Forest adds one twist to bagging: at each split, only a random subset of features is considered (typically √p for classification, p/3 for regression). This forces the trees to be more different from each other, which makes averaging them more effective.
Out-of-bag error: each training point is excluded from roughly 37% of bootstrap samples. These excluded examples can be used as a validation set without holding out separate data — no cross-validation needed.
Feature importance: computed as the total reduction in impurity across all splits that use a given feature, averaged across all trees. Gives a useful signal about which features matter, though it’s known to be biased toward high-cardinality features.
Hyperparameters that matter:
- Number of trees: more trees rarely hurts accuracy, only compute, and returns diminish quickly; ~500 is a reasonable default
- Max features per split: tune this, it controls correlation between trees
- Max depth and min samples per leaf: controls individual tree complexity
Random Forest parallelizes perfectly — trees are independent. Training time scales linearly with number of trees.
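A minimal scikit-learn sketch putting these pieces together; the dataset is synthetic and the hyperparameter values follow the defaults suggested above rather than any tuning:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,      # default starting point from the list above
    max_features="sqrt",   # ~sqrt(p) features considered at each split
    min_samples_leaf=2,    # individual-tree complexity control
    oob_score=True,        # out-of-bag samples act as a free validation set
    n_jobs=-1,             # trees are independent, so they train in parallel
    random_state=0,
)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)
print("Feature importances:", rf.feature_importances_[:5])
```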
Boosting
Bagging makes models uncorrelated by training them independently. Boosting does the opposite: it builds models sequentially, where each model tries to correct the errors of the previous one.
The algorithm:
- Fit a weak model $h_0$ on the original targets, giving $F_0(x) = \alpha_0 h_0(x)$. Compute residuals $e_i = y_i - F_0(x_i)$.
- Fit $h_1$ on the residuals. Now $F_1(x) = \alpha_0 h_0(x) + \alpha_1 h_1(x)$.
- Compute new residuals, fit $h_2$ on those.
- Repeat for $M$ stages.
Each model corrects what the previous ensemble got wrong. Start with a biased, high-error model; add models until bias is eliminated.
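A minimal sketch of this loop for squared-error loss, using shallow scikit-learn trees as the weak models and assuming NumPy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    """Squared-error boosting: every shallow tree fits the current residuals."""
    f0 = y.mean()                    # F_0: a constant, maximally biased model
    f = np.full(len(y), f0)
    trees = []
    for _ in range(n_stages):
        residuals = y - f            # what the current ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        f = f + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_boosted(f0, trees, X, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```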
Why Gradient Boosting?
The name comes from viewing boosting as gradient descent in function space. For any differentiable loss function, the negative gradient of the loss with respect to the current predictions tells you which direction to push the model.
For MSE loss, the negative gradient is exactly the residual. For other losses (log loss, quantile), you get different “pseudo-residuals” that the next model targets.
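To see why, take squared-error loss on a single example:

$$L(y, F) = \tfrac{1}{2}(y - F)^2 \quad\Rightarrow\quad -\frac{\partial L}{\partial F} = y - F$$

so fitting the next model to the negative gradient is exactly fitting it to the residual. With log loss, where $F$ is the log-odds and $p = \sigma(F)$, the same computation gives $y - p$, the gap between the label and the predicted probability.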
This unification means you can plug in any loss function and get a boosted model optimized for it. That’s why gradient boosting handles regression, classification, and ranking with the same core algorithm.
XGBoost and LightGBM
Plain gradient boosting is slow (scikit-learn’s GradientBoostingClassifier is not production-grade). XGBoost and LightGBM are engineering triumphs:
XGBoost adds:
- Row subsampling (bagging-like, per tree)
- Column subsampling (per tree and per level)
- L1 (reg_alpha) and L2 (reg_lambda) regularization on leaf weights
- A second-order (gradient and Hessian) approximation of the loss for better split optimization
- Approximate split finding for large datasets
LightGBM further adds:
- Leaf-wise tree growth (vs. level-wise in XGBoost): grows the most impactful leaf at each step, not the full next level. Faster, but needs a num_leaves constraint to prevent overfitting.
- Gradient-based one-side sampling (GOSS): uses more high-gradient points, discards low-gradient ones
- Exclusive feature bundling: compresses sparse features
In practice: LightGBM is faster and often wins on large datasets. XGBoost is more stable on small-medium datasets. Both beat random forests on most tabular tasks.
Key hyperparameters (same across both):
- n_estimators (number of trees): higher is generally better; use early stopping to find the sweet spot
- learning_rate: shrinks each tree’s contribution. Lower = needs more trees but generalizes better. Set to 0.05–0.1 and tune n_estimators.
- max_depth / num_leaves: controls tree complexity
- subsample and colsample_bytree: randomness that prevents overfitting
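A minimal LightGBM sketch wiring these together on a synthetic dataset; the callback-based early stopping reflects recent LightGBM versions (older versions took early_stopping_rounds in fit), and none of the values are tuned:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=30, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = lgb.LGBMClassifier(
    n_estimators=2000,     # upper bound; early stopping picks the real count
    learning_rate=0.05,
    num_leaves=31,         # tree-complexity control for leaf-wise growth
    subsample=0.8,         # row subsampling per tree
    subsample_freq=1,      # required for subsample to take effect in LightGBM
    colsample_bytree=0.8,  # column subsampling per tree
)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print("best iteration:", model.best_iteration_)
```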
AdaBoost
The historical precursor to gradient boosting. Instead of fitting residuals, AdaBoost assigns exponentially higher weights to misclassified examples after each round, forcing subsequent models to focus on hard cases.
Used most successfully in computer vision applications, particularly face detection (the Viola-Jones algorithm). Largely superseded by gradient boosting for general tabular tasks.
Stacking
Stacking is meta-learning: train several base models on the training data, then train a “meta-model” on their predictions.
The meta-model learns how to optimally combine the base models. If one base model is better on a particular region of input space, the meta-model learns to weight it more in that region.
Key rule: base models must make predictions on held-out data (cross-validation predictions) before feeding to the meta-model. Otherwise, the meta-model just learns to trust the most overfit base model.
The more different the base models, the better stacking works. Combining XGBoost, LightGBM, a linear model, and a neural network typically outperforms any single one.
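A minimal sketch of the key rule, using scikit-learn models as stand-ins for the base learners and cross_val_predict to produce the out-of-fold meta-features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

base_models = [
    RandomForestClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
]

# Key rule: meta-features are out-of-fold predictions, never in-sample predictions.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

meta_model = LogisticRegression().fit(meta_features, y)

# At inference time, refit the base models on all training data;
# their predictions then feed the meta-model.
for m in base_models:
    m.fit(X, y)
```

scikit-learn’s StackingClassifier automates this cross-validated wiring if you prefer not to hand-roll it.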
Cascading
A practical pattern for high-stakes decisions: chain models where early-stage simple models filter easy cases, and later-stage complex (or human) models handle ambiguous ones.
Example: fraud detection pipeline
- Simple rule: flag transactions over $10,000. Route to human.
- Logistic regression: flag >90% fraud probability. Block.
- LightGBM: score remaining; flag >70% fraud. Review queue.
- Analyst: manual review of residual high-risk transactions.
Cascading accepts extra latency on hard cases in exchange for accuracy. The simple stages handle 90%+ of volume cheaply; the expensive models apply only where they add value.
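A sketch of the routing logic for the pipeline above; the field names, feature vector, and thresholds are hypothetical, and the two models are assumed to be already fitted:

```python
def route_transaction(tx, logreg, lgbm):
    """Hypothetical cascade mirroring the fraud pipeline above.
    tx is a feature dict; logreg and lgbm expose predict_proba."""
    if tx["amount"] > 10_000:                        # stage 1: hard rule -> human
        return "human_review"
    features = [[tx["amount"], tx["n_prior_tx"]]]    # hypothetical feature vector
    if logreg.predict_proba(features)[0, 1] > 0.90:  # stage 2: cheap model
        return "block"
    if lgbm.predict_proba(features)[0, 1] > 0.70:    # stage 3: expensive model
        return "review_queue"
    return "approve"
```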
Choosing Between Tree Methods
| Situation | Use |
|---|---|
| First baseline on tabular data | Random Forest (robust, minimal tuning) |
| Maximum accuracy on tabular data | XGBoost or LightGBM with tuning |
| Large dataset (>1M rows) | LightGBM (fastest) |
| Interpretability required | Single decision tree or logistic regression |
| Many categorical features | LightGBM (native categorical support) |
| Imbalanced classes | Gradient boosting with scale_pos_weight |
The reason XGBoost and LightGBM dominate tabular ML is not mysterious: boosting efficiently reduces bias by targeting residuals, subsampling and regularization control variance, and the result generalizes better than any individual model. That’s the whole story.