
ML Taxonomy and Building Blocks

A reference-first guide to the full landscape of machine learning — problem types, algorithm families, and the four universal components that every ML system shares.

Machine learning is a large field with a lot of jargon. Before you get lost in the details of any specific algorithm, it helps to have a map of the territory — one that tells you what problems exist, what algorithm families address them, and what all machine learning systems have in common.

This is that map.

The Four Universal Components

Every machine learning problem, regardless of domain or complexity, reduces to the same four elements:

1. Data. The raw material. Its quality determines the ceiling of what any model can achieve. More data generally helps, but only if it’s the right data — representative, clean, and measured at the right granularity.

2. Model. The computational machinery that maps inputs to outputs. A model is a family of functions parameterized by learnable weights.

3. Objective function. The mathematical definition of “good.” Also called the loss function, it quantifies how wrong the model is on training data. The choice of objective encodes your assumptions about what errors matter and how much.

4. Optimization algorithm. The procedure that adjusts model parameters to minimize the objective. Gradient descent and its variants dominate modern ML.

Understanding ML means understanding how these four components interact. When a model underperforms, the problem is almost always traceable to one of them: wrong data, wrong model family, wrong objective, or optimization failure.
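To make the interaction concrete, here is a minimal sketch of all four components in one loop: a toy one-dimensional linear regression trained by hand-rolled gradient descent. The synthetic data, learning rate, and iteration count are illustrative assumptions, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Data: inputs X and noisy targets y (synthetic, for illustration).
X = rng.uniform(-1, 1, size=100)
y = 3.0 * X + 0.5 + rng.normal(0, 0.1, size=100)

# 2. Model: the family of functions f(x) = w*x + b with learnable w, b.
w, b = 0.0, 0.0

# 3. Objective: mean squared error between predictions and targets.
# 4. Optimization: gradient descent on that objective.
lr = 0.1
for _ in range(500):
    pred = w * X + b
    grad_w = 2 * np.mean((pred - y) * X)   # d(MSE)/dw
    grad_b = 2 * np.mean(pred - y)         # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # should approach 3.0 and 0.5
```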

Data: What Makes It Useful

Dimensionality

The number of features per example. High-dimensional data is expensive to learn from — the volume of space grows exponentially with dimensions (the curse of dimensionality). Feature selection and dimensionality reduction address this.
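One way to see the curse in action is to watch pairwise distances concentrate as dimensions grow. The sketch below (numpy; the sample sizes are arbitrary) prints the ratio of the nearest to the farthest neighbor of one random point.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(500, d))               # 500 random points in [0,1]^d
    dists = np.linalg.norm(points - points[0], axis=1)[1:]
    print(f"d={d:5d}  min/max distance ratio: {dists.min() / dists.max():.3f}")
# The ratio climbs toward 1.0 as d grows: "nearest" and "farthest"
# become nearly indistinguishable, so distance loses discriminative power.
```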

Amount

More data reduces variance and relaxes the need for strong assumptions. Deep learning requires large datasets; traditional methods often work well on thousands of examples where deep learning fails to generalize.

Quality

Garbage in, garbage out. Bad data takes several forms: mislabeled examples, missing values, duplicated records, measurement noise, and sampling bias (training data drawn from a different distribution than the one the model will face in production).

Data Types

Data comes in a few broad forms: tabular (rows of numeric and categorical features), text, images, audio, video, and time series. The form largely determines which algorithm families apply; tabular data favors linear models and trees, while unstructured data favors neural networks.

Problem Taxonomy

Supervised Learning

You have labeled examples (input → output pairs). The model learns to predict the output for new inputs. The two main flavors are classification (discrete outputs) and regression (continuous outputs).
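As a minimal illustration (scikit-learn and its bundled iris dataset are assumptions here, not the article's prescription): fit on labeled pairs, score on held-out pairs.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                     # inputs and labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Learn the input -> output mapping from labeled pairs.
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```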

Unsupervised Learning

No labels. The model finds structure in the data on its own: clusters, low-dimensional representations, outliers.

Reinforcement Learning

An agent learns by interacting with an environment. It receives rewards for good actions and learns a policy that maximizes cumulative reward. Applications include game playing, robotics, and recommendation personalization.
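Full RL involves states and sequential decisions, but the core try-observe-update loop already appears in the multi-armed bandit below, a deliberately stripped-down sketch (numpy; the reward values and the 10% exploration rate are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.2, 0.5, 0.8])   # unknown to the agent
estimates = np.zeros(3)                    # agent's running reward estimates
counts = np.zeros(3)

for step in range(1000):
    if rng.random() < 0.1:                 # explore: random action 10% of the time
        action = rng.integers(3)
    else:                                  # exploit: best current estimate
        action = int(np.argmax(estimates))
    reward = rng.normal(true_rewards[action], 0.1)
    counts[action] += 1
    # Incremental mean update of the chosen action's estimate.
    estimates[action] += (reward - estimates[action]) / counts[action]

print("estimated rewards:", estimates.round(2))  # should approach the true values
```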

Algorithm Families

Regression Models

Linear regression models the output as a weighted sum of inputs. Despite their simplicity, linear models are powerful baselines and often competitive when features are engineered well.

Key variants: ridge regression adds an L2 penalty that shrinks coefficients; lasso adds an L1 penalty that drives weak coefficients exactly to zero; elastic net combines both; and logistic regression passes the weighted sum through a sigmoid for classification.
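A short side-by-side sketch of those variants (scikit-learn; the synthetic problem with three informative and seven irrelevant features is an assumption):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# True coefficients: three informative features, seven irrelevant ones.
y = X @ np.array([2, -1, 0, 0, 0, 0, 0, 0, 0, 3]) + rng.normal(0, 0.1, 200)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_.round(2))
# Lasso drives the irrelevant coefficients to exactly zero; ridge only shrinks them.
```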

Tree-Based Models

Partition the feature space with axis-aligned splits. Interpretable, handle mixed data types, require no feature scaling. Single decision trees are the base unit; random forests and gradient boosting combine many trees for accuracy.
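A small sketch contrasting a single tree with a boosted ensemble (scikit-learn's implementations; XGBoost or LightGBM would be the usual drop-in choices on real workloads):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One deep tree vs. many shallow trees fit to each other's residual errors.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
boost = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("single tree:      ", tree.score(X_te, y_te))
print("boosted ensemble: ", boost.score(X_te, y_te))
```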

Distance-Based Models

Predict by similarity to stored training examples; k-nearest neighbors is the canonical member. There is essentially no training phase, but every prediction compares against the training set, features must be scaled, and performance degrades in high dimensions.
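A minimal k-NN sketch (scikit-learn and its bundled wine dataset are assumptions); note the scaling step, since distance-based methods are sensitive to feature scale:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale features first, then classify by majority vote of the 5 nearest neighbors.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)
print("accuracy:", knn.score(X_te, y_te))
```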

Probabilistic Models

Treat the data as draws from a probability distribution. Naive Bayes (classification) and Gaussian mixture models (density estimation and clustering) are canonical members; both output probabilities rather than bare labels, which makes uncertainty explicit.
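A short naive Bayes sketch (scikit-learn's GaussianNB, an assumption): the model fits per-class Gaussians over each feature and classifies via Bayes' rule.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)
print("accuracy:", nb.score(X_te, y_te))
# Probabilistic models give you a distribution over classes, not just a label.
print("class probabilities for one example:", nb.predict_proba(X_te[:1]).round(3))
```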

Neural Networks

Universal function approximators composed of stacked nonlinear transformations. Excel on unstructured data (images, text, audio). Require large datasets and significant compute.
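A compact sketch of "stacked nonlinear transformations" (scikit-learn's MLPClassifier for brevity; real image or text work would use a deep-learning framework, and the layer sizes here are arbitrary assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)        # 8x8 digit images, flattened to 64 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two hidden layers: each is an affine transform followed by a nonlinearity.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(X_tr, y_tr)
print("accuracy:", mlp.score(X_te, y_te))
```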

Clustering Algorithms

Group examples by similarity without labels. k-Means assumes roughly convex, similarly sized clusters and needs k specified up front; DBSCAN finds arbitrarily shaped, density-based clusters and marks sparse points as noise; hierarchical clustering builds a tree of nested groupings.
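A clustering sketch on synthetic blobs (scikit-learn; the DBSCAN eps and min_samples values are illustrative assumptions tuned to this toy data):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # k given up front
db = DBSCAN(eps=0.8, min_samples=5).fit(X)                    # k discovered from density
print("k-Means clusters:", len(set(km.labels_)))
print("DBSCAN clusters (label -1 = noise):", len(set(db.labels_) - {-1}))
```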

Dimensionality Reduction

Compress high-dimensional features into fewer dimensions while preserving structure. PCA finds the linear directions of maximal variance; t-SNE and UMAP produce nonlinear embeddings, mostly used for visualization.
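A PCA sketch (scikit-learn's digits dataset, an assumption): project 64-dimensional images onto their top components and check how much variance a handful of directions already captures.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 64 features per example
pca = PCA(n_components=10).fit(X)
print("variance explained by 10 of 64 components:",
      round(pca.explained_variance_ratio_.sum(), 3))
```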

The Bias-Variance Tradeoff

Every model makes a tradeoff between two types of error:

Bias is systematic error from oversimplified assumptions. A linear model fit to nonlinear data has high bias — it can’t capture the pattern regardless of how much data you give it.

Variance is sensitivity to the specific training set. A deep tree memorizes training data perfectly but fails on new examples — it has high variance.

The total expected error is:

Error = Bias² + Variance + Irreducible Noise

You can’t eliminate irreducible noise. The goal is to find the model complexity that minimizes the sum of bias and variance.

High bias (underfitting): model is too simple. Fix by adding complexity, engineering richer features, or choosing a more flexible model family.

High variance (overfitting): model is too complex. Fix with regularization, more data, dropout, early stopping, or ensembling.
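The tradeoff is easy to reproduce numerically. The sketch below (scikit-learn; the sine-wave data, noise level, and polynomial degrees are illustrative assumptions) fits polynomials of increasing degree and watches training error fall while test error bottoms out and then climbs as variance takes over.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 40)   # noisy training data
X_te = np.linspace(0, 1, 200).reshape(-1, 1)
y_te = np.sin(2 * np.pi * X_te.ravel())                      # noise-free test targets

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_err = np.mean((model.predict(X) - y) ** 2)
    test_err = np.mean((model.predict(X_te) - y_te) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
# degree 1 underfits (high bias); degree 15 overfits (high variance).
```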

When to Use What

| Situation | Recommended approach |
| --- | --- |
| Tabular data, medium scale | Gradient boosted trees (XGBoost/LightGBM) |
| Tabular data, small scale | Linear models or random forests |
| Need interpretability | Linear models or single decision trees |
| Images, audio, video | CNNs, Transformers |
| Text, sequences | Transformers, RNNs |
| Clustering without labels | k-Means (convex clusters), DBSCAN (arbitrary shapes) |
| High-dimensional features | PCA first, then any model |
| Very little data | Linear models with strong regularization |
| Lots of missing data | Tree-based models (handle missing values natively) |

The map doesn’t make decisions for you — domain knowledge and experimentation do. But having the map means you start in the right neighborhood.
