Machine learning is a large field with a lot of jargon. Before you get lost in the details of any specific algorithm, it helps to have a map of the territory — one that tells you what problems exist, what algorithm families address them, and what all machine learning systems have in common.
This is that map.
The Four Universal Components
Every machine learning problem, regardless of domain or complexity, reduces to the same four elements:
1. Data. The raw material. Its quality determines the ceiling of what any model can achieve. More data generally helps, but only if it’s the right data — representative, clean, and measured at the right granularity.
2. Model. The computational machinery that maps inputs to outputs. A model is a family of functions parameterized by learnable weights.
3. Objective function. The mathematical definition of “good.” Also called the loss function, it quantifies how wrong the model is on training data. The choice of objective encodes your assumptions about what errors matter and how much.
4. Optimization algorithm. The procedure that adjusts model parameters to minimize the objective. Gradient descent and its variants dominate modern ML.
Understanding ML means understanding how these four components interact. When a model underperforms, the problem is almost always traceable to one of them: wrong data, wrong model family, wrong objective, or optimization failure.
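To make the four components concrete, here is a minimal sketch on a toy linear-regression problem. Everything in it (the synthetic data, the learning rate, the iteration count) is illustrative, not prescriptive:

```python
import numpy as np

# 1. Data: 200 noisy examples of y = 3x + 1 (synthetic, for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 1 + rng.normal(scale=0.1, size=200)

# 2. Model: the family of functions f(x) = w*x + b, parameterized by w and b
w, b = 0.0, 0.0

# 4. Optimization: plain gradient descent on the objective
lr = 0.1
for _ in range(500):
    pred = w * X[:, 0] + b
    # 3. Objective: mean squared error; these are its gradients w.r.t. w and b
    grad_w = 2 * np.mean((pred - y) * X[:, 0])
    grad_b = 2 * np.mean(pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should approach roughly 3 and 1
```

Swap any one of the four pieces (different data, a nonlinear model, a different loss, a different optimizer) and you have a different learning system built from the same skeleton.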
Data: What Makes It Useful
Dimensionality
The number of features per example. High-dimensional data is expensive to learn from — the volume of space grows exponentially with dimensions (the curse of dimensionality). Feature selection and dimensionality reduction address this.
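A quick, hedged illustration of why high dimensions hurt: for randomly drawn points, the nearest and farthest neighbors become nearly the same distance away as dimensionality grows, which undermines distance-based reasoning. The sample sizes and dimensions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(1000, d))
    # distances from the first point to all the others
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    # the nearest/farthest ratio creeps toward 1 as d grows
    print(d, round(dists.min() / dists.max(), 3))
```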
Amount
More data reduces variance and relaxes the need for strong assumptions. Deep learning requires large datasets; traditional methods often work well on thousands of examples where deep learning fails to generalize.
Quality
Garbage in, garbage out. Bad data takes several forms:
- Measurement error: features that don’t reflect the real underlying quantity
- Label noise: incorrectly labeled training examples
- Representation bias: training data that systematically underrepresents certain populations
- Distribution shift: training data that doesn’t match the deployment distribution
Data Types
- Nominal/Categorical: unordered, no numeric distance. Requires encoding.
- Ordinal: ordered but distances aren’t meaningful. Handle carefully — numbering ordinal classes doesn’t make them numeric.
- Interval: ordered, equal distances, no meaningful zero.
- Ratio: ordered, equal distances, meaningful zero (sales volume, counts, time).
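A minimal sketch of how the nominal/ordinal distinction plays out in preprocessing, using scikit-learn encoders (the column names and categories are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "red"],        # nominal: no order
    "size": ["small", "large", "medium"],   # ordinal: ordered, distances not meaningful
})

# Nominal features get one-hot encoding, so no artificial order is introduced
onehot = OneHotEncoder().fit_transform(df[["color"]]).toarray()

# Ordinal features get integer codes in the stated order; the codes preserve
# order, but the gaps between them are still not meaningful distances
ordinal = OrdinalEncoder(
    categories=[["small", "medium", "large"]]
).fit_transform(df[["size"]])

print(onehot)
print(ordinal)
```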
Problem Taxonomy
Supervised Learning
You have labeled examples (input → output pairs). The model learns to predict the output for new inputs.
- Regression: the output is a continuous value. House price prediction, demand forecasting, temperature prediction.
- Classification: the output is a category. Spam detection, image classification, churn prediction.
- Tagging/sequence labeling: each element of a sequence gets a label. Named entity recognition, part-of-speech tagging.
- Ranking: order a set of items by relevance. Search engines, recommendation systems.
Unsupervised Learning
No labels. The model finds structure in the data.
- Clustering: group similar examples. Customer segmentation, anomaly detection, document grouping.
- Dimensionality reduction: compress high-dimensional data into fewer dimensions while preserving structure. Visualization, feature extraction, compression.
- Generative modeling: learn the data distribution well enough to sample from it. Useful for data augmentation and understanding.
Reinforcement Learning
An agent learns by interacting with an environment. It receives rewards for good actions and learns a policy that maximizes cumulative reward. Game playing, robotics, recommendation personalization.
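A minimal tabular Q-learning sketch on a made-up five-state corridor, just to show the reward-driven update loop. The environment, rewards, and hyperparameters are all invented; the agent moves left or right and is rewarded only for reaching the rightmost state:

```python
import numpy as np

n_states = 5                       # corridor positions 0..4; state 4 is the goal
actions = (-1, +1)                 # move left or move right
alpha, gamma, eps = 0.5, 0.9, 0.1  # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

# Optimistic initialization: starting all Q-values at 1 encourages exploration
Q = np.ones((n_states, len(actions)))

for _ in range(500):               # episodes
    s = 0
    while s != n_states - 1:
        # epsilon-greedy: mostly exploit current Q-values, occasionally explore
        a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: nudge Q(s, a) toward the reward plus the
        # discounted value of the best action in the next state
        target = r if s_next == n_states - 1 else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(Q)  # the "move right" column should dominate in states 0 through 3
```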
Algorithm Families
Regression Models
Linear regression models the output as a weighted sum of the inputs. Despite their simplicity, linear models are powerful baselines and often competitive when features are engineered well.
Key variants:
- Ordinary Least Squares: minimizes squared error
- Ridge (L2 regularization): shrinks weights toward zero, handles multicollinearity
- Lasso (L1 regularization): induces sparsity, performs implicit feature selection
- Elastic Net: combines L1 and L2, balances sparsity and grouping of correlated features
- Logistic Regression: linear model for classification, outputs a probability via sigmoid
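A hedged sketch comparing the regression variants on synthetic data with scikit-learn; the dataset shape and regularization strengths are arbitrary, chosen only to make the sparsity effect visible:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Synthetic data: 100 examples, 20 features, only 5 of which are informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0),
              Lasso(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    n_zero = int(np.sum(np.abs(model.coef_) < 1e-6))
    print(type(model).__name__, "zeroed coefficients:", n_zero)

# Lasso and Elastic Net typically zero out many of the uninformative features;
# OLS and Ridge keep all 20 coefficients nonzero.
```

Logistic regression follows the same fit/predict pattern via `LogisticRegression`, but on classification targets rather than continuous ones.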
Tree-Based Models
Partition the feature space with axis-aligned splits. Interpretable, handle mixed data types, require no feature scaling.
- Decision Trees: single tree, prone to overfitting at depth
- Random Forest: ensemble of trees with row and column sampling, reduces variance
- Gradient Boosted Trees (XGBoost, LightGBM): trees added sequentially to correct residual errors, state of the art on tabular data
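A minimal sketch using scikit-learn's histogram-based gradient boosting as a stand-in for XGBoost/LightGBM, next to a random forest on the same synthetic data (dataset and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (RandomForestClassifier(n_estimators=200, random_state=0),
              HistGradientBoostingClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "test accuracy:", model.score(X_test, y_test))
```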
Distance-Based Models
- k-Nearest Neighbors: classify by majority vote of k nearest training examples. Simple, no training phase, expensive at inference.
- Support Vector Machines: find the maximum-margin hyperplane separating classes. With kernels, can handle nonlinear boundaries.
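A short sketch of both on the same nonlinear synthetic problem; note the feature-scaling step, which distance- and margin-based models are sensitive to (the pipeline details and hyperparameters are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # RBF kernel handles the curved boundary

for model in (knn, svm):
    model.fit(X, y)
    print(type(model[-1]).__name__, "training accuracy:", model.score(X, y))
```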
Probabilistic Models
- Naive Bayes: assumes features are conditionally independent given the class. Fast, interpretable, surprisingly effective for text.
- Bayesian Networks: encode probabilistic dependencies between variables as a directed acyclic graph.
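A compact Naive Bayes sketch on toy text; the example sentences and labels are invented, and a real spam filter would of course need far more data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["free money now", "meeting at noon", "win a free prize", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts feed a multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["free lunch prize"]))
```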
Neural Networks
Universal function approximators composed of stacked nonlinear transformations. Excel on unstructured data (images, text, audio). Require large datasets and significant compute.
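A minimal sketch with scikit-learn's multilayer perceptron, just to show the stacked nonlinear transformations in code; the layer sizes are arbitrary, and serious image/text/audio work typically uses PyTorch, TensorFlow, or JAX instead:

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)

# Two hidden layers of 32 units each, with ReLU nonlinearities between them
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), activation="relu",
                    max_iter=1000, random_state=0)
mlp.fit(X, y)
print("training accuracy:", mlp.score(X, y))
```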
Clustering Algorithms
- k-Means: partition data into k clusters by minimizing within-cluster variance
- DBSCAN: density-based, discovers clusters of arbitrary shape, robust to outliers
- Hierarchical Clustering: builds a tree of nested clusters, doesn’t require pre-specifying k
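A hedged sketch contrasting k-Means and DBSCAN on data with non-convex clusters; the `eps` and `min_samples` values are illustrative and usually need tuning per dataset:

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# k-Means cuts the two moons with a straight boundary; DBSCAN, being
# density-based, can recover each moon as one arbitrarily shaped cluster
# (points labeled -1 by DBSCAN are treated as outliers).
print(set(kmeans_labels), set(dbscan_labels))
```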
Dimensionality Reduction
- PCA: projects data onto directions of maximum variance. Linear.
- t-SNE: nonlinear, good for visualization. Preserves local neighborhoods but not global structure.
- UMAP: faster than t-SNE, better preserves global structure.
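A short PCA sketch on a standard 64-dimensional digits dataset; t-SNE is also in scikit-learn, while UMAP lives in the separate umap-learn package. Projecting to 2 components here is just for visualization:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64-dimensional digit images

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)
print("variance explained by 2 components:", pca.explained_variance_ratio_.sum())
```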
The Bias-Variance Tradeoff
Every model makes a tradeoff between two types of error:
Bias is systematic error from oversimplified assumptions. A linear model fit to nonlinear data has high bias — it can’t capture the pattern regardless of how much data you give it.
Variance is sensitivity to the specific training set. A deep tree memorizes training data perfectly but fails on new examples — it has high variance.
The total expected error is:
Error = Bias² + Variance + Irreducible Noise
You can’t eliminate irreducible noise. The goal is to find the model complexity that minimizes the sum of bias and variance.
High bias (underfitting): model is too simple. Fix by adding complexity, engineering richer features, or choosing a more flexible model family.
High variance (overfitting): model is too complex. Fix by regularization, more data, dropout, early stopping, or ensembling.
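A hedged sketch of the tradeoff using polynomial degree as the complexity knob; the data, sample size, and degrees are made up to make the effect visible:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, "train R2:", round(model.score(X_train, y_train), 2),
          "test R2:", round(model.score(X_test, y_test), 2))

# Degree 1 scores poorly on both sets (high bias); very high degrees tend to
# fit the training set much better than the test set (high variance); a
# middle degree balances the two.
```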
When to Use What
| Situation | Recommended approach |
|---|---|
| Tabular data, medium scale | Gradient boosted trees (XGBoost/LightGBM) |
| Tabular data, small scale | Linear models or random forests |
| Need interpretability | Linear models or single decision trees |
| Images, audio, video | CNNs, Transformers |
| Text, sequences | Transformers, RNNs |
| Clustering without labels | k-Means (convex clusters), DBSCAN (arbitrary shape) |
| High-dimensional features | PCA first, then any model |
| Very little data | Linear models with strong regularization |
| Lots of missing data | Gradient boosted trees (XGBoost and LightGBM handle missing values natively) |
The map doesn’t make decisions for you — domain knowledge and experimentation do. But having the map means you start in the right neighborhood.