Machine learning is a large field with a lot of jargon. Before you get lost in the details of any specific algorithm, it helps to have a map of the territory — one that tells you what problems exist, what algorithm families address them, and what all machine learning systems have in common.
This is that map.
The Four Universal Components
Every machine learning problem, regardless of domain or complexity, reduces to the same four elements:
1. Data. The raw material. Its quality determines the ceiling of what any model can achieve. More data generally helps, but only if it’s the right data — representative, clean, and measured at the right granularity.
2. Model. The computational machinery that maps inputs to outputs. A model is a family of functions parameterized by learnable weights.
3. Objective function. The mathematical definition of “good.” Also called the loss function, it quantifies how wrong the model is on training data. The choice of objective encodes your assumptions about what errors matter and how much.
4. Optimization algorithm. The procedure that adjusts model parameters to minimize the objective. Gradient descent and its variants dominate modern ML.
Understanding ML means understanding how these four components interact. When a model underperforms, the problem is almost always traceable to one of them: wrong data, wrong model family, wrong objective, or optimization failure.
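To make the four components concrete, here is a minimal sketch on a toy linear-regression problem. Everything in it (the synthetic data, the learning rate, the iteration count) is illustrative, not prescriptive:

```python
import numpy as np

# 1. Data: 200 noisy examples of y = 3x + 1 (synthetic, for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 1 + rng.normal(scale=0.1, size=200)

# 2. Model: the family of functions f(x) = w*x + b, parameterized by w and b
w, b = 0.0, 0.0

# 4. Optimization: plain gradient descent on the objective
lr = 0.1
for _ in range(500):
    pred = w * X[:, 0] + b
    # 3. Objective: mean squared error; these are its gradients w.r.t. w and b
    grad_w = 2 * np.mean((pred - y) * X[:, 0])
    grad_b = 2 * np.mean(pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should approach roughly 3 and 1
```

Swap any one of the four pieces (different data, a nonlinear model, a different loss, a different optimizer) and you have a different learning system built from the same skeleton.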
Data: What Makes It Useful
Dimensionality
The number of features per example. High-dimensional data is expensive to learn from — the volume of space grows exponentially with dimensions (the curse of dimensionality). Feature selection and dimensionality reduction address this.
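A quick, hedged illustration of why high dimensions hurt: for randomly drawn points, the nearest and farthest neighbors become nearly the same distance away as dimensionality grows, which undermines distance-based reasoning. The sample sizes and dimensions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(1000, d))
    # distances from the first point to all the others
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    # the nearest/farthest ratio creeps toward 1 as d grows
    print(d, round(dists.min() / dists.max(), 3))
```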
Amount
More data reduces variance and relaxes the need for strong assumptions. Deep learning requires large datasets; traditional methods often work well on thousands of examples where deep learning fails to generalize.
Quality
Garbage in, garbage out. Bad data takes several forms:
- Measurement error: features that don’t reflect the real underlying quantity
- Label noise: incorrectly labeled training examples
- Representation bias: training data that systematically underrepresents certain populations
- Distribution shift: training data that doesn’t match the deployment distribution
Data Types
- Nominal/Categorical: unordered, no numeric distance. Requires encoding.
- Ordinal: ordered but distances aren’t meaningful. Handle carefully — numbering ordinal classes doesn’t make them numeric.
- Interval: ordered, equal distances, no meaningful zero.
- Ratio: ordered, equal distances, meaningful zero (sales volume, counts, time).
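A minimal sketch of how the nominal/ordinal distinction plays out in preprocessing, using scikit-learn encoders (the column names and categories are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "red"],        # nominal: no order
    "size": ["small", "large", "medium"],   # ordinal: ordered, distances not meaningful
})

# Nominal features get one-hot encoding, so no artificial order is introduced
onehot = OneHotEncoder().fit_transform(df[["color"]]).toarray()

# Ordinal features get integer codes in the stated order; the codes preserve
# order, but the gaps between them are still not meaningful distances
ordinal = OrdinalEncoder(
    categories=[["small", "medium", "large"]]
).fit_transform(df[["size"]])

print(onehot)
print(ordinal)
```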
Problem Taxonomy
Supervised Learning
You have labeled examples (input → output pairs). The model learns to predict the output for new inputs.
- Regression: the output is a continuous value. House price prediction, demand forecasting, temperature prediction.
- Classification: the output is a category. Spam detection, image classification, churn prediction.
- Tagging/sequence labeling: each element of a sequence gets a label. Named entity recognition, part-of-speech tagging.
- Ranking: order a set of items by relevance. Search engines, recommendation systems.
Unsupervised Learning
No labels. The model finds structure in the data.
- Clustering: group similar examples. Customer segmentation, anomaly detection, document grouping.
- Dimensionality reduction: compress high-dimensional data into fewer dimensions while preserving structure. Visualization, feature extraction, compression.
- Generative modeling: learn the data distribution well enough to sample from it. Useful for data augmentation and understanding.
Reinforcement Learning
An agent learns by interacting with an environment. It receives rewards for good actions and learns a policy that maximizes cumulative reward. Game playing, robotics, recommendation personalization.
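A minimal tabular Q-learning sketch on a made-up five-state corridor, just to show the reward-driven update loop. The environment, rewards, and hyperparameters are all invented; the agent moves left or right and is rewarded only for reaching the rightmost state:

```python
import numpy as np

n_states = 5                       # corridor positions 0..4; state 4 is the goal
actions = (-1, +1)                 # move left or move right
alpha, gamma, eps = 0.5, 0.9, 0.1  # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

# Optimistic initialization: starting all Q-values at 1 encourages exploration
Q = np.ones((n_states, len(actions)))

for _ in range(500):               # episodes
    s = 0
    while s != n_states - 1:
        # epsilon-greedy: mostly exploit current Q-values, occasionally explore
        a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: nudge Q(s, a) toward the reward plus the
        # discounted value of the best action in the next state
        target = r if s_next == n_states - 1 else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(Q)  # the "move right" column should dominate in states 0 through 3
```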
Algorithm Families
Regression Models
Linear regression models the output as a weighted sum of the inputs. Despite their simplicity, linear models are powerful baselines and often competitive when features are engineered well.
Key variants:
- Ordinary Least Squares: minimizes squared error
- Ridge (L2 regularization): shrinks weights toward zero, handles multicollinearity
- Lasso (L1 regularization): induces sparsity, performs implicit feature selection
- Elastic Net: combines L1 and L2, balances sparsity and grouping of correlated features
- Logistic Regression: linear model for classification, outputs a probability via sigmoid
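A hedged sketch comparing the regression variants on synthetic data with scikit-learn; the dataset shape and regularization strengths are arbitrary, chosen only to make the sparsity effect visible:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Synthetic data: 100 examples, 20 features, only 5 of which are informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0),
              Lasso(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    n_zero = int(np.sum(np.abs(model.coef_) < 1e-6))
    print(type(model).__name__, "zeroed coefficients:", n_zero)

# Lasso and Elastic Net typically zero out many of the uninformative features;
# OLS and Ridge keep all 20 coefficients nonzero.
```

Logistic regression follows the same fit/predict pattern via `LogisticRegression`, but on classification targets rather than continuous ones.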
Tree-Based Models
Partition the feature space with axis-aligned splits. Interpretable, handle mixed data types, require no feature scaling.
- Decision Trees: single tree, prone to overfitting at depth
- Random Forest: ensemble of trees with row and column sampling, reduces variance
- Gradient Boosted Trees (XGBoost, LightGBM): trees added sequentially to correct residual errors, state of the art on tabular data
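A minimal sketch using scikit-learn's histogram-based gradient boosting as a stand-in for XGBoost/LightGBM, next to a random forest on the same synthetic data (dataset and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (RandomForestClassifier(n_estimators=200, random_state=0),
              HistGradientBoostingClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "test accuracy:", model.score(X_test, y_test))
```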
Distance-Based Models
- k-Nearest Neighbors: classify by majority vote of k nearest training examples. Simple, no training phase, expensive at inference.
- Support Vector Machines: find the maximum-margin hyperplane separating classes. With kernels, can handle nonlinear boundaries.
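A short sketch of both on the same nonlinear synthetic problem; note the feature-scaling step, which distance- and margin-based models are sensitive to (the pipeline details and hyperparameters are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # RBF kernel handles the curved boundary

for model in (knn, svm):
    model.fit(X, y)
    print(type(model[-1]).__name__, "training accuracy:", model.score(X, y))
```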
Probabilistic Models
- Naive Bayes: assumes features are conditionally independent given the class. Fast, interpretable, surprisingly effective for text.
- Bayesian Networks: encode probabilistic dependencies between variables as a directed acyclic graph.
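A compact Naive Bayes sketch on toy text; the example sentences and labels are invented, and a real spam filter would of course need far more data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["free money now", "meeting at noon", "win a free prize", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts feed a multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["free lunch prize"]))
```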
Neural Networks
Universal function approximators composed of stacked nonlinear transformations. Excel on unstructured data (images, text, audio). Require large datasets and significant compute.
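A minimal sketch with scikit-learn's multilayer perceptron, just to show the stacked nonlinear transformations in code; the layer sizes are arbitrary, and serious image/text/audio work typically uses PyTorch, TensorFlow, or JAX instead:

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)

# Two hidden layers of 32 units each, with ReLU nonlinearities between them
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), activation="relu",
                    max_iter=1000, random_state=0)
mlp.fit(X, y)
print("training accuracy:", mlp.score(X, y))
```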
Clustering Algorithms
- k-Means: partition data into k clusters by minimizing within-cluster variance
- DBSCAN: density-based, discovers clusters of arbitrary shape, robust to outliers
- Hierarchical Clustering: builds a tree of nested clusters, doesn’t require pre-specifying k
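A hedged sketch contrasting k-Means and DBSCAN on data with non-convex clusters; the `eps` and `min_samples` values are illustrative and usually need tuning per dataset:

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# k-Means cuts the two moons with a straight boundary; DBSCAN, being
# density-based, can recover each moon as one arbitrarily shaped cluster
# (points labeled -1 by DBSCAN are treated as outliers).
print(set(kmeans_labels), set(dbscan_labels))
```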
Dimensionality Reduction
- PCA: projects data onto directions of maximum variance. Linear.
- t-SNE: nonlinear, good for visualization. Preserves local neighborhoods but not global structure.
- UMAP: faster than t-SNE, better preserves global structure.
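A short PCA sketch on a standard 64-dimensional digits dataset; t-SNE is also in scikit-learn, while UMAP lives in the separate umap-learn package. Projecting to 2 components here is just for visualization:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64-dimensional digit images

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)
print("variance explained by 2 components:", pca.explained_variance_ratio_.sum())
```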
The Bias-Variance Tradeoff
Every model makes a tradeoff between two types of error:
Bias is systematic error from oversimplified assumptions. A linear model fit to nonlinear data has high bias — it can’t capture the pattern regardless of how much data you give it.
Variance is sensitivity to the specific training set. A deep tree memorizes training data perfectly but fails on new examples — it has high variance.
The total expected error is:
Error = Bias² + Variance + Irreducible Noise
You can’t eliminate irreducible noise. The goal is to find the model complexity that minimizes the sum of bias and variance.
High bias (underfitting): model is too simple. Fix by adding complexity, engineering richer features, or choosing a more flexible model family.
High variance (overfitting): model is too complex. Fix by regularization, more data, dropout, early stopping, or ensembling.
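A hedged sketch of the tradeoff using polynomial degree as the complexity knob; the data, sample size, and degrees are made up to make the effect visible:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, "train R2:", round(model.score(X_train, y_train), 2),
          "test R2:", round(model.score(X_test, y_test), 2))

# Degree 1 scores poorly on both sets (high bias); very high degrees tend to
# fit the training set much better than the test set (high variance); a
# middle degree balances the two.
```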
When to Use What
| Situation | Recommended approach |
|---|---|
| Tabular data, medium scale | Gradient boosted trees (XGBoost/LightGBM) |
| Tabular data, small scale | Linear models or random forests |
| Need interpretability | Linear models or single decision trees |
| Images, audio, video | CNNs, Transformers |
| Text, sequences | Transformers, RNNs |
| Clustering without labels | k-Means (convex clusters), DBSCAN (arbitrary shape) |
| High-dimensional features | PCA first, then any model |
| Very little data | Linear models with strong regularization |
| Lots of missing data | Gradient boosted trees (XGBoost and LightGBM handle missing values natively) |
The map doesn’t make decisions for you — domain knowledge and experimentation do. But having the map means you start in the right neighborhood.