Machine Learning

Most machine learning work fails for reasons that have nothing to do with the model. It fails in the features, in the validation, and in the gap between a notebook that scores well and a system that holds up on next week's data. These pieces are the working knowledge behind models that actually run.

The pipeline is the product

A model is one component in a longer chain. I map that chain in two pieces, The 8-Layer Data Science Pipeline and Machine Learning Taxonomy and Building Blocks. It runs from problem framing through data, features, model, and evaluation to the decision the output is meant to inform. Most teams over-invest in the model layer and under-invest in the layers on either side of it, features and evaluation.

Where models actually break

If I had to name the one skill that separates good models from bad ones, it is feature engineering, which is why I wrote Feature Engineering: The Skill That Separates Good Models from Bad Ones. The second is honest validation. Overfitting Is Not a Model Problem, It's a Thinking Problem argues that overfitting is a discipline failure before it is a technical one, and Losses and Metrics in Machine Learning is about choosing the objective that matches the decision, not the one that is convenient to optimize.
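A minimal sketch of what honest validation catches, using only the Python standard library. The synthetic data, the 30% label noise, and the helper names (make_data, nearest_label, accuracy) are assumptions made for illustration and are not taken from the articles above. A 1-nearest-neighbour model memorizes its training set, so its training score says nothing about how it does on data it has not seen.

```python
import random

def make_data(n, rng, noise=0.3):
    """Synthetic data (assumed): the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = x > 0.5
        if rng.random() < noise:
            y = not y  # label noise that no model can learn
        data.append((x, y))
    return data

def nearest_label(x, train):
    """Return the label of the training point closest to x (1-nearest-neighbour)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def accuracy(data, train):
    """Fraction of points in `data` whose label matches the 1-NN prediction."""
    hits = sum(nearest_label(x, train) == y for x, y in data)
    return hits / len(data)

rng = random.Random(0)            # fixed seed so the run is repeatable
train = make_data(200, rng)
held_out = make_data(200, rng)    # drawn from the same process, never seen in training

train_acc = accuracy(train, train)        # each point is its own nearest neighbour
held_out_acc = accuracy(held_out, train)  # the number that matters

print(f"train accuracy:    {train_acc:.2f}")
print(f"held-out accuracy: {held_out_acc:.2f}")
```

The training score is perfect by construction because the model has memorized the noise. Only the held-out score reflects what the model would do on new data, which is why the split has to exist before any modelling starts.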

Foundations worth having cold

Good intuition for the core methods pays off everywhere. I have written first-principles explanations of linear models, decision trees and ensembles, support vector machines, clustering, dimensionality reduction, and the neural network training playbook. The common thread is geometry and assumptions first, library calls second.

Thinking in probability, and explaining the result

Two habits separate practitioners from people who run scripts. The first is treating probability as an operating system for decisions rather than a formula applied at the end. The second is being able to say why a model did what it did, the subject of Explainable AI in Practice. For specialized problems I have also written on anomaly detection, count data and probabilistic forecasting, and scaling across data and compute, along with deeper dives on differentiation in TensorFlow, deep learning for image tasks, and useful concepts such as calibration and RANSAC.
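One way to read "probability as an operating system for decisions" is as an expected-cost rule. The sketch below is illustrative only: the function names and the cost values are hypothetical, and it assumes the predicted probability is reasonably well calibrated. Acting on a case that turns out negative costs cost_false_positive. Not acting on a case that turns out positive costs cost_false_negative.

```python
def decision_threshold(cost_false_positive, cost_false_negative):
    """Probability above which acting has lower expected cost than not acting."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)

def should_act(p, cost_false_positive, cost_false_negative):
    """Compare the expected cost of acting with the expected cost of not acting."""
    expected_cost_act = (1 - p) * cost_false_positive   # wrong only if the case is negative
    expected_cost_wait = p * cost_false_negative        # wrong only if the case is positive
    return expected_cost_act < expected_cost_wait

# Hypothetical costs: a missed positive is nine times worse than a false alarm.
threshold = decision_threshold(1.0, 9.0)
print(f"act when p > {threshold:.2f}")
print(should_act(0.20, 1.0, 9.0))   # above the threshold
print(should_act(0.05, 1.0, 9.0))   # below the threshold
```

With these costs the break-even point is far below 0.5, so a default 0.5 cutoff would leave most of the value on the table. The threshold comes from the decision, not from the model.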

This is the work behind Production Machine Learning & Data Infrastructure, and you can see it running in production in the supply-chain forecasting case study.

All articles in this topic

Anomaly Detection: A Practical Framework

Explainable AI in Practice

Losses and Metrics in Machine Learning

Neural Network Training Playbook

Overfitting Is Not a Model Problem, It's a Thinking Problem

Count Data Models and Probabilistic Forecasting

Probability as an Operating System for Better Decisions

Decision Trees and Ensembles: Intuition First

Machine Learning Taxonomy and Building Blocks

Feature Engineering: The Skill That Separates Good Models from Bad Ones

The 8-Layer Data Science Pipeline

Deep Learning for Image Tasks: Detection vs. Segmentation

Differentiation in TensorFlow: GradientTape and Custom Training Loops

Scaling Machine Learning: Data, Compute, and Systems

Useful Machine Learning Concepts: Calibration, RANSAC, and the Loss Minimization Framework

Dimensionality Reduction: PCA, t-SNE, and UMAP

Clustering: Algorithms, Tradeoffs, and When to Use Each

Support Vector Machines: Geometry, Kernels, and Practical Tradeoffs

Linear Models: Regression, Loss Functions, and the Gaussian Assumption

Related case studies

7 Production Forecasting Models Driving Replenishment & Markdown Decisions for Blue Yonder's Enterprise Retailers (5TB+)

Temporal Attention for Link Prediction on Dynamic Graphs - 86% AUC at NUS

Predicting Missing Friendship Links from Social Graph Structure

Detecting Semantically Duplicate Questions Despite Different Wording (Quora Question Pairs)

Classifying Cancer Mutations from Clinical Text (MSKCC Challenge)

Why TCIA Cancer Imaging Won't Carry a Clinical Screening Tool: A Feasibility Study

Predicting Hydroponic Crop Yield from Sensor Data - and Turning It Into Planting Decisions
