
Feature Engineering: The Skill That Separates Good Models from Bad Ones

A practitioner's guide to feature engineering — the craft of transforming raw data into model-ready representations that capture what actually matters for the prediction task.

Feature engineering is the work of transforming raw data into the representations a model can use to learn the patterns that matter. It’s also one of the least glamorous and most impactful things you can do in a data science project.

A dataset of raw transactions is not useful to a classifier. A dataset of engineered features — recency, frequency, monetary value; time since last purchase; category distribution; anomaly flags — can be powerful. The model doesn’t create these insights. You do.

Why Features Matter More Than Models

The choice of model (XGBoost vs. LightGBM vs. neural network) matters far less than the quality of your feature representation. Here’s why:

On tabular data, a gradient-boosted tree with well-engineered features will consistently outperform a deep neural network fed poor features. The model is a lens — it can only focus on what you put in front of it. If your features don’t contain the relevant information, no model complexity will recover it.

The corollary: feature engineering is where domain knowledge becomes competitive advantage. A data scientist who understands the supply chain can engineer features that capture supplier lead-time variability, seasonal demand patterns, and stockout risk in ways that a generalist cannot. The model doesn’t know what a supply chain is. You do.

The Taxonomy of Feature Types

Raw Features

The original columns in your dataset. Sometimes useful directly; more often, they need transformation.

Transformed Features

Applying mathematical transformations to raw features:
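
A minimal sketch of three common transformations (log for skew, binning, standardization), assuming a pandas DataFrame with hypothetical amount and age columns:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: a skewed monetary amount and a customer age.
df = pd.DataFrame({"amount": [12.0, 950.0, 30.5, 15000.0], "age": [23, 41, 35, 67]})

# Log transform tames heavy right tails so the model sees a more symmetric distribution.
df["log_amount"] = np.log1p(df["amount"])

# Binning turns a continuous value into coarse, robust categories.
df["age_bucket"] = pd.cut(df["age"], bins=[0, 25, 40, 60, 120],
                          labels=["<25", "25-40", "40-60", "60+"])

# Standardization puts features on a comparable scale for distance- or gradient-based models.
df["amount_z"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()
```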

Interaction Features

Products and ratios of two or more features:
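
For instance, a sketch with hypothetical price, quantity, and income columns:

```python
import pandas as pd

# Hypothetical order-level data.
df = pd.DataFrame({
    "price": [4.99, 12.50, 3.25],
    "quantity": [3, 1, 10],
    "income": [52_000, 87_000, 61_000],
})

# Product interaction: total revenue of the line item.
df["revenue"] = df["price"] * df["quantity"]

# Ratio interaction: how large the purchase is relative to income.
df["revenue_to_income"] = df["revenue"] / df["income"]
```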

Interaction features encode relationships that linear models cannot learn on their own, and they often capture economically meaningful quantities.

Time-Based Features

Essential for any dataset with a time dimension:
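
A sketch of calendar components pulled from a timestamp column (the column name is illustrative):

```python
import pandas as pd

# Hypothetical event log with a timestamp column.
df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2023-12-22 09:15", "2023-12-25 14:40", "2024-01-02 08:05"])})

ts = df["timestamp"].dt
df["hour"] = ts.hour
df["day_of_week"] = ts.dayofweek          # 0 = Monday
df["is_weekend"] = ts.dayofweek >= 5
df["month"] = ts.month
df["days_in_month"] = ts.days_in_month
```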

For time-series prediction, the most valuable features are often lagged values of the target: yesterday’s sales, last week’s sales, same day last year’s sales.

Watch out for leakage: features that use information that wouldn’t have been available at prediction time. In time-series problems, this means always using lagged values, never the current value.
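
A minimal illustration of that discipline, assuming a daily sales series with illustrative names:

```python
import pandas as pd

# Hypothetical daily sales series, sorted by date.
sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "units": [12, 15, 9, 14, 20, 18, 11, 13, 16, 19],
})

# Correct: features come from strictly earlier rows via shift().
sales["units_lag_1"] = sales["units"].shift(1)
sales["units_lag_7"] = sales["units"].shift(7)

# Wrong (leakage): a rolling mean that includes the current day's value
# uses information not available at prediction time.
# sales["leaky_mean_7"] = sales["units"].rolling(7).mean()
sales["mean_7"] = sales["units"].shift(1).rolling(7).mean()  # safe version
```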

Aggregation Features

Summarizing information across groups:
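
For example, a sketch using pandas groupby/transform on hypothetical customer transactions:

```python
import pandas as pd

# Hypothetical transaction table: one row per purchase.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [20.0, 35.0, 5.0, 7.5, 6.0, 120.0],
})

# Group-level statistics broadcast back onto every row of the group.
tx["customer_mean_amount"] = tx.groupby("customer_id")["amount"].transform("mean")
tx["customer_tx_count"] = tx.groupby("customer_id")["amount"].transform("count")

# Relative position of this purchase within the customer's own history.
tx["amount_vs_customer_mean"] = tx["amount"] / tx["customer_mean_amount"]
```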

Aggregation features are how you encode population-level information into row-level models.

Text Features

Transforming unstructured text into numeric representations:
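
A minimal TF-IDF sketch with scikit-learn; the product descriptions are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical short product descriptions.
descriptions = [
    "organic whole milk 1l",
    "semi skimmed milk 2l",
    "dark chocolate bar 70 percent",
]

# TF-IDF turns each description into a sparse vector of term weights.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X_text = vectorizer.fit_transform(descriptions)
print(X_text.shape)  # (3, number_of_terms)
```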

The choice depends on what the text is capturing. Short, keyword-rich text (product descriptions, search queries) works well with TF-IDF. Long-form text with semantic content benefits from embedding models.

Categorical Encoding

How you encode categorical features matters:
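
A sketch contrasting one-hot encoding for a low-cardinality column with mean (target) encoding for a high-cardinality one; column names are illustrative:

```python
import pandas as pd

# Hypothetical data: a low-cardinality and a high-cardinality categorical.
df = pd.DataFrame({
    "store_format": ["super", "express", "super", "hyper"],
    "sku": ["A17", "B02", "A17", "C99"],
    "units": [10, 3, 12, 40],
})

# Low cardinality: one-hot encoding is cheap and interpretable.
df = pd.get_dummies(df, columns=["store_format"])

# High cardinality: mean (target) encoding compresses the category into one number.
# In practice this must be computed out-of-fold to avoid target leakage.
sku_means = df.groupby("sku")["units"].mean()
df["sku_mean_units"] = df["sku"].map(sku_means)
```

For gradient-boosted trees, LightGBM can also consume categorical columns natively, which often removes the need for explicit encoding.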

Feature Selection

Not all engineered features are useful. Irrelevant features add noise, increase the risk of overfitting, slow down training and inference, and make the model harder to interpret.

Selection methods:

Filter methods: Compute a metric (correlation, mutual information, variance) for each feature independently of the model. Fast, model-agnostic, but ignores interactions.

Wrapper methods: Use the model itself to evaluate feature subsets. Recursive Feature Elimination (RFE) is the canonical example. Slower but accounts for the model’s actual behavior.

Embedded methods: Feature selection built into the model training. L1 regularization (Lasso) for linear models drives coefficients to zero for unimportant features. Feature importance from tree-based models gives an embedded ranking.

Practical heuristic: Start with all features, train a LightGBM or XGBoost model, and look at the feature importance plot. Features with near-zero importance are candidates for removal — but verify by retraining without them and comparing performance.
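
One way that heuristic might look in code, using synthetic stand-in data in place of a real feature table:

```python
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import make_regression

# Stand-in data; in practice X and y come from your engineered feature table.
X, y = make_regression(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])

model = lgb.LGBMRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Rank features by how much the trees actually use them.
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.tail(5))  # near-zero entries are removal candidates
```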

Time-Series Feature Engineering

Time-series forecasting requires special care:

Lag features: The most straightforward temporal features. lag_1 is the value one period ago, lag_7 is the value one week ago. Essential for capturing auto-correlation in the target.

Rolling statistics: Window aggregates that move forward in time.
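
A minimal sketch, shifting by one period before rolling so the window contains only past values:

```python
import pandas as pd

# Hypothetical daily series, already sorted by date.
s = pd.Series([12, 15, 9, 14, 20, 18, 11, 13, 16, 19, 22, 17, 14, 21])

# Shift by one first so the window contains only past values (no leakage).
past = s.shift(1)
features = pd.DataFrame({
    "roll_mean_7": past.rolling(7).mean(),
    "roll_std_7": past.rolling(7).std(),
    "roll_max_7": past.rolling(7).max(),
})
```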

Calendar features: Day of week, week of year, month, and holiday flags; they capture seasonal patterns.

Fourier features: For strong periodic patterns (daily, weekly, annual cycles), Fourier series decomposition can capture the periodicity more cleanly than raw calendar features.
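
A sketch of annual Fourier terms for a daily series; the number of harmonics K is a tuning choice:

```python
import numpy as np
import pandas as pd

# Hypothetical daily index; encode annual seasonality with K sine/cosine pairs.
dates = pd.date_range("2023-01-01", periods=730, freq="D")
day_of_year = dates.dayofyear.to_numpy()

K = 3  # more harmonics capture sharper seasonal shapes
fourier = {}
for k in range(1, K + 1):
    fourier[f"sin_{k}"] = np.sin(2 * np.pi * k * day_of_year / 365.25)
    fourier[f"cos_{k}"] = np.cos(2 * np.pi * k * day_of_year / 365.25)

fourier_features = pd.DataFrame(fourier, index=dates)
```

With enough harmonics, these terms approximate a smooth seasonal shape while keeping the feature count small.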

Trend features: Slope of the rolling mean over a window. Positive trend, negative trend, flat — encoded as a feature.

Cross-series features: In panel data (many time series), aggregate across related series. The average sales of products in the same category, weighted by historical similarity, often improves forecasts for individual series.

The Training/Serving Skew Problem

The biggest practical failure mode in feature engineering is training/serving skew: the features you compute at training time don’t match what you compute at serving time.

This happens because training features are typically computed offline in batch pipelines (SQL, notebooks, pandas), while serving features are recomputed by separate online code, often in a different language and against data that arrives on a different schedule. The two implementations inevitably drift apart.

Mitigation:

  1. Use a feature store — a single, versioned system for computing and storing features that serves both training and inference
  2. Write feature computation as pure functions with explicit inputs and no hidden state (see the sketch after this list)
  3. Test your serving code with held-out historical periods and compare to training features
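
A minimal sketch of what item 2 can look like: one pure feature function with an explicit cutoff date, called from both the training pipeline and the serving path so the logic cannot drift (all names are illustrative):

```python
import pandas as pd

def compute_customer_features(transactions: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Pure function: output depends only on the inputs, with an explicit cutoff date."""
    past = transactions[transactions["ts"] < as_of]  # never look beyond the cutoff
    grouped = past.groupby("customer_id")["amount"]
    return pd.DataFrame({
        "total_spend": grouped.sum(),
        "tx_count": grouped.count(),
        "days_since_last_tx": (as_of - past.groupby("customer_id")["ts"].max()).dt.days,
    })

# Training uses historical cutoffs; serving passes "now". Same function, same logic.
```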

This is one of the most expensive mistakes to discover in production.

Case Study: Supply Chain Demand Forecasting

At Blue Yonder, feature engineering for store-level demand forecasting involved:

Base features: SKU, store, date, historical sales at multiple lags (1, 7, 14, 28, 365 days)

Seasonal features: Day of week, week of year, month, year, days until next holiday, is-payday-week

Trend features: 7-day rolling mean, 28-day rolling mean, year-over-year growth rate

Store-level features: Store size, store format, store location cluster, average basket size in the last 30 days

SKU-level features: SKU category, price tier, perishability flag, new-product flag (launched within the last 90 days), lifecycle stage

Promotion features: Is there an active promotion? What is the discount depth? How many days since the last promotion?

External features: Weather (temperature, precipitation), competitor pricing changes, out-of-stock events in previous week

The engineered feature set was 200+ columns from 8 raw data sources. The most important insight from feature importance analysis: lag features and promotion features dominated. Calendar features mattered for seasonality but less for week-to-week variance.

The Process

Feature engineering is iterative:

  1. Start with the simplest possible feature set (lags, calendar, category flags)
  2. Train a baseline model, evaluate, and identify where it fails
  3. Hypothesize features that would help the model in those failure cases
  4. Engineer the feature, add it, retrain, and measure the delta (see the sketch after this list)
  5. Remove features that don’t contribute — keep the feature set lean
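
A sketch of step 4, measuring the delta from one candidate feature with cross-validation; the data and feature names are stand-ins:

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

# Stand-in data; in practice X is your current feature table and `candidate` is the new column.
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])
candidate = pd.Series(np.random.rand(len(X)), name="new_feature")

def cv_mae(features: pd.DataFrame) -> float:
    model = lgb.LGBMRegressor(n_estimators=200, random_state=0)
    return -cross_val_score(model, features, y, cv=5,
                            scoring="neg_mean_absolute_error").mean()

baseline = cv_mae(X)
with_candidate = cv_mae(pd.concat([X, candidate], axis=1))
print(f"MAE delta: {baseline - with_candidate:+.4f}")  # positive = the feature helps
```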

Don’t engineer 200 features before training anything. Understand the baseline first, then target improvements.

The right way to think about it: each feature is a hypothesis about what information is relevant to the prediction. Feature engineering is hypothesis testing.
