Feature engineering is the work of transforming raw data into the representations a model can use to learn the patterns that matter. It’s also one of the least glamorous and most impactful things you can do in a data science project.
A dataset of raw transactions is not useful to a classifier. A dataset of engineered features — recency, frequency, monetary value; time since last purchase; category distribution; anomaly flags — can be powerful. The model doesn’t create these insights. You do.
Why Features Matter More Than Models
The choice of model (XGBoost vs. LightGBM vs. neural network) matters far less than the quality of your feature representation. Here’s why:
A gradient boosted tree with well-engineered features will consistently outperform a deep neural network on tabular data with poor features. The model is a lens — it can only focus on what you put in front of it. If your features don’t contain the relevant information, no model complexity will recover it.
The corollary: feature engineering is where domain knowledge becomes competitive advantage. A data scientist who understands the supply chain can engineer features that capture supplier lead-time variability, seasonal demand patterns, and stockout risk in ways that a generalist cannot. The model doesn’t know what a supply chain is. You do.
The Taxonomy of Feature Types
Raw Features
The original columns in your dataset. Sometimes useful directly; more often, they need transformation.
Transformed Features
Applying mathematical transformations to raw features:
- Log transform: stabilizes variance for right-skewed distributions (sales volume, prices, counts). Using log(1 + x) handles zero values.
- Box-Cox: a family of power transforms that includes log as a special case, with an optimizable lambda.
- Standardization: zero mean, unit variance. Required for distance-based methods (KNN, SVM). Not strictly necessary for tree-based methods.
- Min-max scaling: scales to [0, 1]. Sensitive to outliers — use robust scaling (IQR-based) for noisy data.
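A minimal sketch of these transforms with NumPy and scikit-learn (the DataFrame and column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, RobustScaler

# Illustrative data; column names are made up for the example.
df = pd.DataFrame({"sales": [0, 12, 45, 300, 7],
                   "price": [1.99, 5.49, 3.00, 12.50, 0.99]})

# Log transform: log1p handles zeros and compresses the right tail.
df["log_sales"] = np.log1p(df["sales"])

# Standardization: zero mean, unit variance (needed for distance-based methods).
df["price_std"] = StandardScaler().fit_transform(df[["price"]]).ravel()

# Robust scaling: median/IQR based, less sensitive to outliers than min-max.
df["price_robust"] = RobustScaler().fit_transform(df[["price"]]).ravel()
```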
Interaction Features
Products and ratios of two or more features:
- Price × quantity = revenue
- Sales / capacity = utilization rate
- Current inventory / average weekly demand = weeks of cover
Interaction features encode relationships that linear models cannot learn on their own, and they often capture economically meaningful quantities.
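With hypothetical column names, each of the examples above is a single line of pandas:

```python
import pandas as pd

# Hypothetical columns; each ratio mirrors an example above.
df = pd.DataFrame({
    "price": [2.5, 4.0], "quantity": [10, 3],
    "sales": [25.0, 12.0], "capacity": [40.0, 20.0],
    "inventory": [120.0, 35.0], "avg_weekly_demand": [30.0, 14.0],
})

df["revenue"] = df["price"] * df["quantity"]
df["utilization"] = df["sales"] / df["capacity"]
df["weeks_of_cover"] = df["inventory"] / df["avg_weekly_demand"]
```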
Time-Based Features
Essential for any dataset with a time dimension:
- Day of week, month, quarter, year
- Is it a holiday? Is it the day before a holiday?
- Days since an event (last purchase, last restock, last price change)
- Rolling statistics: 7-day average, 30-day standard deviation, 90-day trend slope
For time-series prediction, the most valuable features are often lagged values of the target: yesterday’s sales, last week’s sales, same day last year’s sales.
Watch out for leakage: features that use information that wouldn't have been available at prediction time. In time series, this means always using lagged values, never the current value of the target.
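A leakage-safe sketch in pandas: every feature is shifted by at least one period so only past information is used (the series and window lengths are illustrative).

```python
import pandas as pd

# Hypothetical daily sales series indexed by date.
sales = pd.Series(range(100), index=pd.date_range("2024-01-01", periods=100, freq="D"))

features = pd.DataFrame({
    "lag_1": sales.shift(1),                        # yesterday's sales
    "lag_7": sales.shift(7),                        # same weekday last week
    # Shift before rolling so the window ends *before* the prediction date.
    "rolling_mean_7": sales.shift(1).rolling(7).mean(),
    "rolling_std_30": sales.shift(1).rolling(30).std(),
})
features["dayofweek"] = features.index.dayofweek
features["month"] = features.index.month
```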
Aggregation Features
Summarizing information across groups:
- Average purchase value per customer
- Variance of sales across stores in the same region
- Count of anomalies in the last 30 days per sensor
Aggregation features are how you encode population-level information into row-level models.
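In pandas, groupby(...).transform() is one way to attach a group-level statistic to every row; a sketch with made-up columns:

```python
import pandas as pd

# Hypothetical transaction-level data.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "store_id": ["A", "A", "B", "B", "A"],
    "amount": [10.0, 25.0, 5.0, 7.5, 40.0],
})

# transform() broadcasts the group-level statistic back onto every row.
tx["avg_purchase_per_customer"] = tx.groupby("customer_id")["amount"].transform("mean")
tx["amount_std_per_store"] = tx.groupby("store_id")["amount"].transform("std")
```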
Text Features
Transforming unstructured text into numeric representations:
- TF-IDF: term frequency × inverse document frequency — the classic baseline
- Bag of words: count matrix of word occurrences
- Word embeddings: dense vector representations that capture semantic similarity
- Sentence transformers: pre-trained models that embed full sentences into vectors
The choice depends on what the text is capturing. Short, keyword-rich text (product descriptions, search queries) works well with TF-IDF. Long-form text with semantic content benefits from embedding models.
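For the TF-IDF baseline, scikit-learn's TfidfVectorizer is a common starting point; a minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Short, keyword-rich product descriptions (illustrative).
docs = [
    "organic whole milk 1l",
    "semi skimmed milk 2 pints",
    "dark chocolate bar 70 percent",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x terms
print(X.shape, vectorizer.get_feature_names_out()[:5])
```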
Categorical Encoding
How you encode categorical features matters:
- One-hot encoding: appropriate for low-cardinality categoricals with no ordinal relationship. Creates sparse features.
- Label encoding: appropriate for ordinal categoricals (low/medium/high → 1/2/3) or for tree-based models that can exploit ordinal splits.
- Target encoding: replace the category with the mean target value for that category. Powerful but leakage-prone — use cross-fitting or out-of-fold encoding (see the sketch after this list).
- Hash encoding: handles high-cardinality categoricals (millions of unique values) by hashing to a fixed-size space.
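A sketch of the out-of-fold variant of target encoding (the function name and defaults are my own choices, not a standard API):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_target_encode(df, cat_col, target_col, n_splits=5):
    """Encode each row with target means computed only on the other folds."""
    encoded = pd.Series(np.nan, index=df.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, valid_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[valid_idx] = df.iloc[valid_idx][cat_col].map(fold_means).to_numpy()
    # Categories unseen in a fold fall back to the global mean.
    return encoded.fillna(df[target_col].mean())
```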
Feature Selection
Not all engineered features are useful. Adding irrelevant features:
- Increases training time
- Increases model complexity (risk of overfitting)
- Can actually hurt performance by adding noise the model must learn to ignore
Selection methods:
Filter methods: Compute a metric (correlation, mutual information, variance) for each feature independently of the model. Fast, model-agnostic, but ignores interactions.
Wrapper methods: Use the model itself to evaluate feature subsets. Recursive Feature Elimination (RFE) is the canonical example. Slower but accounts for the model’s actual behavior.
Embedded methods: Feature selection built into the model training. L1 regularization (Lasso) for linear models drives coefficients to zero for unimportant features. Feature importance from tree-based models gives an embedded ranking.
Practical heuristic: Start with all features, train a LightGBM or XGBoost model, and look at the feature importance plot. Features with near-zero importance are candidates for removal — but verify by retraining without them and comparing performance.
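A sketch of that heuristic using LightGBM's scikit-learn API (X is assumed to be a DataFrame of engineered features and y the target; the hyperparameters are placeholders):

```python
import lightgbm as lgb
import pandas as pd

# X: DataFrame of engineered features, y: target (assumed to already exist).
model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X, y)

importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.head(20))           # strongest features
print(importance[importance == 0])   # near-zero: candidates for removal, verify by retraining
```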
Time-Series Feature Engineering
Time-series forecasting requires special care:
Lag features: The most straightforward temporal features. lag_1 is the value one period ago; lag_7 is the value seven periods ago (one week ago for daily data). Essential for capturing autocorrelation in the target.
Rolling statistics: Window aggregates that move forward in time.
- Rolling mean: smoothed trend
- Rolling standard deviation: local volatility
- Rolling min/max: recent range
- Rolling autocorrelation: how correlated is the series with itself at lag k?
Calendar features: Capture seasonal patterns.
- Day of week (Monday vs. Friday patterns)
- Week of year (holiday seasonality)
- Month of year (annual seasonality)
- Is-holiday, days-until-holiday, days-since-holiday
Fourier features: For strong periodic patterns (daily, weekly, annual cycles), Fourier series decomposition can capture the periodicity more cleanly than raw calendar features.
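A sketch of Fourier terms for a daily series (the function name, periods, and harmonic counts are illustrative choices):

```python
import numpy as np
import pandas as pd

def fourier_features(index: pd.DatetimeIndex, period_days: float, order: int) -> pd.DataFrame:
    """Sine/cosine pairs for a seasonal cycle with the given period in days."""
    t = np.asarray((index - index[0]).days, dtype=float)
    cols = {}
    for k in range(1, order + 1):
        cols[f"sin_{period_days:g}_{k}"] = np.sin(2 * np.pi * k * t / period_days)
        cols[f"cos_{period_days:g}_{k}"] = np.cos(2 * np.pi * k * t / period_days)
    return pd.DataFrame(cols, index=index)

# Weekly and annual cycles for a daily index; `order` sets how many harmonics to keep.
idx = pd.date_range("2024-01-01", periods=730, freq="D")
seasonal = pd.concat([fourier_features(idx, 7, 3), fourier_features(idx, 365.25, 5)], axis=1)
```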
Trend features: Slope of the rolling mean over a window. Positive trend, negative trend, flat — encoded as a feature.
Cross-series features: In panel data (many time series), aggregate across related series. The average sales of products in the same category, weighted by historical similarity, often improves forecasts for individual series.
The Training/Serving Skew Problem
The biggest practical failure mode in feature engineering is training/serving skew: the features you compute at training time don’t match what you compute at serving time.
This happens because:
- Training features are computed offline with the full historical dataset
- Serving features are computed online with only what’s currently available
- The code paths diverge and subtle differences accumulate
Mitigation:
- Use a feature store — a single, versioned system for computing and storing features that serves both training and inference
- Write feature computation as pure functions with explicit inputs and no hidden state (see the sketch below)
- Test your serving code with held-out historical periods and compare to training features
This is one of the most expensive mistakes to discover in production.
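A sketch of the pure-function idea from the list above, using an invented weeks-of-cover feature: the same function, with explicit inputs and no hidden state, is called from both the offline training pipeline and the online serving path.

```python
import pandas as pd

def weeks_of_cover(current_inventory: float, recent_weekly_demand: pd.Series) -> float:
    """Weeks of cover from current inventory and recent weekly demand.

    Pure function: no database reads, no global state. Training and serving
    both call this with explicitly passed inputs, so the code paths cannot diverge.
    """
    avg_demand = recent_weekly_demand.tail(4).mean()
    if avg_demand == 0:
        return float("inf")
    return current_inventory / avg_demand
```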
Case Study: Supply Chain Demand Forecasting
At Blue Yonder, feature engineering for store-level demand forecasting involved:
Base features: SKU, store, date, historical sales at multiple lags (1, 7, 14, 28, 365 days)
Seasonal features: Day of week, week of year, month, year, days until next holiday, is-payday-week
Trend features: 7-day rolling mean, 28-day rolling mean, year-over-year growth rate
Store-level features: Store size, store format, store location cluster, average basket size in the last 30 days
SKU-level features: SKU category, price tier, perishability flag, new-product flag (product less than 90 days since launch), lifecycle stage
Promotion features: Is there an active promotion? What is the discount depth? How many days since the last promotion?
External features: Weather (temperature, precipitation), competitor pricing changes, out-of-stock events in previous week
The engineered feature set was 200+ columns from 8 raw data sources. The most important insight from feature importance analysis: lag features and promotion features dominated. Calendar features mattered for seasonality but less for week-to-week variance.
The Process
Feature engineering is iterative:
- Start with the simplest possible feature set (lags, calendar, category flags)
- Train a baseline model, evaluate, and identify where it fails
- Hypothesize features that would help the model in those failure cases
- Engineer the feature, add it, retrain, and measure the delta
- Remove features that don’t contribute — keep the feature set lean
Don’t engineer 200 features before training anything. Understand the baseline first, then target improvements.
The right way to think about it: each feature is a hypothesis about what information is relevant to the prediction. Feature engineering is hypothesis testing.