Feature engineering is the work of transforming raw data into the representations a model can use to learn the patterns that matter. It’s also one of the least glamorous and most impactful things you can do in a data science project.
A dataset of raw transactions is not useful to a classifier. A dataset of engineered features — recency, frequency, monetary value; time since last purchase; category distribution; anomaly flags — can be powerful. The model doesn’t create these insights. You do.
Why Features Matter More Than Models
The choice of model (XGBoost vs. LightGBM vs. neural network) matters far less than the quality of your feature representation. Here’s why:
A gradient boosted tree with well-engineered features will consistently outperform a deep neural network on tabular data with poor features. The model is a lens — it can only focus on what you put in front of it. If your features don’t contain the relevant information, no model complexity will recover it.
The corollary: feature engineering is where domain knowledge becomes competitive advantage. A data scientist who understands the supply chain can engineer features that capture supplier lead-time variability, seasonal demand patterns, and stockout risk in ways that a generalist cannot. The model doesn’t know what a supply chain is. You do.
The Taxonomy of Feature Types
Raw Features
The original columns in your dataset. Sometimes useful directly; more often, they need transformation.
Transformed Features
Applying mathematical transformations to raw features:
- Log transform: stabilizes variance for right-skewed distributions (sales volume, prices, counts). Using log(1 + x) handles zero values.
- Box-Cox: a family of power transforms that includes log as a special case, with an optimizable lambda.
- Standardization: zero mean, unit variance. Required for distance-based methods (KNN, SVM). Not strictly necessary for tree-based methods.
- Min-max scaling: scales to [0, 1]. Sensitive to outliers — use robust scaling (IQR-based) for noisy data.
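A minimal sketch of these transforms with NumPy and scikit-learn (the DataFrame and column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, RobustScaler

# Illustrative data; column names are made up for the example.
df = pd.DataFrame({"sales": [0, 12, 45, 300, 7],
                   "price": [1.99, 5.49, 3.00, 12.50, 0.99]})

# Log transform: log1p handles zeros and compresses the right tail.
df["log_sales"] = np.log1p(df["sales"])

# Standardization: zero mean, unit variance (needed for distance-based methods).
df["price_std"] = StandardScaler().fit_transform(df[["price"]]).ravel()

# Robust scaling: median/IQR based, less sensitive to outliers than min-max.
df["price_robust"] = RobustScaler().fit_transform(df[["price"]]).ravel()
```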
Interaction Features
Products and ratios of two or more features:
- Price × quantity = revenue
- Sales / capacity = utilization rate
- Current inventory / average weekly demand = weeks of cover
Interaction features encode relationships that linear models cannot learn on their own, and they often capture economically meaningful quantities.
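With hypothetical column names, each of the examples above is a single line of pandas:

```python
import pandas as pd

# Hypothetical columns; each ratio mirrors an example above.
df = pd.DataFrame({
    "price": [2.5, 4.0], "quantity": [10, 3],
    "sales": [25.0, 12.0], "capacity": [40.0, 20.0],
    "inventory": [120.0, 35.0], "avg_weekly_demand": [30.0, 14.0],
})

df["revenue"] = df["price"] * df["quantity"]
df["utilization"] = df["sales"] / df["capacity"]
df["weeks_of_cover"] = df["inventory"] / df["avg_weekly_demand"]
```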
Time-Based Features
Essential for any dataset with a time dimension:
- Day of week, month, quarter, year
- Is it a holiday? Is it the day before a holiday?
- Days since an event (last purchase, last restock, last price change)
- Rolling statistics: 7-day average, 30-day standard deviation, 90-day trend slope
For time-series prediction, the most valuable features are often lagged values of the target: yesterday’s sales, last week’s sales, same day last year’s sales.
Watch out for leakage: features that use information that wouldn't have been available at prediction time. In time series, this means always using lagged values, never the current value of the target.
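A leakage-safe sketch in pandas: every feature is shifted by at least one period so only past information is used (the series and window lengths are illustrative).

```python
import pandas as pd

# Hypothetical daily sales series indexed by date.
sales = pd.Series(range(100), index=pd.date_range("2024-01-01", periods=100, freq="D"))

features = pd.DataFrame({
    "lag_1": sales.shift(1),                        # yesterday's sales
    "lag_7": sales.shift(7),                        # same weekday last week
    # Shift before rolling so the window ends *before* the prediction date.
    "rolling_mean_7": sales.shift(1).rolling(7).mean(),
    "rolling_std_30": sales.shift(1).rolling(30).std(),
})
features["dayofweek"] = features.index.dayofweek
features["month"] = features.index.month
```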
Aggregation Features
Summarizing information across groups:
- Average purchase value per customer
- Variance of sales across stores in the same region
- Count of anomalies in the last 30 days per sensor
Aggregation features are how you encode population-level information into row-level models.
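In pandas, groupby(...).transform() is one way to attach a group-level statistic to every row; a sketch with made-up columns:

```python
import pandas as pd

# Hypothetical transaction-level data.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "store_id": ["A", "A", "B", "B", "A"],
    "amount": [10.0, 25.0, 5.0, 7.5, 40.0],
})

# transform() broadcasts the group-level statistic back onto every row.
tx["avg_purchase_per_customer"] = tx.groupby("customer_id")["amount"].transform("mean")
tx["amount_std_per_store"] = tx.groupby("store_id")["amount"].transform("std")
```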
Text Features
Transforming unstructured text into numeric representations:
- TF-IDF: term frequency × inverse document frequency — the classic baseline
- Bag of words: count matrix of word occurrences
- Word embeddings: dense vector representations that capture semantic similarity
- Sentence transformers: pre-trained models that embed full sentences into vectors
The choice depends on what the text is capturing. Short, keyword-rich text (product descriptions, search queries) works well with TF-IDF. Long-form text with semantic content benefits from embedding models.
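For the TF-IDF baseline, scikit-learn's TfidfVectorizer is a common starting point; a minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Short, keyword-rich product descriptions (illustrative).
docs = [
    "organic whole milk 1l",
    "semi skimmed milk 2 pints",
    "dark chocolate bar 70 percent",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x terms
print(X.shape, vectorizer.get_feature_names_out()[:5])
```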
Categorical Encoding
How you encode categorical features matters:
- One-hot encoding: appropriate for low-cardinality categoricals with no ordinal relationship. Creates sparse features.
- Label encoding: appropriate for ordinal categoricals (low/medium/high → 1/2/3) or for tree-based models that can exploit ordinal splits.
- Target encoding: replace the category with the mean target value for that category. Powerful but leakage-prone — use cross-fitting or out-of-fold encoding (see the sketch after this list).
- Hash encoding: handles high-cardinality categoricals (millions of unique values) by hashing to a fixed-size space.
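A sketch of the out-of-fold variant of target encoding (the function name and defaults are my own choices, not a standard API):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_target_encode(df, cat_col, target_col, n_splits=5):
    """Encode each row with target means computed only on the other folds."""
    encoded = pd.Series(np.nan, index=df.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, valid_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[valid_idx] = df.iloc[valid_idx][cat_col].map(fold_means).to_numpy()
    # Categories unseen in a fold fall back to the global mean.
    return encoded.fillna(df[target_col].mean())
```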
Feature Selection
Not all engineered features are useful. Adding irrelevant features:
- Increases training time
- Increases model complexity (risk of overfitting)
- Can actually hurt performance by adding noise the model must learn to ignore
Selection methods:
Filter methods: Compute a metric (correlation, mutual information, variance) for each feature independently of the model. Fast, model-agnostic, but ignores interactions.
Wrapper methods: Use the model itself to evaluate feature subsets. Recursive Feature Elimination (RFE) is the canonical example. Slower but accounts for the model’s actual behavior.
Embedded methods: Feature selection built into the model training. L1 regularization (Lasso) for linear models drives coefficients to zero for unimportant features. Feature importance from tree-based models gives an embedded ranking.
Practical heuristic: Start with all features, train a LightGBM or XGBoost model, and look at the feature importance plot. Features with near-zero importance are candidates for removal — but verify by retraining without them and comparing performance.
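A sketch of that heuristic using LightGBM's scikit-learn API (X is assumed to be a DataFrame of engineered features and y the target; the hyperparameters are placeholders):

```python
import lightgbm as lgb
import pandas as pd

# X: DataFrame of engineered features, y: target (assumed to already exist).
model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X, y)

importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.head(20))           # strongest features
print(importance[importance == 0])   # near-zero: candidates for removal, verify by retraining
```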
Time-Series Feature Engineering
Time-series forecasting requires special care:
Lag features: The most straightforward temporal features. lag_1 is the value one period ago; lag_7 is the value seven periods ago (one week ago for daily data). Essential for capturing autocorrelation in the target.
Rolling statistics: Window aggregates that move forward in time.
- Rolling mean: smoothed trend
- Rolling standard deviation: local volatility
- Rolling min/max: recent range
- Rolling autocorrelation: how correlated is the series with itself at lag k?
Calendar features: Capture seasonal patterns.
- Day of week (Monday vs. Friday patterns)
- Week of year (holiday seasonality)
- Month of year (annual seasonality)
- Is-holiday, days-until-holiday, days-since-holiday
Fourier features: For strong periodic patterns (daily, weekly, annual cycles), Fourier series decomposition can capture the periodicity more cleanly than raw calendar features.
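A sketch of Fourier terms for a daily series (the function name, periods, and harmonic counts are illustrative choices):

```python
import numpy as np
import pandas as pd

def fourier_features(index: pd.DatetimeIndex, period_days: float, order: int) -> pd.DataFrame:
    """Sine/cosine pairs for a seasonal cycle with the given period in days."""
    t = np.asarray((index - index[0]).days, dtype=float)
    cols = {}
    for k in range(1, order + 1):
        cols[f"sin_{period_days:g}_{k}"] = np.sin(2 * np.pi * k * t / period_days)
        cols[f"cos_{period_days:g}_{k}"] = np.cos(2 * np.pi * k * t / period_days)
    return pd.DataFrame(cols, index=index)

# Weekly and annual cycles for a daily index; `order` sets how many harmonics to keep.
idx = pd.date_range("2024-01-01", periods=730, freq="D")
seasonal = pd.concat([fourier_features(idx, 7, 3), fourier_features(idx, 365.25, 5)], axis=1)
```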
Trend features: Slope of the rolling mean over a window. Positive trend, negative trend, flat — encoded as a feature.
Cross-series features: In panel data (many time series), aggregate across related series. The average sales of products in the same category, weighted by historical similarity, often improves forecasts for individual series.
The Training/Serving Skew Problem
The biggest practical failure mode in feature engineering is training/serving skew: the features you compute at training time don’t match what you compute at serving time.
This happens because:
- Training features are computed offline with the full historical dataset
- Serving features are computed online with only what’s currently available
- The code paths diverge and subtle differences accumulate
Mitigation:
- Use a feature store — a single, versioned system for computing and storing features that serves both training and inference
- Write feature computation as pure functions with explicit inputs and no hidden state (see the sketch below)
- Test your serving code with held-out historical periods and compare to training features
This is one of the most expensive mistakes to discover in production.
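A sketch of the pure-function idea from the list above, using an invented weeks-of-cover feature: the same function, with explicit inputs and no hidden state, is called from both the offline training pipeline and the online serving path.

```python
import pandas as pd

def weeks_of_cover(current_inventory: float, recent_weekly_demand: pd.Series) -> float:
    """Weeks of cover from current inventory and recent weekly demand.

    Pure function: no database reads, no global state. Training and serving
    both call this with explicitly passed inputs, so the code paths cannot diverge.
    """
    avg_demand = recent_weekly_demand.tail(4).mean()
    if avg_demand == 0:
        return float("inf")
    return current_inventory / avg_demand
```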
Case Study: Supply Chain Demand Forecasting
At Blue Yonder, feature engineering for store-level demand forecasting involved:
Base features: SKU, store, date, historical sales at multiple lags (1, 7, 14, 28, 365 days)
Seasonal features: Day of week, week of year, month, year, days until next holiday, is-payday-week
Trend features: 7-day rolling mean, 28-day rolling mean, year-over-year growth rate
Store-level features: Store size, store format, store location cluster, average basket size in the last 30 days
SKU-level features: SKU category, price tier, perishability flag, new-product flag (product less than 90 days since launch), lifecycle stage
Promotion features: Is there an active promotion? What is the discount depth? How many days since the last promotion?
External features: Weather (temperature, precipitation), competitor pricing changes, out-of-stock events in previous week
The engineered feature set was 200+ columns from 8 raw data sources. The most important insight from feature importance analysis: lag features and promotion features dominated. Calendar features mattered for seasonality but less for week-to-week variance.
The Process
Feature engineering is iterative:
- Start with the simplest possible feature set (lags, calendar, category flags)
- Train a baseline model, evaluate, and identify where it fails
- Hypothesize features that would help the model in those failure cases
- Engineer the feature, add it, retrain, and measure the delta
- Remove features that don’t contribute — keep the feature set lean
Don’t engineer 200 features before training anything. Understand the baseline first, then target improvements.
The right way to think about it: each feature is a hypothesis about what information is relevant to the prediction. Feature engineering is hypothesis testing.