Demand and Inventory Forecasting at Scale

Overview

At Blue Yonder, I worked on a 15-person team building cloud-native SaaS ML solutions for enterprise omnichannel supply-chain decision-making. The scale was unlike anything I’d worked on before: 5TB+ of live, noisy, high-dimensional retail and logistics data, with real-time forecasts driving operational decisions for some of the world’s largest retailers.

My contributions spanned 7 forecasting problem types — from store fulfillment capacity to markdown optimization — each with its own data characteristics, business objectives, and production constraints.

The Problem

Retail supply chains are decisions stacked on decisions: how much inventory to hold, when to replenish, how much capacity to reserve for fulfillment, when and how deeply to discount aging inventory. Each of these decisions had historically been made with siloed, ad-hoc models or manual rules that couldn’t adapt to changing demand patterns, stockouts, or external events.

The challenge wasn’t just modeling accuracy — it was building systems that could ingest 5TB+ of data, produce reliable predictions within latency budgets, and be understood by business stakeholders who needed to act on the outputs.

Why It Mattered

For enterprise retailers operating at scale, a 1% improvement in fulfillment capacity utilization or a 2% reduction in markdowns translates to tens of millions of dollars in impact. Wrong inventory positions create stockouts or overstock — both costly. Inaccurate markdown timing means leaving money on the table or burning margin unnecessarily. The forecasts weren’t academic — every output was connected to an operational decision.

Data & Inputs

Historical sales data: SKU-level, store-level, timestamped — 5TB+ raw
Inventory levels, purchase orders, supplier lead times
Fulfillment center capacity logs — historical and real-time
Return rates and reverse logistics patterns
External data: calendar events, promotions, competitor pricing signals
Demand signals: clickstream, web traffic, search trends (where available)

Data quality was a first-class problem. Missing values, encoding errors, outliers from one-off events, and inconsistent SKU definitions across client data sources were the norm, not the exception.

Approach

Each forecasting problem required its own modeling strategy:

Store Fulfillment Capacity: Time-series forecasting (LSTM + XGBoost ensemble) to predict capacity demand 2–14 days ahead. Feature engineering on historical utilization patterns, day-of-week effects, and event flags.

Delivery Date Estimation: Regression with uncertainty bounds. Critical: a confident wrong estimate is worse than an honest uncertain one. Used quantile regression to output prediction intervals rather than bare point estimates.
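A minimal sketch of the idea behind the interval output, not the production model: quantile regression minimizes the pinball loss, and fitting it at a low and a high quantile gives the two ends of an interval. For brevity this fits only a constant per quantile; the sample delivery times and helper names are hypothetical.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss at quantile q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


def best_constant(y_true, q):
    """Pick the observed value that minimizes pinball loss at quantile q."""
    return min(y_true, key=lambda c: pinball_loss(y_true, [c] * len(y_true), q))


# Hypothetical delivery times in days for one shipping lane.
days = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 9]
lower = best_constant(days, 0.1)
upper = best_constant(days, 0.9)
print(f"80% interval: {lower} to {upper} days")
```

A real model would replace the constant with a function of order features, trained once per quantile with the same loss.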

Sales Returns Forecasting: Category-specific models, since return rates for electronics differ structurally from those for apparel. Hierarchical models used the product hierarchy as their grouping structure.
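One simple way to picture what the product hierarchy buys you is partial pooling: a SKU with little history borrows from its category, while a SKU with plenty of history mostly keeps its own rate. The sketch below is a stand-in for illustration, not the models described above; the category rates and the prior_strength value are hypothetical.

```python
def shrunk_return_rate(sku_returns, sku_sales, category_rate, prior_strength=50):
    """Blend a SKU's observed return rate with its category rate.

    prior_strength acts like a count of pseudo-sales at the category rate
    (a hypothetical tuning value, not one from the original system).
    """
    return (sku_returns + prior_strength * category_rate) / (sku_sales + prior_strength)


# Hypothetical category-level return rates.
category_rates = {"apparel": 0.25, "electronics": 0.08}

# New apparel SKU: 3 returns on 4 sales (raw rate 0.75) is pulled toward 0.25.
new_sku = shrunk_return_rate(3, 4, category_rates["apparel"])
# Established apparel SKU: 300 returns on 2,000 sales stays near its own 0.15.
old_sku = shrunk_return_rate(300, 2000, category_rates["apparel"])
print(round(new_sku, 3), round(old_sku, 3))
```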

Replenishment Forecasting: Joint demand and lead-time modeling. The hard part: replenishment decisions need to account for supplier variability, not just demand variability.
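To see why supplier variability matters, consider the textbook safety-stock formula for demand over a variable lead time. This is a sketch under strong assumptions (independent daily demand, lead time independent of demand, normal approximation), not the joint model itself; all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def safety_stock(mean_demand, sd_demand, mean_lead, sd_lead, service_level=0.95):
    """Safety stock when both daily demand and lead time (in days) vary."""
    demand_part = mean_lead * sd_demand ** 2      # demand noise over the lead time
    supplier_part = (mean_demand * sd_lead) ** 2  # lead-time noise scaled by demand
    z = NormalDist().inv_cdf(service_level)
    return z * sqrt(demand_part + supplier_part)


# Hypothetical SKU: 100 units/day (sd 20) with a 7-day average lead time.
reliable = safety_stock(100, 20, 7, sd_lead=0.0)
variable = safety_stock(100, 20, 7, sd_lead=2.0)
print(round(reliable), round(variable))
```

In this example, two days of lead-time standard deviation more than triples the buffer that demand noise alone would call for.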

Inventory Estimation: State estimation with Kalman filter components — tracking inventory levels between physical counts using transaction data.
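The predict/update cycle can be sketched with a scalar Kalman filter for a single SKU. This is an illustrative reduction, assuming one state variable and hand-picked noise variances; the function name, parameters, and data are hypothetical.

```python
def kalman_inventory(initial, initial_var, net_flows, counts,
                     process_var=4.0, count_var=1.0):
    """Track on-hand units for one SKU with a scalar Kalman filter.

    net_flows[t]: receipts minus sales recorded in period t.
    counts[t]: a physical count for period t, or None if none was taken.
    """
    estimate, variance = float(initial), float(initial_var)
    history = []
    for flow, count in zip(net_flows, counts):
        # Predict: apply recorded transactions; uncertainty grows because
        # shrinkage and mis-scans never appear in the transaction log.
        estimate += flow
        variance += process_var
        # Update: a physical count pulls the estimate toward the observed value.
        if count is not None:
            gain = variance / (variance + count_var)
            estimate += gain * (count - estimate)
            variance *= (1.0 - gain)
        history.append((estimate, variance))
    return history


# Hypothetical week: daily net flows, with one physical count on day 4.
flows = [-5, -3, 10, -4, -6]
counts = [None, None, None, 95, None]
history = kalman_inventory(100, 1.0, flows, counts)
for day, (est, var) in enumerate(history, start=1):
    print(day, round(est, 1), round(var, 2))
```

The variance climbs on days without a count and collapses on the day a count arrives, which is the signal a downstream system can use to decide how much to trust the estimate.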

Markdown Optimization: Framed as a price-response problem. Trained demand-at-price models, then used optimization to find the markdown timing and depth that maximize revenue recovery.
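A toy version of the two-step structure (demand-at-price model, then optimization) might look like the following. It assumes a constant-elasticity demand curve and searches only markdown depth over a fixed clearance window, so timing is left out; every parameter value is hypothetical.

```python
def weekly_demand(price, base_price=40.0, base_units=50.0, elasticity=2.5):
    """Constant-elasticity demand curve (hypothetical parameters)."""
    return base_units * (price / base_price) ** (-elasticity)


def best_markdown(inventory, weeks_left, base_price=40.0, depths=None):
    """Grid-search the markdown depth that maximizes clearance revenue."""
    depths = depths or [d / 100 for d in range(0, 75, 5)]
    best = None
    for depth in depths:
        price = base_price * (1 - depth)
        # Sales are capped by the stock actually on hand.
        units_sold = min(inventory, weekly_demand(price, base_price) * weeks_left)
        revenue = price * units_sold
        if best is None or revenue > best[1]:
            best = (depth, revenue)
    return best


depth, revenue = best_markdown(inventory=600, weeks_left=4)
print(f"markdown {depth:.0%}, expected revenue {revenue:,.0f}")
```

With elastic demand, revenue rises as the price drops until the stock sells through; past that point, deeper cuts only give away margin, so the search settles near the sell-through price.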

Stockout Avoidance: Binary classification + threshold tuning. High recall requirement — a missed stockout is more costly than a false alarm.
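Threshold tuning under a recall requirement can be sketched as: take the highest score cutoff that still catches the required share of true stockouts, which keeps false alarms as low as that constraint allows. The function name, the 0.95 target, and the sample scores are assumptions for illustration.

```python
def pick_threshold(scores, labels, min_recall=0.95):
    """Return (threshold, recall, precision) for the highest cutoff meeting min_recall.

    scores: predicted stockout probabilities; labels: 1 if a stockout occurred.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one observed stockout")
    for threshold in sorted(set(scores), reverse=True):
        caught = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        if caught / positives >= min_recall:
            flagged = sum(1 for s in scores if s >= threshold)
            return threshold, caught / positives, caught / flagged
    raise ValueError("no threshold reaches the requested recall")


# Hypothetical validation scores and outcomes.
scores = [0.92, 0.81, 0.77, 0.64, 0.40, 0.35, 0.22, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
threshold, recall, precision = pick_threshold(scores, labels)
print(threshold, recall, round(precision, 2))
```

On this sample a default cutoff of 0.5 would miss the stockout scored at 0.35; lowering the cutoff recovers it at the price of two false alarms.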

Engineering & Implementation

The production stack was built for scale:

Data pipeline: Apache Beam on Google Cloud Dataflow for bulk ingestion and feature engineering, with parallel processing of multi-terabyte data
Training pipeline: TensorFlow Extended (TFX) with Kubeflow orchestration — reproducible, versioned, monitored model training
Serving: BigQuery for batch inference; auto-scaled Kubernetes for real-time endpoints
Feature store: centralized feature store to avoid training/serving skew — one of the most important architectural decisions
Model versioning: Continuous deployment with canary rollouts — new model versions served to 5% of traffic before full promotion
Monitoring: Data drift detection, prediction distribution monitoring, business metric tracking
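As one concrete example of a data drift check, the population stability index (PSI) compares the binned distribution of a feature at training time with what the model sees in production. The source does not say which drift metric was used, so treat this as an illustrative choice; the bin edges and samples are hypothetical.

```python
from math import log


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two samples over fixed bin edges.

    Values outside the edges are not counted in any bin.
    """
    def shares(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(edges) - 1):
                last = i == len(edges) - 2
                # Bins are half-open, except the last one, which includes its right edge.
                if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                    counts[i] += 1
                    break
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values at training time and after a shift in production.
train = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
shifted = [v + 2 for v in train]
edges = [0, 2, 4, 6, 8]
print(round(psi(train, train, edges), 3), round(psi(train, shifted, edges), 3))
```

A commonly cited rule of thumb treats PSI above roughly 0.25 as a shift worth investigating, though the right alert level depends on the feature and the bins.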

The team was 15+ engineers and ML practitioners. Working at this scale required discipline around interfaces, testing, and documentation that smaller teams don’t always need.

Results & Impact

7 production models deployed across supply-chain verticals
5TB+ data processed reliably through production pipelines
Forecasting systems serving enterprise retail clients at scale via SaaS platform
Improved fulfillment capacity utilization and reduced inventory holding costs across client base
Personal contributions: delivery date estimation, replenishment forecasting, and markdown optimization models

Limitations & What I’d Do Differently

Hierarchical reconciliation between SKU-level and store-level forecasts was handled differently per model rather than systematically, so forecasts did not always add up consistently when aggregated. A unified reconciliation framework, such as MinT (minimum trace) reconciliation or a similar method, would have been cleaner.
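To make the aggregation issue concrete, here is a sketch of the simplest systematic fix: ordinary least squares reconciliation, which is the identity-weight special case of MinT. The two-store hierarchy and the forecast values are hypothetical, and full MinT would additionally weight by an estimate of the forecast-error covariance.

```python
import numpy as np

# Summing matrix for a hypothetical hierarchy: total = store_a + store_b.
# Rows: total, store_a, store_b. Columns: bottom-level series.
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])


def reconcile_ols(base_forecasts, S):
    """Project incoherent base forecasts onto the set of coherent forecasts."""
    # Least-squares estimate of the bottom level from all base forecasts.
    bottom = np.linalg.solve(S.T @ S, S.T @ base_forecasts)
    # Re-aggregate so every level adds up by construction.
    return S @ bottom


# Independent models disagree: 100 at the top, 60 + 50 = 110 at the bottom.
base = np.array([100.0, 60.0, 50.0])
coherent = reconcile_ols(base, S)
print(coherent.round(2))
```

The reconciled forecasts split the 10-unit disagreement across all three series, so the total and the store-level numbers agree regardless of which model produced them.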

The feature store was valuable but expensive to maintain. In retrospect, tighter discipline on which features actually moved the needle would have reduced complexity.

Stack

Python, TensorFlow, Keras, TFX, Kubeflow, Apache Beam, Google Dataflow, BigQuery, Kubernetes, XGBoost, LightGBM, Scikit-learn
