Overview
Blue Yonder sells supply-chain planning software to some of the world’s largest retailers. When those retailers decide how much stock to reorder, when to replenish, and how deeply to mark down aging inventory, those decisions run on forecasts. My job was to build the forecasts.
5TB+ of live, noisy, high-dimensional retail and logistics data. Seven distinct forecasting problems, each with its own data characteristics, business objective, and production constraint. Enterprise clients whose operational decisions were driven directly by the outputs.
I was not building research prototypes. I was shipping forecasting systems that fed the planning loop for real retailers - fulfillment capacity, inventory allocation, markdown timing, replenishment. Each model had to run reliably, within latency budgets, and produce outputs that the SaaS platform’s planning engines and client stakeholders could act on without a data scientist in the loop.
The Problem
Retail supply chains are decisions stacked on decisions. How much inventory to hold. When to replenish. How much capacity to reserve for fulfillment. When and how deeply to discount aging inventory. Each of these decisions had historically been made with siloed, ad-hoc models or manual rules that could not adapt to changing demand patterns, stockouts, or external events.
The challenge was not just modeling accuracy. It was building systems that could ingest 5TB+ of data, produce reliable predictions within latency budgets, and communicate results clearly to business stakeholders who needed to act on the outputs.
Why It Mattered
For enterprise retailers operating at scale, the economics are unforgiving: small percentage shifts in fulfillment capacity utilization or markdown waste translate to large absolute numbers. Wrong inventory positions create stockouts or overstock - both costly. Inaccurate markdown timing means leaving money on the table or burning margin unnecessarily. The forecasts were not academic. Every output was wired to an operational decision a client would actually make, which set the bar for reliability and honesty about uncertainty far higher than a research setting would.
Data & Inputs
- Historical sales data: SKU-level, store-level, timestamped - 5TB+ raw
- Inventory levels, purchase orders, supplier lead times
- Fulfillment center capacity logs - historical and real-time
- Return rates and reverse logistics patterns
- External data: calendar events, promotions, competitor pricing signals
- Demand signals: clickstream, web traffic, search trends where available
Data quality was a first-class problem. Missing values, encoding errors, outliers from one-off events, and inconsistent SKU definitions across client data sources were the norm, not the exception. You do not get to ignore that in production.
Approach
Each forecasting problem required its own modeling strategy. No universal solution.
Store Fulfillment Capacity: Time-series forecasting (LSTM + XGBoost ensemble) to predict capacity demand 2 to 14 days ahead. Feature engineering on historical utilization patterns, day-of-week effects, and event flags.
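How the two model families were combined is not spelled out above. One common option, shown here purely as an assumption and not as the production method, is a convex blend whose weight is fit by least squares on a validation window. The function names and toy numbers are hypothetical.

```python
def blend_weight(actuals, pred_a, pred_b):
    """Least-squares weight w for w * pred_a + (1 - w) * pred_b, clipped to [0, 1]."""
    num = sum((y - b) * (a - b) for y, a, b in zip(actuals, pred_a, pred_b))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    if den == 0:
        return 0.5  # the two models agree everywhere, so any weight works
    return min(1.0, max(0.0, num / den))


def blend(pred_a, pred_b, w):
    """Convex combination of two forecast series."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]


# Toy validation window: one model runs 2 units high, the other 2 units low.
actuals = [10, 20, 30]
sequence_model = [12, 22, 32]  # stand-in for the LSTM output
tree_model = [8, 18, 28]       # stand-in for the XGBoost output
w = blend_weight(actuals, sequence_model, tree_model)
combined = blend(sequence_model, tree_model, w)
```

Clipping the weight to [0, 1] keeps the blend inside the range spanned by the two models, which makes the ensemble easier to reason about when one model degrades.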
Delivery Date Estimation: Regression with uncertainty bounds. Critical design decision here: a confident wrong estimate is worse than an honest uncertain one. Used quantile regression to output prediction intervals rather than bare point estimates.
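Quantile regression rests on the pinball loss, which penalizes under- and over-prediction asymmetrically. The sketch below is a minimal illustration of that loss, not the production model: it fits only a constant forecast, and the delivery times, the 0.85 quantile level, and the function names are all assumptions.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile level q in (0, 1)."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = actual - predicted
        # Under-prediction (error > 0) costs q per unit; over-prediction costs 1 - q.
        total += q * error if error >= 0 else (q - 1.0) * error
    return total / len(y_true)


def best_constant_quantile(y_true, q, candidates):
    """Pick the candidate constant forecast with the lowest pinball loss."""
    return min(candidates, key=lambda c: pinball_loss(y_true, [c] * len(y_true), q))


# Hypothetical observed delivery times in days.
observed_days = [2, 3, 3, 4, 4, 4, 5, 6, 9, 12]
upper_bound = best_constant_quantile(observed_days, 0.85, range(1, 13))
```

With a high quantile level, promising too early costs far more than promising too late, so the fitted upper bound lands near the slow tail of the data (9 days here) rather than near the typical delivery time.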
Sales Returns Forecasting: Category-specific models - return rates for electronics differ structurally from apparel. Hierarchical models with product hierarchy as structure.
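One simple way to use a product hierarchy as structure is partial pooling: a SKU with little sales history borrows its return rate from its category, and a SKU with plenty of history mostly speaks for itself. The sketch below assumes a pseudo-count style shrinkage; `prior_strength` and the example numbers are hypothetical, and the production hierarchical models were not necessarily of this form.

```python
def shrunk_return_rate(returns, sales, category_rate, prior_strength):
    """Blend a SKU's observed return rate with its category rate.

    prior_strength acts like a number of pseudo-sales observed at the
    category rate, so sparse SKUs are pulled toward the category.
    """
    return (returns + prior_strength * category_rate) / (sales + prior_strength)


# A brand-new apparel SKU with no history falls back to the category rate.
new_sku = shrunk_return_rate(returns=0, sales=0, category_rate=0.25, prior_strength=20)

# A SKU with 20 sales and 10 returns lands between its own 50% and the category's 25%.
sparse_sku = shrunk_return_rate(returns=10, sales=20, category_rate=0.25, prior_strength=20)
```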
Replenishment Forecasting: Joint demand and lead-time modeling. The hard part: replenishment decisions need to account for supplier variability, not just demand variability.
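To see why supplier variability matters, consider the textbook reorder-point formula for demand during lead time. It assumes per-period demand is independent and identically distributed, that demand is independent of lead time, and that the total is roughly normal. It is an illustration of the idea, not the joint model described above, and all parameter names and values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def reorder_point(mean_demand, sd_demand, mean_lead, sd_lead, service_level):
    """Reorder point covering demand during a variable lead time.

    Variance of lead-time demand = mean_lead * sd_demand^2 + mean_demand^2 * sd_lead^2.
    """
    z = NormalDist().inv_cdf(service_level)
    expected = mean_demand * mean_lead
    variance = mean_lead * sd_demand ** 2 + mean_demand ** 2 * sd_lead ** 2
    return expected + z * sqrt(variance)


# Same demand profile, with and without supplier lead-time variability.
reliable_supplier = reorder_point(100, 20, 4, 0.0, 0.95)
variable_supplier = reorder_point(100, 20, 4, 1.5, 0.95)
```

The second term of the variance scales with the square of mean demand, so for fast-moving items even a modest lead-time spread can dominate the safety stock.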
Inventory Estimation: State estimation with Kalman filter components - tracking inventory levels between physical counts using transaction data.
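The core of a Kalman-style inventory tracker fits in a scalar filter: transactions move the estimate and grow its uncertainty, and a physical count pulls the estimate back and shrinks the uncertainty. The sketch below is a one-dimensional illustration with hypothetical noise values, not the production state model.

```python
from dataclasses import dataclass


@dataclass
class InventoryEstimate:
    level: float     # estimated units on hand
    variance: float  # uncertainty of that estimate


def predict(est, net_flow, process_var):
    """Apply recorded receipts minus sales; unrecorded shrinkage adds uncertainty."""
    return InventoryEstimate(est.level + net_flow, est.variance + process_var)


def update(est, counted, count_var):
    """Correct the estimate with a (noisy) physical count."""
    gain = est.variance / (est.variance + count_var)
    return InventoryEstimate(
        est.level + gain * (counted - est.level),
        (1.0 - gain) * est.variance,
    )


# Three days of transactions, then a physical count of 99 units.
estimate = InventoryEstimate(level=100.0, variance=4.0)
for net_flow in (-10.0, -5.0, 20.0):
    estimate = predict(estimate, net_flow, process_var=2.0)
estimate = update(estimate, counted=99.0, count_var=10.0)
```

Here the book value says 105 and the count says 99. Because both are equally uncertain at that point, the filter settles halfway at 102 and halves the variance.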
Markdown Optimization: Framed as a price-response problem. Trained demand-at-price models, then used optimization to find the markdown timing and depth that maximizes revenue recovery.
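The two-stage shape (a demand-at-price model, then a search over markdown plans) can be sketched with a constant-elasticity demand curve and a brute-force grid. Everything specific here is an assumption for illustration: the elasticity form, zero salvage value for unsold stock, the weekly granularity, and the function names.

```python
from itertools import product


def simulate_revenue(stock, weeks, base_units, base_price, elasticity, start_week, depth):
    """Revenue from selling down `stock` with a markdown of `depth` from `start_week` on."""
    revenue = 0.0
    for week in range(weeks):
        discount = depth if week >= start_week else 0.0
        price = base_price * (1.0 - discount)
        # Constant-elasticity demand: a negative elasticity lifts demand as price drops.
        demand = base_units * (1.0 - discount) ** elasticity
        sold = min(demand, stock)
        revenue += price * sold
        stock -= sold
    return revenue


def best_markdown(stock, weeks, base_units, base_price, elasticity, start_weeks, depths):
    """Grid-search the (start_week, depth) plan with the highest simulated revenue."""
    return max(
        product(start_weeks, depths),
        key=lambda plan: simulate_revenue(
            stock, weeks, base_units, base_price, elasticity, plan[0], plan[1]
        ),
    )


# Elastic product with more stock than full-price demand can clear in four weeks.
plan = best_markdown(100, 4, 10, 50.0, -2.0, start_weeks=[0, 1, 2], depths=[0.0, 0.2, 0.5])
```

A grid search is only practical for small plan spaces. With many SKUs and price steps, a proper optimizer over the same revenue function would be the natural replacement.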
Stockout Avoidance: Binary classification with threshold tuning. High recall requirement - a missed stockout is more costly than a false alarm.
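Threshold tuning for a recall target can be as simple as scanning candidate thresholds from strictest to loosest and stopping at the first one that catches enough true stockouts. A minimal sketch, with hypothetical scores, labels, and function names:

```python
def recall_at(scores, labels, threshold):
    """Share of true stockouts (label 1) flagged at the given score threshold."""
    positives = sum(labels)
    if positives == 0:
        return 0.0
    caught = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    return caught / positives


def pick_threshold(scores, labels, min_recall):
    """Highest observed score threshold that still meets the recall target."""
    for threshold in sorted(set(scores), reverse=True):
        if recall_at(scores, labels, threshold) >= min_recall:
            return threshold
    return min(scores)  # target unreachable: flag everything


# Hypothetical validation scores and stockout labels.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
strict = pick_threshold(scores, labels, min_recall=1.0)
relaxed = pick_threshold(scores, labels, min_recall=0.66)
```

Taking the highest threshold that meets the target keeps false alarms as low as the recall requirement allows.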
Engineering & Implementation
Built for scale from day one. Not retrofitted.
- Data pipeline: Apache Beam on Google Cloud Dataflow for bulk ingestion and feature engineering - parallel processing of multi-terabyte data at production volumes
- Training pipeline: TFX with Kubeflow orchestration - reproducible, versioned, monitored model training
- Serving: BigQuery for batch inference; auto-scaled Kubernetes for real-time endpoints
- Feature store: Centralized feature store to prevent training/serving skew - one of the most important architectural decisions we made
- Model versioning: Continuous deployment with canary rollouts - new model versions served to 5% of traffic before full promotion
- Monitoring: Data drift detection, prediction distribution monitoring, business metric tracking
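As an illustration of the drift-detection piece, the Population Stability Index (PSI) compares how a feature is distributed at training time versus in live traffic. This is one common metric, not necessarily the one used in production. The bin edges, the `eps` floor, and the 0.25 alert level (a widely quoted rule of thumb) are assumptions.

```python
import math


def bin_shares(values, edges, eps=1e-6):
    """Share of values per bin; edges must be sorted ascending."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(1 for e in edges if v >= e)] += 1
    # Floor empty bins at eps so the log ratio below stays finite.
    return [max(c / len(values), eps) for c in counts]


def psi(expected, actual, edges):
    """Population Stability Index between a reference and a live sample."""
    ref = bin_shares(expected, edges)
    live = bin_shares(actual, edges)
    return sum((a - e) * math.log(a / e) for e, a in zip(ref, live))


# Hypothetical feature values at training time vs. a shifted live window.
training = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
live_window = [v + 5 for v in training]
drift_score = psi(training, live_window, edges=[4, 7])
needs_review = drift_score > 0.25
```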
The team was 15+ engineers and machine learning practitioners. Working at this scale required discipline around interfaces, testing, and documentation that smaller teams rarely develop.
Results & Impact
What changed because of this work: enterprise retailers on Blue Yonder’s platform stopped running replenishment and markdown decisions on siloed, manual rules and started running them on these models. The forecasts became inputs to the planning systems clients used every day.
- 7 production models deployed across supply-chain verticals, processing 5TB+ of live data reliably through production pipelines
- Replenishment decisions - how much to reorder and when - were driven by the joint demand-and-lead-time forecasts, accounting for supplier variability rather than demand alone
- Markdown decisions - when to discount aging inventory and how deeply - were driven by the demand-at-price models and the optimization layer on top of them
- Fulfillment and delivery commitments were set against capacity forecasts and honest, uncertainty-bounded delivery-date estimates rather than point guesses
- Outputs were served to enterprise retail clients at scale through the SaaS platform, replacing per-vertical ad-hoc models with a consistent, monitored forecasting layer
- My direct contributions: the delivery date estimation, replenishment forecasting, and markdown optimization models - the three that sit closest to the client’s day-to-day reorder and discount decisions
A note on numbers: client-specific accuracy gains and dollar impact are Blue Yonder’s to report, not mine. This case study describes the systems and the decisions they drove.
Limitations & What I’d Do Differently
Hierarchical reconciliation between SKU-level and store-level forecasts was handled differently per model rather than systematically. This created inconsistencies at aggregation. A unified hierarchical forecasting framework would have been cleaner.
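To make the aggregation inconsistency concrete: SKU-level forecasts for a store rarely sum to the store-level forecast unless something forces them to. The sketch below shows the two simplest reconciliation rules, bottom-up and proportional top-down. It is a hypothetical illustration of what a unified framework would enforce, not something that was built, and the names and numbers are made up.

```python
def bottom_up(sku_forecasts):
    """Define the store forecast as the sum of its SKU forecasts."""
    return sum(sku_forecasts.values())


def top_down_proportional(store_forecast, sku_forecasts):
    """Rescale SKU forecasts so they sum to the store forecast, keeping their mix."""
    total = sum(sku_forecasts.values())
    if total == 0:
        raise ValueError("cannot allocate proportionally when SKU forecasts sum to zero")
    return {sku: store_forecast * f / total for sku, f in sku_forecasts.items()}


# Independent models disagree: the SKUs sum to 100, the store model says 120.
sku_level = {"sku_a": 60.0, "sku_b": 40.0}
store_level = 120.0
gap = store_level - bottom_up(sku_level)
reconciled = top_down_proportional(store_level, sku_level)
```

Whichever rule is chosen, applying the same one across all models is what removes the inconsistency at aggregation.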
The feature store was valuable but expensive to maintain. Tighter discipline on which features actually moved the needle would have reduced complexity without sacrificing performance.
Stack
Python, TensorFlow, Keras, TFX, Kubeflow, Apache Beam, Google Cloud Dataflow, BigQuery, Kubernetes, XGBoost, LightGBM, Scikit-learn