Overview
At Blue Yonder, I worked on a 15-person team building cloud-native SaaS ML solutions for enterprise omnichannel supply-chain decision-making. The scale was unlike anything I’d worked on before: 5TB+ of live, noisy, high-dimensional retail and logistics data, with real-time forecasts driving operational decisions for some of the world’s largest retailers.
My contributions spanned 7 forecasting problem types — from store fulfillment capacity to markdown optimization — each with its own data characteristics, business objectives, and production constraints.
The Problem
Retail supply chains are decisions stacked on decisions: how much inventory to hold, when to replenish, how much capacity to reserve for fulfillment, when and how deeply to discount aging inventory. Each of these decisions had historically been made with siloed, ad-hoc models or manual rules that couldn’t adapt to changing demand patterns, stockouts, or external events.
The challenge wasn’t just modeling accuracy — it was building systems that could ingest 5TB+ of data, produce reliable predictions within latency budgets, and be understood by business stakeholders who needed to act on the outputs.
Why It Mattered
For enterprise retailers operating at scale, a 1% improvement in fulfillment capacity utilization or a 2% reduction in markdowns translates to tens of millions of dollars in impact. Wrong inventory positions create stockouts or overstock — both costly. Inaccurate markdown timing means leaving money on the table or burning margin unnecessarily. The forecasts weren’t academic — every output was connected to an operational decision.
Data & Inputs
- Historical sales data: SKU-level, store-level, timestamped — 5TB+ raw
- Inventory levels, purchase orders, supplier lead times
- Fulfillment center capacity logs — historical and real-time
- Return rates and reverse logistics patterns
- External data: calendar events, promotions, competitor pricing signals
- Demand signals: clickstream, web traffic, search trends (where available)
Data quality was a first-class problem. Missing values, encoding errors, outliers from one-off events, and inconsistent SKU definitions across client data sources were the norm, not the exception.
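As an illustration, here is a minimal sketch of the kind of cleaning pass this implies, using pandas on a hypothetical frame with `sku` and `units_sold` columns. The column names, percentile cap, and window size are illustrative, not the production values:

```python
import numpy as np
import pandas as pd

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass; columns and thresholds are hypothetical."""
    df = df.copy()
    # Normalize SKU identifiers that differ across client feeds.
    df["sku"] = df["sku"].astype(str).str.strip().str.upper()
    # Treat negative unit counts (encoding errors) as missing.
    df.loc[df["units_sold"] < 0, "units_sold"] = np.nan
    # Cap one-off event spikes at the 99.5th percentile per SKU.
    cap = df.groupby("sku")["units_sold"].transform(lambda s: s.quantile(0.995))
    df["units_sold"] = df["units_sold"].clip(upper=cap)
    # Fill remaining gaps with a short per-SKU rolling median.
    df["units_sold"] = df.groupby("sku")["units_sold"].transform(
        lambda s: s.fillna(s.rolling(7, min_periods=1).median())
    )
    return df
```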
Approach
Each forecasting problem required its own modeling strategy:
Store Fulfillment Capacity: Time-series forecasting (LSTM + XGBoost ensemble) to predict capacity demand 2–14 days ahead. Feature engineering on historical utilization patterns, day-of-week effects, and event flags.
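For flavor, a minimal, self-contained sketch of the ensemble idea on synthetic data: lag and day-of-week features feed XGBoost, raw utilization windows feed a small LSTM, and the two are blended. The features, layer sizes, and the fixed 50/50 weight are placeholders; the production models were more involved.

```python
import numpy as np
import xgboost as xgb
from tensorflow import keras

rng = np.random.default_rng(0)
n, window = 400, 28
# Synthetic utilization series with a weekly cycle standing in for real data.
util = 100 + 20 * np.sin(2 * np.pi * np.arange(n) / 7) + rng.normal(0, 5, n)

# Tabular features for the gradient-boosted component: lags + day-of-week.
lags = np.stack([np.roll(util, k) for k in (1, 7, 14)], axis=1)
X = np.column_stack([lags, np.arange(n) % 7])[14:]   # drop wrapped-around rows
y = util[14:]
gbm = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
gbm.fit(X[:-50], y[:-50])

# Sequence windows for a small LSTM on the raw series.
seqs = np.stack([util[i:i + window] for i in range(n - window)])[..., None]
targets = util[window:]
lstm = keras.Sequential([keras.layers.Input(shape=(window, 1)),
                         keras.layers.LSTM(32),
                         keras.layers.Dense(1)])
lstm.compile(optimizer="adam", loss="mae")
lstm.fit(seqs[:-50], targets[:-50], epochs=10, verbose=0)

# Blend the two components; in practice the weight is tuned on a validation split.
blend = 0.5 * gbm.predict(X[-50:]) + 0.5 * lstm.predict(seqs[-50:]).ravel()
```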
Delivery Date Estimation: Regression with uncertainty bounds. Critical: a confident wrong estimate is worse than an honest uncertain one. Used quantile regression to output confidence intervals, not point estimates.
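A minimal sketch of the quantile-regression pattern using LightGBM, which supports a pinball-loss objective; the features and gamma-distributed transit times here are synthetic stand-ins:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(2000, 4))              # hypothetical shipment features
y = 2 + 3 * X[:, 0] + rng.gamma(2.0, 1.0, 2000)    # transit days, right-skewed

# One model per quantile; [q10, q90] becomes the promised delivery window.
preds = {}
for q in (0.1, 0.5, 0.9):
    model = lgb.LGBMRegressor(objective="quantile", alpha=q, n_estimators=200)
    model.fit(X, y)
    preds[q] = model.predict(X[:3])

for lo, mid, hi in zip(preds[0.1], preds[0.5], preds[0.9]):
    print(f"ETA ~{mid:.1f} days (80% interval: {lo:.1f} to {hi:.1f})")
```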
Sales Returns Forecasting: Category-specific models — return rates for electronics differ structurally from apparel. Hierarchical models using the product taxonomy as structure, so sparse categories borrow strength from their parents.
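The simplest version of the "borrow strength" idea is empirical-Bayes shrinkage of a leaf category's return rate toward its parent; a toy sketch (the counts and prior strength k are made up):

```python
# Toy partial pooling: shrink sparse leaf-category return rates toward the parent.
leaves = {"tv": (120, 1000), "laptop": (250, 1500), "niche_cable": (3, 12)}
#                (returns, sales) per leaf category; hypothetical counts

parent_rate = sum(r for r, _ in leaves.values()) / sum(s for _, s in leaves.values())

k = 50  # prior strength: acts like k pseudo-sales at the parent's rate
for name, (r, s) in leaves.items():
    pooled = (r + k * parent_rate) / (s + k)
    print(f"{name}: raw={r / s:.3f}, shrunk={pooled:.3f}")
```

Sparse leaves (12 sales) move almost all the way to the parent rate while high-volume leaves barely move; full hierarchical models generalize this across multiple taxonomy levels.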
Replenishment Forecasting: Joint demand and lead-time modeling. The hard part: replenishment decisions need to account for supplier variability, not just demand variability.
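One way to make that concrete is to set reorder points from simulated lead-time demand, sampling the lead time rather than plugging in its average; a sketch with made-up distribution parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

def reorder_point(n_sims: int = 100_000, service_level: float = 0.95) -> float:
    # Stochastic lead time in days (supplier variability), mean ~4.5.
    L = rng.gamma(shape=9.0, scale=0.5, size=n_sims)
    # Demand accumulated over L days, normal approximation:
    # mean scales with L, standard deviation with sqrt(L).
    ltd = rng.normal(loc=40 * L, scale=12 * np.sqrt(L))
    return float(np.quantile(ltd, service_level))

print(f"Reorder below ~{reorder_point():.0f} units on hand")
```

Holding the lead time fixed at its mean would understate the tail and set the reorder point too low.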
Inventory Estimation: State estimation with Kalman filter components — tracking inventory levels between physical counts using transaction data.
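In one dimension the idea reduces to a few lines; a sketch where the state is on-hand units, recorded transactions drive the predict step, and occasional physical counts drive the update (the noise levels q and r are illustrative):

```python
def track_inventory(transactions, counts, q=4.0, r=25.0):
    """1-D Kalman filter: state = on-hand units.
    transactions[t]: net recorded movement (sales negative, receipts positive).
    counts[t]: physical count at time t, or None if no count happened.
    q: process noise (shrinkage, scan errors); r: count measurement noise.
    """
    x, p = counts[0], r                      # initialize at the first physical count
    estimates = [x]
    for t in range(1, len(transactions)):
        x, p = x + transactions[t], p + q    # predict: uncertainty grows
        if counts[t] is not None:
            k = p / (p + r)                  # Kalman gain
            x, p = x + k * (counts[t] - x), (1 - k) * p
        estimates.append(x)
    return estimates

tx = [0, -5, -3, 20, -7, -4, -6, 0]
obs = [100, None, None, None, None, None, None, 92]
print([round(e, 1) for e in track_inventory(tx, obs)])
```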
Markdown Optimization: Framed as a price-response problem. Trained demand-at-price models, then used optimization to find the markdown timing and depth that maximizes revenue recovery.
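A stripped-down version of that framing: a fitted demand-at-price curve plus a grid search over markdown depth. The exponential demand curve, horizon, and stock figures are invented, and the real system also optimized timing across weeks:

```python
import numpy as np

def demand(discount, base_units=50.0, elasticity=3.0):
    """Hypothetical fitted demand-at-price model: weekly units at a given depth."""
    return base_units * np.exp(elasticity * discount)

full_price, stock, weeks_left = 40.0, 600, 6

def revenue_recovery(discount):
    # Units sold over the horizon, capped by remaining stock;
    # anything unsold at the end recovers nothing in this toy version.
    units = min(stock, weeks_left * demand(discount))
    return units * full_price * (1 - discount)

grid = np.linspace(0.0, 0.7, 71)
best = max(grid, key=revenue_recovery)
print(f"Best depth: {best:.0%}, recovery: {revenue_recovery(best):,.0f}")
```

The optimum sits where deeper discounts stop buying enough extra sell-through to pay for the lost margin.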
Stockout Avoidance: Binary classification + threshold tuning. High recall requirement — a missed stockout is more costly than a false alarm.
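The threshold-tuning step in isolation, on synthetic imbalanced data with scikit-learn (the 95% recall floor is an illustrative number, not the production requirement):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier().fit(X_tr, y_tr)
probs = clf.predict_proba(X_val)[:, 1]

# Choose the highest threshold that still meets the recall floor:
# a missed stockout costs more than a false alarm.
precision, recall, thresholds = precision_recall_curve(y_val, probs)
target_recall = 0.95
idx = np.where(recall[:-1] >= target_recall)[0][-1]   # recall falls as threshold rises
print(f"threshold={thresholds[idx]:.3f}, precision at floor={precision[idx]:.3f}")
```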
Engineering & Implementation
The production stack was built for scale:
- Data pipeline: Apache Beam on Google Cloud Dataflow for bulk ingestion and feature engineering — parallel processing of multi-terabyte data at production volumes
- Training pipeline: TensorFlow Extended (TFX) with Kubeflow orchestration — reproducible, versioned, monitored model training
- Serving: BigQuery for batch inference; auto-scaled Kubernetes for real-time endpoints
- Feature store: centralized feature store to avoid training/serving skew — one of the most important architectural decisions
- Model versioning: Continuous deployment with canary rollouts — new model versions served to 5% of traffic before full promotion
- Monitoring: Data drift detection, prediction distribution monitoring, business metric tracking (a minimal drift-check sketch follows this list)
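Of these, drift detection is the easiest to sketch: a per-feature two-sample Kolmogorov-Smirnov test comparing live traffic against the training distribution. The alpha and the simulated shift are arbitrary, and the production monitoring covered more than single-feature tests:

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted(train_col, live_col, alpha=0.01):
    """Flag a feature whose live distribution departs from training."""
    _stat, p_value = ks_2samp(train_col, live_col)
    return p_value < alpha

rng = np.random.default_rng(3)
train = rng.normal(0, 1, 10_000)
live = rng.normal(0.3, 1, 2_000)   # simulated shifted live traffic
print("drift detected:", drifted(train, live))
```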
The team numbered 15+ engineers and ML practitioners. Working at this scale required discipline around interfaces, testing, and documentation that smaller teams don't always need.
Results & Impact
- 7 production models deployed across supply-chain verticals
- 5TB+ data processed reliably through production pipelines
- Forecasting systems serving enterprise retail clients at scale via SaaS platform
- Improved fulfillment capacity utilization and reduced inventory holding costs across the client base
- Personal contributions: delivery date estimation, replenishment forecasting, and markdown optimization models
Limitations & What I’d Do Differently
Hierarchical reconciliation between SKU-level and store-level forecasts was handled differently per model rather than systematically, which created inconsistencies when forecasts were aggregated. A unified hierarchical forecasting framework (e.g., MinT-style trace-minimization reconciliation) would have been cleaner.
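For reference, the special case of MinT with identity error weights is just an OLS projection onto the coherent subspace; a toy two-leaf hierarchy shows the mechanics:

```python
import numpy as np

# Toy hierarchy: total = A + B. Base forecasts are incoherent (45 + 50 != 100).
S = np.array([[1, 1],
              [1, 0],
              [0, 1]], dtype=float)      # summing matrix: rows = total, A, B
y_hat = np.array([100.0, 45.0, 50.0])    # independently produced base forecasts

# OLS reconciliation, i.e., MinT with identity error covariance.
P = np.linalg.inv(S.T @ S) @ S.T
y_tilde = S @ P @ y_hat
print(y_tilde)   # [98.33, 46.67, 51.67]: now A + B equals the total
```

Full MinT replaces the identity with an estimate of the base-forecast error covariance, which is what makes a shared framework worthwhile.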
The feature store was valuable but expensive to maintain. In retrospect, tighter discipline on which features actually moved the needle would have reduced complexity.
Stack
Python, TensorFlow, Keras, TFX, Kubeflow, Apache Beam, Google Dataflow, BigQuery, Kubernetes, XGBoost, LightGBM, Scikit-learn