ml Logistics & Supply Chain (SaaS) · 2022–2023

7 Production Forecasting Models Driving Replenishment & Markdown Decisions for Blue Yonder's Enterprise Retailers (5TB+)

Shipped 7 production forecasting models across Blue Yonder's supply-chain verticals - fulfillment, replenishment, inventory, returns, and markdown - running on 5TB+ of noisy retail data through a GCP/TFX/Beam/Dataflow stack. Their outputs fed the replenishment and markdown decisions of enterprise retail clients on the SaaS platform.

Problem

No unified forecasting system across fulfillment, replenishment, inventory, returns, and markdown decisions - each vertical ran on ad-hoc models or manual rules that could not adapt to demand shifts, stockouts, or external events.

Outcome

7 production forecasting models deployed on a GCP/TFX/Apache Beam/Dataflow stack processing 5TB+ of live supply-chain data, served to enterprise retail clients through Blue Yonder's SaaS platform by a 15-member team.

Impact - who used it & what changed

Blue Yonder's enterprise retail clients had their replenishment and markdown decisions driven directly by these models - forecasts of demand, lead time, and price response fed the planning systems that decided how much to reorder, when, and how deeply to discount aging inventory.

Overview

Blue Yonder sells supply-chain planning software to some of the world’s largest retailers. When those retailers decide how much stock to reorder, when to replenish, and how deeply to mark down aging inventory, those decisions run on forecasts. My job was to build the forecasts.

5TB+ of live, noisy, high-dimensional retail and logistics data. Seven distinct forecasting problems, each with its own data characteristics, business objective, and production constraint. Enterprise clients whose operational decisions were driven directly by the outputs.

I was not building research prototypes. I was shipping forecasting systems that fed the planning loop for real retailers - fulfillment capacity, inventory allocation, markdown timing, replenishment. Each model had to run reliably, within latency budgets, and produce outputs that the SaaS platform’s planning engines and client stakeholders could act on without a data scientist in the loop.

The Problem

Retail supply chains are decisions stacked on decisions. How much inventory to hold. When to replenish. How much capacity to reserve for fulfillment. When and how deeply to discount aging inventory. Each of these decisions had historically been made with siloed, ad-hoc models or manual rules that could not adapt to changing demand patterns, stockouts, or external events.

The challenge was not just modeling accuracy. It was building systems that could ingest 5TB+ of data, produce reliable predictions within latency budgets, and communicate results clearly to business stakeholders who needed to act on the outputs.

Why It Mattered

For enterprise retailers operating at scale, the economics are unforgiving: small percentage shifts in fulfillment capacity utilization or markdown waste translate to large absolute numbers. Wrong inventory positions create stockouts or overstock - both costly. Inaccurate markdown timing means leaving money on the table or burning margin unnecessarily. The forecasts were not academic. Every output was wired to an operational decision a client would actually make, which set the bar for reliability and honesty about uncertainty far higher than a research setting would.

Data & Inputs

Data quality was a first-class problem. Missing values, encoding errors, outliers from one-off events, and inconsistent SKU definitions across client data sources were the norm, not the exception. You do not get to ignore that in production.

Approach

Each forecasting problem required its own modeling strategy. No universal solution.

Store Fulfillment Capacity: Time-series forecasting (LSTM + XGBoost ensemble) to predict capacity demand 2 to 14 days ahead. Feature engineering on historical utilization patterns, day-of-week effects, and event flags.
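
As an illustration only (not the production pipeline), the feature engineering and ensembling described here might look roughly like the sketch below. The function names, the lag choices, and the simple weighted blend are assumptions made for the example; the actual LSTM and XGBoost models are not shown.

```python
from datetime import date, timedelta


def build_features(start, utilization, event_dates, horizon=2, lags=(1, 7)):
    """Build (features, target) rows for forecasting `horizon` days ahead.

    `utilization[i]` is the observed value on day `start + i`.
    Lag k is the value k days before the forecast origin t.
    """
    rows = []
    for t in range(max(lags), len(utilization) - horizon):
        target_day = start + timedelta(days=t + horizon)
        features = {f"lag_{k}": utilization[t - k] for k in lags}
        features["dow"] = target_day.weekday()           # 0 = Monday
        features["is_event"] = int(target_day in event_dates)
        rows.append((features, utilization[t + horizon]))
    return rows


def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two models' forecasts (hypothetical ensembling rule)."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(pred_a, pred_b)]


start = date(2023, 1, 2)  # a Monday
history = [float(50 + 5 * (i % 7)) for i in range(21)]  # synthetic weekly pattern
rows = build_features(start, history, event_dates={date(2023, 1, 16)})
blended = blend([10.0, 20.0], [20.0, 40.0], weight_a=0.25)
```

Each row pairs features known at the forecast origin with the value observed `horizon` days later, so the same builder can be reused across the 2 to 14 day range by varying `horizon`.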

Delivery Date Estimation: Regression with uncertainty bounds. Critical design decision here: a confident wrong estimate is worse than an honest uncertain one. Used quantile regression to output prediction intervals, not point estimates.
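
For readers unfamiliar with quantile regression, the sketch below shows the pinball loss that such models minimize and a simple coverage check for the resulting interval. It is a generic illustration with made-up numbers, not the production model; `pinball_loss` and `interval_coverage` are names chosen for this example.

```python
def pinball_loss(y_true, y_pred, q):
    """Average quantile (pinball) loss for quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


def interval_coverage(y_true, lower, upper):
    """Share of actuals that fall inside [lower, upper]."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)


actual_days = [2, 3, 3, 4, 7]  # hypothetical delivery times in days
p10 = [2, 2, 2, 3, 3]          # hypothetical 10th-percentile predictions
p90 = [4, 4, 5, 5, 6]          # hypothetical 90th-percentile predictions
coverage = interval_coverage(actual_days, p10, p90)
```

With q = 0.9, predicting too early costs nine times as much as predicting too late, which is what pushes the upper bound toward the 90th percentile of actual delivery times. Comparing `coverage` against the nominal 80% is one way to check whether the interval is honest.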

Sales Returns Forecasting: Category-specific models - return rates for electronics differ structurally from apparel. Hierarchical models structured along the product hierarchy.
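
One common way to use a product hierarchy is partial pooling: an item with little sales history borrows strength from its parent category. The sketch below is a generic shrinkage estimator given as an illustration, not the models that were shipped; `prior_strength` is a hypothetical tuning parameter.

```python
def shrunk_return_rate(returns, sales, parent_rate, prior_strength=50.0):
    """Blend an item's observed return rate with its parent category's rate.

    `prior_strength` behaves like a number of pseudo-sales observed at the
    parent rate: sparse items stay close to the category, high-volume items
    are dominated by their own data.
    """
    return (returns + prior_strength * parent_rate) / (sales + prior_strength)


category_rate = 0.20  # hypothetical category-level return rate
new_item = shrunk_return_rate(returns=0, sales=0, parent_rate=category_rate)
sparse_item = shrunk_return_rate(returns=3, sales=10, parent_rate=category_rate)
busy_item = shrunk_return_rate(returns=3000, sales=10000, parent_rate=category_rate)
```

Both example items have an observed return rate of 30%, but the sparse one is pulled most of the way back to the category rate while the high-volume one barely moves.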

Replenishment Forecasting: Joint demand and lead-time modeling. The hard part: replenishment decisions need to account for supplier variability, not just demand variability.
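
To show why supplier variability matters, here is the textbook reorder-point calculation that combines both sources of uncertainty. It assumes independent daily demand, lead time independent of demand, and a normal approximation; these are assumptions of the sketch, and it is not the joint model described above.

```python
from math import sqrt
from statistics import NormalDist


def reorder_point(mean_demand, sd_demand, mean_lead, sd_lead, service_level=0.95):
    """Reorder point = expected lead-time demand + safety stock.

    mean_demand, sd_demand: daily demand mean and standard deviation (units/day)
    mean_lead, sd_lead:     supplier lead time mean and standard deviation (days)
    """
    z = NormalDist().inv_cdf(service_level)
    # Variance of demand over a random lead time:
    #   E[L] * Var(D) + E[D]^2 * Var(L)
    sd_lead_time_demand = sqrt(mean_lead * sd_demand**2 + mean_demand**2 * sd_lead**2)
    return mean_demand * mean_lead + z * sd_lead_time_demand


reliable_supplier = reorder_point(100, 20, mean_lead=5, sd_lead=0)
erratic_supplier = reorder_point(100, 20, mean_lead=5, sd_lead=2)
```

With identical demand, the erratic supplier requires a noticeably higher reorder point, because the `mean_demand**2 * sd_lead**2` term dominates once lead times vary by a couple of days.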

Inventory Estimation: State estimation with Kalman filter components - tracking inventory levels between physical counts using transaction data.
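
A minimal scalar Kalman filter conveys the idea: transactions move the estimate forward and add uncertainty, and a physical count pulls the estimate back and reduces uncertainty. The variances and the function name below are hypothetical, and the production components were presumably richer than this sketch.

```python
def kalman_step(est, var, net_flow, flow_var, count=None, count_var=None):
    """One predict/update cycle for a single SKU's on-hand inventory.

    est, var:          current estimate and its variance
    net_flow:          receipts minus sales recorded since the last step
    flow_var:          variance added by unrecorded shrinkage or scan errors
    count, count_var:  optional physical count and its measurement variance
    """
    # Predict: apply recorded transactions, uncertainty grows.
    est = est + net_flow
    var = var + flow_var
    # Update: only when a physical count is available.
    if count is not None:
        gain = var / (var + count_var)
        est = est + gain * (count - est)
        var = (1 - gain) * var
    return est, var


est, var = kalman_step(100.0, 4.0, net_flow=-10.0, flow_var=1.0)
est2, var2 = kalman_step(est, var, net_flow=0.0, flow_var=0.0, count=86.0, count_var=5.0)
```

Because the count and the prediction have equal variance in this example, the gain is 0.5 and the updated estimate lands halfway between them.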

Markdown Optimization: Framed as a price-response problem. Trained demand-at-price models, then used optimization to find the markdown timing and depth that maximizes revenue recovery.
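
The two-stage idea (demand-at-price model, then optimization) can be sketched with a toy example. The constant-elasticity demand curve and the grid search below are stand-ins chosen for illustration; the sketch covers a single period and markdown depth only, not timing, and none of the numbers come from the project.

```python
def demand_at_price(base_demand, base_price, price, elasticity):
    """Hypothetical constant-elasticity demand curve (elasticity < 0)."""
    return base_demand * (price / base_price) ** elasticity


def best_markdown(base_demand, base_price, stock, elasticity, depths):
    """Grid-search the discount depth that maximizes revenue from remaining stock."""
    best = None
    for depth in depths:
        price = base_price * (1 - depth)
        # Sales are capped by the units actually left to sell.
        units = min(stock, demand_at_price(base_demand, base_price, price, elasticity))
        revenue = price * units
        if best is None or revenue > best[1]:
            best = (depth, revenue)
    return best


depth, revenue = best_markdown(
    base_demand=100, base_price=20.0, stock=300, elasticity=-2.0,
    depths=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
)
```

In this example the 40% markdown wins: a 50% markdown would sell out the stock, but at a price low enough that total revenue drops.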

Stockout Avoidance: Binary classification with threshold tuning. High recall requirement - a missed stockout is more costly than a false alarm.
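
Threshold tuning for a recall target can be sketched as follows: pick the highest score threshold that still catches the required share of true stockouts. The function name, scores, and labels are made up for the example and are not taken from the production classifier.

```python
def threshold_for_recall(scores, labels, target_recall):
    """Return the highest threshold whose recall on label 1 meets target_recall."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive example")
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        if true_pos / positives >= target_recall:
            return threshold
    raise ValueError("target_recall must be at most 1")


scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]  # hypothetical predicted stockout risk
labels = [1, 0, 1, 1, 0, 0]              # 1 = stockout actually occurred
strict = threshold_for_recall(scores, labels, target_recall=0.95)
relaxed = threshold_for_recall(scores, labels, target_recall=0.60)
```

Raising the recall target lowers the threshold, which trades more false alarms for fewer missed stockouts.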

Engineering & Implementation

Built for scale from day one. Not retrofitted.

The team was 15+ engineers and machine learning practitioners. Working at this scale required discipline around interfaces, testing, and documentation that smaller teams rarely develop.

Results & Impact

What changed because of this work: enterprise retailers on Blue Yonder’s platform stopped running replenishment and markdown decisions on siloed, manual rules and started running them on these models. The forecasts became inputs to the planning systems clients used every day.

A note on numbers: client-specific accuracy gains and dollar impact are Blue Yonder’s to report, not mine. This case study describes the systems and the decisions they drove.

Limitations & What I’d Do Differently

Hierarchical reconciliation between SKU-level and store-level forecasts was handled differently per model rather than systematically. This created inconsistencies at aggregation. A unified hierarchical forecasting framework would have been cleaner.
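
The simplest systematic option is bottom-up reconciliation, where store-level numbers are always derived by summing SKU-level forecasts. The sketch below illustrates that one approach together with a check for the kind of inconsistency described above; it is an assumption about what a unified framework could start from, not something that was built.

```python
def bottom_up(sku_forecasts):
    """Sum (store, sku) forecasts into store totals so aggregates are coherent."""
    store_totals = {}
    for (store, _sku), forecast in sku_forecasts.items():
        store_totals[store] = store_totals.get(store, 0.0) + forecast
    return store_totals


def incoherence(store_forecasts, sku_forecasts):
    """Gap between independently produced store forecasts and the SKU sums."""
    totals = bottom_up(sku_forecasts)
    return {store: store_forecasts[store] - totals.get(store, 0.0)
            for store in store_forecasts}


sku_level = {("S1", "A"): 10.0, ("S1", "B"): 15.0, ("S2", "A"): 7.0}  # hypothetical
store_level = {"S1": 30.0, "S2": 7.0}                                  # hypothetical
gaps = incoherence(store_level, sku_level)
```

A non-zero gap, like the 5 units for store S1 here, is exactly the aggregation inconsistency that per-model reconciliation leaves behind.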

The feature store was valuable but expensive to maintain. Tighter discipline on which features actually moved the needle would have reduced complexity without sacrificing performance.

Stack

Python, TensorFlow, Keras, TFX, Kubeflow, Apache Beam, Google Dataflow, BigQuery, Kubernetes, XGBoost, LightGBM, Scikit-learn
