Why Scaling Is a Distinct Problem
Training a model to good accuracy on a benchmark dataset is a different problem from running that model in production on millions of requests, and both are different from training in the first place once the data has grown beyond what fits on a single machine. Scaling in ML has three separate dimensions (data, compute, and serving), and each requires different techniques.
Scaling Data: Beyond Single-Machine Memory
When training data exceeds RAM or disk on a single machine, two loading strategies apply, along with a third concern for preprocessing:
Out-of-core learning: process data in batches from disk. Scikit-learn’s partial_fit API and TensorFlow’s tf.data pipeline both support this. The key constraint: the model must be updatable incrementally — batch gradient descent becomes mini-batch stochastic gradient descent.
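A minimal out-of-core sketch using scikit-learn's partial_fit (the file name, chunk size, and label column are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# SGDClassifier supports incremental updates via partial_fit.
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # all classes must be declared on the first call

# Stream the dataset from disk in chunks that fit in memory.
for chunk in pd.read_csv("train.csv", chunksize=100_000):
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    # One mini-batch SGD pass over this chunk only; the full dataset
    # never needs to be resident in memory at once.
    model.partial_fit(X, y, classes=classes)
```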
Distributed data loading: partition the dataset across machines. Each worker loads its shard, computes gradients on it, and contributes to the global update. TensorFlow's tf.data service and PyTorch's DataLoader with a distributed sampler implement this. The bottleneck shifts from compute to I/O: the input pipeline must keep the GPUs fed, or the accelerators sit idle waiting for data.
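A sketch of per-rank sharded loading with PyTorch's DistributedSampler, assuming the process group is already initialized (e.g. via torchrun) and that dataset and num_epochs are defined elsewhere:

```python
from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(dataset)  # each rank sees a disjoint shard
loader = DataLoader(
    dataset,
    batch_size=64,
    sampler=sampler,
    num_workers=4,      # parallel I/O workers to keep the GPU fed
    pin_memory=True,    # faster host-to-device transfers
    prefetch_factor=2,  # overlap loading of upcoming batches with compute
)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffles shards consistently across ranks
    for batch in loader:
        ...  # forward/backward on this rank's shard
```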
Feature preprocessing at scale: when the full feature engineering pipeline (normalization stats, vocabulary, embeddings) must be computed over the full dataset before training, tools like Apache Beam (for distributed computation) and TFX Transform (for feature preprocessing that persists across training and serving) are necessary. Computing the global mean and variance for normalization over 5TB requires a distributed pass — not something that can be done in memory.
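As a sketch, that distributed mean/variance pass might look like the following Beam pipeline (the input path and field name are assumptions):

```python
import apache_beam as beam

class MeanVarFn(beam.CombineFn):
    """Accumulate (count, sum, sum of squares) so the global mean and
    variance can be computed in a single distributed pass."""
    def create_accumulator(self):
        return 0, 0.0, 0.0

    def add_input(self, acc, x):
        n, s, ss = acc
        return n + 1, s + x, ss + x * x

    def merge_accumulators(self, accs):
        counts, sums, sumsqs = zip(*accs)
        return sum(counts), sum(sums), sum(sumsqs)

    def extract_output(self, acc):
        n, s, ss = acc
        mean = s / n
        return {"mean": mean, "var": ss / n - mean * mean}

with beam.Pipeline() as p:
    _ = (
        p
        | beam.io.ReadFromParquet("gs://bucket/features/*.parquet")
        | beam.Map(lambda row: row["demand"])  # field name is illustrative
        | beam.CombineGlobally(MeanVarFn())
        | beam.Map(print)  # in practice: write to storage for reuse
    )
```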
Scaling Compute: Distributed Training
Data Parallelism
The dominant approach: replicate the model across $N$ devices, partition the batch across devices, compute gradients independently on each, then aggregate and update.
Synchronous data parallelism (AllReduce): at each step, all devices compute gradients on their batch shard, then execute an AllReduce operation to sum gradients across all devices before updating. The update is globally consistent — every device sees the same update. MirroredStrategy in TensorFlow and DistributedDataParallel in PyTorch implement this.
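A minimal DDP training loop in PyTorch, assuming launch via torchrun (which sets the rank environment variables); MyModel, loader, and loss_fn are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for batch, target in loader:  # loader yields this rank's shard
    optimizer.zero_grad()
    loss = loss_fn(model(batch.cuda(local_rank)), target.cuda(local_rank))
    loss.backward()   # gradients are AllReduced across ranks during backward
    optimizer.step()  # every rank applies the identical, averaged update
```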
The AllReduce communication cost scales with the number of parameters. For small-to-medium models (< 1B parameters), the communication overhead is negligible relative to compute. For very large models, communication becomes the bottleneck.
Asynchronous data parallelism (Parameter Server): workers push gradients to a central parameter server, which applies updates and sends back the updated parameters. Workers do not wait for each other, which raises throughput but introduces “stale gradients”: a worker computes its update against parameters that other workers may have modified since it started its step. ParameterServerStrategy in TensorFlow implements this. It is generally used when the network is slow or worker speeds are heterogeneous.
Model Parallelism
When the model itself is too large to fit on a single device, partition the model across devices:
Pipeline parallelism: split the model into sequential stages, each on a different device. Data flows through stages in a pipeline — while stage 2 processes batch $k$, stage 1 processes batch $k+1$. This creates a pipeline bubble at the start and end of each step, so efficiency depends on the number of microbatches.
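To make that dependence concrete: in a GPipe-style schedule with $p$ stages and $m$ microbatches per step, a step occupies $m + p - 1$ microbatch slots, of which $p - 1$ are idle bubble, so

$$\text{bubble fraction} = \frac{p - 1}{m + p - 1},$$

which shrinks toward zero as $m \gg p$. This is why pipeline parallelism wants many small microbatches per step.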
Tensor parallelism: split individual layers across devices. A matrix multiplication $W x$ can be split such that each device holds a partition of $W$ and computes its part of the output in parallel. Used in Megatron-LM and similar large-scale training frameworks.
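A toy single-process illustration of the column split (real frameworks shard across GPUs and replace the final concatenation with a collective such as all-gather):

```python
import torch

d_in, d_out, n_dev = 512, 1024, 4
x = torch.randn(8, d_in)      # a batch of activations
W = torch.randn(d_in, d_out)

# Split W column-wise into one shard per "device".
shards = W.chunk(n_dev, dim=1)

# Each device computes its slice of the output independently;
# no communication is needed until the slices are reassembled.
partials = [x @ w for w in shards]
y = torch.cat(partials, dim=1)

assert torch.allclose(y, x @ W, atol=1e-5)
```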
The Compute Efficiency Stack
Each level of the stack offers a largely independent speedup, and the gains compound:
| Level | Technique | Speedup |
|---|---|---|
| Framework | TensorFlow/PyTorch JIT compilation (tf.function, torch.compile) | 1.2–3× |
| Precision | Mixed precision (FP16 forward/backward, FP32 master weights) | 2–3× on modern GPUs |
| Hardware | GPU vs. CPU for dense matrix ops | 10–100× |
| Parallelism | Data parallel across $N$ GPUs | ~$N$× (minus communication) |
| Architecture | Efficient attention (FlashAttention) vs. naive | 2–4× for transformer models |
Mixed precision training is almost always worth enabling: FP16 operations on modern GPUs use specialized Tensor Cores that are 2–3× faster than FP32, with negligible accuracy loss when a loss scaler is used to prevent gradient underflow.
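A sketch of the standard PyTorch recipe (model, loader, loss_fn, and optimizer are assumed to exist):

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling

for batch, target in loader:
    optimizer.zero_grad()
    # Forward and backward run in FP16 where safe; master weights stay FP32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(batch), target)
    scaler.scale(loss).backward()  # scale up to avoid FP16 gradient underflow
    scaler.step(optimizer)         # unscales, skips the step on inf/NaN grads
    scaler.update()                # adapts the scale factor over time
```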
Scaling Inference: Throughput, Latency, and Cost
Training and serving have different constraints. Training optimizes for throughput (maximize GPU utilization). Serving optimizes for latency (minimize response time) and cost (minimize compute per request).
Batching: inference throughput scales with batch size up to the GPU memory limit. Serving systems batch concurrent requests together: requests from different users are grouped into a batch of $N$ and processed simultaneously. The tradeoff: larger batches increase throughput but add latency, because the earliest request in a batch waits for the batch to fill.
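A toy dynamic batcher showing the two tradeoff knobs (the names and the asyncio design are illustrative, not any particular serving framework's API):

```python
import asyncio

MAX_BATCH = 32   # throughput knob: larger batches use the GPU better
MAX_WAIT = 0.01  # latency knob: max seconds the first request waits

queue: asyncio.Queue = asyncio.Queue()

async def handle_request(x):
    """Called per incoming request; resolves once its batch is processed."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def batch_worker(model):
    """Drains the queue into batches and runs one forward pass per batch."""
    while True:
        x, fut = await queue.get()  # block until the first request arrives
        inputs, futures = [x], [fut]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT
        while len(inputs) < MAX_BATCH:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                x, fut = await asyncio.wait_for(queue.get(), timeout)
            except asyncio.TimeoutError:
                break
            inputs.append(x)
            futures.append(fut)
        for fut, y in zip(futures, model(inputs)):  # one batched inference
            fut.set_result(y)
```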
Model quantization: convert FP32 weights to INT8 or INT4. Modern quantization (GPTQ, AWQ for LLMs; TFLite quantization for CNNs) achieves near-FP32 accuracy at 2–4× smaller model size and faster inference. Essential for mobile deployment.
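For example, post-training dynamic quantization in PyTorch (the toy model here is a placeholder):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Store Linear weights as INT8; dequantize on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
# Roughly 4x smaller Linear weights; accuracy should still be validated
# against the FP32 baseline before deployment.
```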
Model distillation: train a smaller “student” model to mimic a larger “teacher” model’s outputs. The student is faster to serve while retaining much of the teacher’s accuracy — particularly effective when the student is trained on the teacher’s soft probability outputs rather than hard labels.
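The soft-target objective is typically a temperature-scaled KL divergence blended with the hard-label loss; a sketch following the standard Hinton et al. formulation:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps soft-target gradients comparable in magnitude
    hard = F.cross_entropy(student_logits, labels)  # ordinary hard-label loss
    return alpha * soft + (1 - alpha) * hard
```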
Caching: for inputs with repeated patterns (common queries, cached embeddings), KV caching (in transformers) and embedding caches reduce redundant computation significantly.
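A toy embedding cache (encode is a placeholder for the real model call):

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def cached_embedding(text: str) -> tuple:
    # Repeated inputs hit the cache instead of recomputing the embedding;
    # returned as a tuple so callers cannot mutate the cached value.
    return tuple(encode(text))
```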
Blue Yonder Case: 5TB Supply Chain Pipelines
At Blue Yonder, the supply chain forecasting pipelines processed 5TB+ of demand history across thousands of SKUs and distribution centers. The scaling challenges were:
- Feature preprocessing: computing rolling statistics (28-day moving averages, lag features, seasonality indices) over the full history required distributed Apache Beam jobs on GCP Dataflow
- Training throughput: XGBoost and LightGBM training on partitioned datasets (one model per product category or location cluster) was parallelized across workers using Kubeflow pipelines
- Serving latency: demand forecasts needed to update daily before replenishment planning — the full pipeline (data ingestion → feature computation → inference → output export) had a hard wall-clock deadline
The key lesson: at scale, the bottleneck is almost never the model training step — it is the data pipeline (I/O, preprocessing, shuffling) and the orchestration overhead between steps.