
Distributed Training in TensorFlow: MirroredStrategy vs. ParameterServerStrategy

A practical guide to TensorFlow's distribution strategies — how each works, when to use MirroredStrategy vs. ParameterServerStrategy, and the tradeoffs that determine which is faster.

Why Distribution Strategies Matter

A single GPU has a fixed memory ceiling and compute budget. When model training exceeds these limits — whether due to dataset size, model size, or time constraints — distributed training becomes necessary. TensorFlow’s tf.distribute API provides first-class abstractions for distributing work across GPUs and machines without rewriting training code.

Three strategies cover most use cases: MirroredStrategy for single-machine multi-GPU, MultiWorkerMirroredStrategy for multi-machine multi-GPU, and ParameterServerStrategy for asynchronous distributed training.

MirroredStrategy: Synchronous Multi-GPU on One Machine

How It Works

MirroredStrategy trains on multiple GPUs on a single machine using synchronous data parallelism:

  1. Initialization: model variables are created and replicated on every GPU — each GPU holds an identical copy of the model
  2. Forward pass: each GPU receives a different shard of the batch and computes its forward pass independently, in parallel
  3. Backward pass: each GPU computes gradients for its shard independently
  4. AllReduce: gradients are aggregated across all GPUs using all-reduce algorithms (NCCL on NVIDIA GPUs), producing a single combined gradient
  5. Update: all GPUs update their local copy with the combined gradient — they remain identical after each step

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = build_model()
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Batch size is automatically divided across GPUs
model.fit(dataset, epochs=10)
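
For finer-grained control than model.fit, the same synchronous steps can be written as a custom training loop that makes the all-reduce explicit. The sketch below is illustrative rather than code from this article; build_model, dataset, and GLOBAL_BATCH_SIZE are assumed placeholders:

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = build_model()
    optimizer = tf.keras.optimizers.Adam()
    # Per-example losses, so we can average over the global batch ourselves
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)

@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        features, labels = inputs
        with tf.GradientTape() as tape:
            logits = model(features, training=True)
            per_example_loss = loss_fn(labels, logits)
            # Average over the global batch, not just the per-replica shard
            loss = tf.nn.compute_average_loss(per_example_loss,
                                              global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        # Gradients are aggregated across replicas when they are applied
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    per_replica_losses = strategy.run(step_fn, args=(dist_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

dist_dataset = strategy.experimental_distribute_dataset(dataset)
for batch in dist_dataset:
    train_step(batch)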

The effective batch size equals batch_size_per_gpu × num_gpus. Adjust the learning rate accordingly (linear scaling rule: multiply LR by the number of GPUs when scaling batch size).
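
For instance, a minimal sketch of the scaling rule (base_lr below is an assumed learning rate tuned for a single GPU):

base_lr = 1e-3  # assumed single-GPU learning rate
scaled_lr = base_lr * strategy.num_replicas_in_sync

with strategy.scope():
    model = build_model()
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=scaled_lr),
                  loss='sparse_categorical_crossentropy')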

Benefits and Limitations

Benefits:

  - Minimal code changes: existing Keras models work inside strategy.scope() without modifying the training loop
  - Synchronous updates keep every replica identical, so convergence behaves like single-GPU training with a larger batch
  - Efficient NCCL all-reduce on NVIDIA GPUs keeps communication overhead low for moderately sized models

Limitations:

  - Confined to one machine, so scale is capped by the number of local GPUs and host memory
  - All-reduce communication grows with model size and can come to dominate step time
  - Every step runs at the pace of the slowest GPU, and there is no built-in fault tolerance

When AllReduce Becomes a Bottleneck

The threshold depends on hardware, but as a rule: when training time starts being dominated by inter-GPU communication rather than compute, consider ParameterServerStrategy or model parallelism. On modern NVLink interconnects (A100, H100), this threshold is much higher than on PCIe connections.
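
As a rough, illustrative estimate (the figures below are assumptions, not measurements): a ring all-reduce moves roughly twice the gradient payload per GPU per step, so communication time can be approximated from model size and link bandwidth.

num_params = 1_000_000_000         # assumed 1B-parameter model
payload_gb = num_params * 4 / 1e9  # fp32 gradients: ~4 GB per step

ring_factor = 2                    # ring all-reduce sends ~2x the payload per GPU
pcie_gbps, nvlink_gbps = 16, 300   # assumed effective bandwidths in GB/s

print(f"PCIe:   ~{ring_factor * payload_gb / pcie_gbps:.2f} s of communication per step")
print(f"NVLink: ~{ring_factor * payload_gb / nvlink_gbps:.3f} s of communication per step")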


ParameterServerStrategy: Asynchronous Multi-Worker Training

How It Works

ParameterServerStrategy uses a parameter server architecture: dedicated parameter-server tasks hold the model variables, while worker tasks perform the computation.

  1. Initialization: model variables are placed on parameter servers; each worker receives a copy
  2. Forward pass: each worker processes its own data shard using its local parameters
  3. Backward pass: each worker computes gradients independently
  4. Update: workers push their gradients to the parameter servers asynchronously; parameter servers apply updates immediately without waiting for other workers
  5. Distribute: workers pull the updated parameters from parameter servers before the next step

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.ParameterServerStrategy(cluster_resolver)

# The coordinator runs on one task and dispatches steps to remote workers
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)

with strategy.scope():
    model = build_model()

@tf.function
def per_worker_step(iterator):
    return strategy.run(train_step, args=(next(iterator),))

# dataset_fn returns a tf.data.Dataset and is invoked on each worker
per_worker_dataset = coordinator.create_per_worker_dataset(dataset_fn)
per_worker_iterator = iter(per_worker_dataset)

for epoch in range(num_epochs):
    for _ in range(steps_per_epoch):
        coordinator.schedule(per_worker_step, args=(per_worker_iterator,))
    coordinator.join()  # block until all scheduled steps have finished

Benefits and Limitations

Benefits:

  - Asynchronous updates mean no worker waits for another, so slow, heterogeneous, or preempted workers do not stall training
  - Scales to large numbers of workers and handles worker failures gracefully: a restarted worker simply pulls the current parameters and resumes
  - Large embedding tables can be sharded across parameter servers instead of being replicated on every device

Limitations:

  - Stale gradients: a worker may push gradients computed against parameters that have since been updated, which can slow or destabilize convergence
  - Parameter servers can become a network bottleneck for large dense models
  - More moving parts to provision and configure (coordinator, workers, parameter servers, cluster topology)


MultiWorkerMirroredStrategy: Synchronous Multi-Machine Multi-GPU

The logical extension of MirroredStrategy to multiple machines. Each machine mirrors the variables across its local GPUs, and gradients are all-reduced across machines as well (NCCL for GPU-to-GPU communication over high-speed interconnects, or a ring all-reduce over gRPC on slower networks).

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = build_model()
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

model.fit(dataset, epochs=10)

Environment variable TF_CONFIG must be set on each worker to define the cluster topology.
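
For reference, a minimal TF_CONFIG for a two-worker cluster might look like the following (hostnames and ports are placeholders):

import json
import os

# Worker 0's view of a two-worker cluster; worker 1 would set "index": 1
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["worker0.example.com:12345", "worker1.example.com:12345"]
    },
    "task": {"type": "worker", "index": 0}
})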

Best for: training large models on a cluster of GPU machines with high-bandwidth interconnects (e.g., InfiniBand or RoCE). The synchronous updates preserve the convergence behavior of single-GPU training, aside from the larger effective batch size.


Choosing the Right Strategy

Scenario                                                                 | Strategy
Multiple GPUs, single machine                                            | MirroredStrategy
Multiple machines, homogeneous hardware, fast interconnect               | MultiWorkerMirroredStrategy
Many workers, heterogeneous hardware, or asynchronous updates acceptable | ParameterServerStrategy
Model too large for one GPU                                              | Model parallelism (pipeline or tensor) — not covered by these strategies

MirroredStrategy vs. ParameterServerStrategy: When Is Each Faster?

MirroredStrategy is typically faster when:

  - all GPUs sit in a single machine with a fast interconnect (NVLink/NVSwitch), so all-reduce is cheap
  - the model is dense and compute-heavy, so communication is small relative to per-step computation
  - hardware is homogeneous and steps take similar time on every replica, so no GPU becomes a straggler

ParameterServerStrategy is typically faster when:

  - workers are numerous, heterogeneous, or preemptible, so synchronizing on the slowest worker every step would waste time
  - network bandwidth between workers is limited, making synchronous all-reduce expensive
  - the model is dominated by large, sparsely updated embedding tables that shard naturally across parameter servers

The practical test: run both on your hardware and compare wall-clock time per epoch and final convergence quality. The theoretical analysis does not always match hardware-specific performance.


Production Notes

Fault tolerance: MirroredStrategy has no built-in fault tolerance — one GPU failure kills the run. ParameterServerStrategy handles worker failures gracefully (a worker that fails is restarted and pulls current parameters). For long training runs on cloud hardware with preemption, this makes ParameterServerStrategy more resilient.
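
For Keras training with the synchronous strategies, one common mitigation (part of the standard TF API, though not discussed above) is the tf.keras.callbacks.BackupAndRestore callback, which checkpoints training state so an interrupted run can resume from the last completed epoch; the backup directory below is a placeholder:

backup_cb = tf.keras.callbacks.BackupAndRestore(backup_dir="/tmp/train_backup")
model.fit(dataset, epochs=10, callbacks=[backup_cb])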

Batch size scaling: doubling the number of workers with MirroredStrategy doubles the effective batch size. This typically requires a corresponding increase in learning rate (linear scaling rule) and may require learning rate warmup for stability.
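
A minimal warmup schedule might look like the sketch below; warmup_steps and the base learning rate are assumptions to tune per workload:

class LinearWarmup(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Ramps the learning rate linearly from zero to target_lr over warmup_steps."""

    def __init__(self, target_lr, warmup_steps):
        self.target_lr = target_lr
        self.warmup_steps = float(warmup_steps)

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.minimum(self.target_lr * step / self.warmup_steps, self.target_lr)

# Target LR scaled per the linear scaling rule, from an assumed base LR of 1e-3
target_lr = 1e-3 * strategy.num_replicas_in_sync
optimizer = tf.keras.optimizers.Adam(learning_rate=LinearWarmup(target_lr, warmup_steps=1000))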

tf.function and distribution: always wrap the train step in @tf.function when using distribution strategies. The step is traced and compiled into a graph; running the loop eagerly with distribution strategies is significantly slower.
