Data Parallelism (DP)

TL;DR

Each worker holds a full replica of the model and processes a different micro-batch of data; gradients are averaged across workers via AllReduce before the optimiser step.
Communication volume is proportional to model size, not batch size, so DP scales well for small-to-medium models but becomes interconnect-bound once the per-step gradient exchange takes longer than the backward pass can hide.
Pure DP is the building block underneath every modern distributed-training stack; ZeRO and FSDP are sharded variants, DDP is the canonical PyTorch implementation.

Overview

Data parallelism is the oldest and most widely used distributed training pattern. Each worker (typically one GPU) keeps a full copy of the model, the optimiser state, and the gradients. The global batch is split into per-worker micro-batches; each worker runs a forward and backward pass independently; then an AllReduce synchronises the gradients so every replica applies the same update and ends the step with identical weights.
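The step described above can be sketched in plain Python for a toy 1-D model. This is a minimal simulation, not a real multi-GPU program: the helper names (`shard`, `local_gradient`, `data_parallel_step`), the strided split, and the data are all hypothetical, and the averaging line stands in for what a mean AllReduce would produce.

```python
def local_gradient(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def shard(batch, num_workers):
    """Strided split (an assumption): worker r gets items r, r + N, r + 2N, ..."""
    return [batch[r::num_workers] for r in range(num_workers)]


def data_parallel_step(w, global_batch, num_workers, lr):
    shards = shard(global_batch, num_workers)
    # Each worker computes a gradient on its own micro-batch, independently.
    grads = [local_gradient(w, s) for s in shards]
    # Stand-in for AllReduce (mean): every replica ends up with this value.
    mean_grad = sum(grads) / num_workers
    # Every replica applies the same update, so the weights stay identical.
    return w - lr * mean_grad


# Eight samples of y = 3x, split evenly across four simulated workers.
global_batch = [(float(x), 3.0 * x) for x in range(1, 9)]
w_dp = data_parallel_step(0.0, global_batch, num_workers=4, lr=0.01)

# Reference: one worker processing the whole global batch.
w_single = 0.0 - 0.01 * local_gradient(0.0, global_batch)
```

With equal-sized micro-batches the two results agree up to floating-point rounding. If the shards had different sizes, a plain mean of per-worker gradients would no longer match the full-batch gradient.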

Conceptually it is the easiest parallelism to reason about — the loss surface is the single-GPU loss surface, just with a larger effective batch — which is why every higher-order strategy (3D parallelism, ZeRO, FSDP) layers on top of it rather than replacing it.

Mechanism

The canonical implementation is PyTorch DistributedDataParallel (DDP). DDP hooks into the autograd engine: as each parameter's `.grad` is produced during the backward pass, it is placed into a bucket, and once a bucket is full an asynchronous NCCL AllReduce is launched for it in the background. By the time the backward pass finishes, most of the communication has already overlapped with computation and only the tail remains exposed.
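As a rough illustration of the bucketing described above, the sketch below groups parameters into size-capped buckets in reverse registration order, since gradients for the last layers become ready first during the backward pass. The function name `assign_buckets`, the layer names, and the sizes are hypothetical. The 25 MB cap mirrors DDP's documented default for `bucket_cap_mb`; real DDP applies further rules that this simplified version ignores.

```python
def assign_buckets(param_sizes, bucket_cap):
    """Greedy, simplified bucket assignment.

    param_sizes: dict of name -> gradient size, in registration (forward) order.
    bucket_cap:  maximum total size per bucket, in the same unit.
    Returns a list of buckets in the order their AllReduce would be launched.
    """
    buckets, current, current_size = [], [], 0
    # Walk parameters back to front: the backward pass reaches them in this order.
    for name, size in reversed(list(param_sizes.items())):
        # Close the current bucket if adding this gradient would exceed the cap.
        if current and current_size + size > bucket_cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


# Hypothetical gradient sizes in MB, listed in forward order.
sizes_mb = {"embed": 40, "layer1": 10, "layer2": 10, "head": 8}
buckets = assign_buckets(sizes_mb, bucket_cap=25)
```

In this toy case the `head` and `layer2` gradients share the first bucket, so their AllReduce can start while the backward pass is still working through `layer1` and `embed`. A parameter larger than the cap simply gets a bucket of its own.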

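Mathematically, the AllReduce computes the mean gradient across N workers: g = (1/N) Σᵢ gᵢ, where gᵢ is the gradient computed by worker i. Provided every worker processes the same number of samples and batch-norm statistics are handled correctly, this is exactly equivalent to a single GPU training on N× the batch size. The learning rate then has to be adapted to the larger batch; the usual recipe is the linear-scaling rule (multiply the learning rate by the same factor as the batch size), popularised by the 2017 Facebook AI ImageNet-in-1-hour paper.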
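The linear-scaling rule mentioned above is simple arithmetic, sketched here together with the gradual warmup that Goyal et al. pair it with. The function names, the `warmup_steps` value, and the linear ramp shape are illustrative assumptions, not a prescribed schedule.

```python
def scaled_lr(base_lr, base_batch, global_batch):
    """Linear-scaling rule: grow the learning rate by the same factor as the batch."""
    return base_lr * global_batch / base_batch


def warmup_lr(step, warmup_steps, base_lr, target_lr):
    """Linear ramp from base_lr to target_lr over warmup_steps, then hold."""
    if step >= warmup_steps:
        return target_lr
    return base_lr + (target_lr - base_lr) * step / warmup_steps


# Reference setup: lr 0.1 at batch 256, scaled to a global batch of 8192 (32x).
target = scaled_lr(base_lr=0.1, base_batch=256, global_batch=8192)

# Hypothetical 100-step warmup towards the scaled learning rate.
schedule = [warmup_lr(s, 100, 0.1, target) for s in range(0, 150, 50)]
```

The warmup exists because jumping straight to the scaled learning rate tends to be unstable early in training, when the weights are changing quickly.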

Performance Characteristics

Per-step communication volume is roughly 2 × P bytes per worker, where P is the size of the gradients in bytes (parameter count × bytes per element). The factor of 2 comes from the ring-AllReduce algorithm, which runs a reduce-scatter followed by an all-gather, each moving (N − 1)/N × P bytes per worker. For a 7B BF16 model (14 GB of gradients) this is ~28 GB per worker per step; for a 70B model it is ~280 GB per worker per step, which is why pure DP becomes interconnect-bound well before reaching frontier model sizes.
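To make the factor of 2 concrete, here is a small pure-Python simulation of a ring AllReduce (sum) that counts how many elements each worker sends. It is a teaching sketch under simplifying assumptions: the vector length must divide evenly by the worker count, all sends in a step happen simultaneously, and the names `ring_allreduce` and `ring_bytes_per_worker` are hypothetical rather than any library's API.

```python
def ring_allreduce(vectors):
    """Simulate a sum ring-AllReduce; return (results, elements_sent_per_worker)."""
    n = len(vectors)
    length = len(vectors[0])
    assert length % n == 0, "sketch assumes the vector splits evenly into n chunks"
    chunk = length // n
    data = [list(v) for v in vectors]  # copy so the inputs are left untouched
    sent = [0] * n

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n - 1 steps worker r owns the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        msgs = [(r, (r - step) % n) for r in range(n)]
        payloads = [data[r][span(c)] for r, c in msgs]  # snapshot before applying
        for (r, c), payload in zip(msgs, payloads):
            dst = (r + 1) % n
            for i, value in enumerate(payload):
                data[dst][c * chunk + i] += value
            sent[r] += len(payload)

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        msgs = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [data[r][span(c)] for r, c in msgs]
        for (r, c), payload in zip(msgs, payloads):
            data[(r + 1) % n][span(c)] = payload
            sent[r] += len(payload)

    return data, sent


def ring_bytes_per_worker(grad_bytes, num_workers):
    """Closed form for the count above: 2 * (N - 1) / N * gradient size."""
    return 2 * (num_workers - 1) / num_workers * grad_bytes


# Four simulated workers, eight elements each.
results, sent = ring_allreduce([[10 * r + i for i in range(8)] for r in range(4)])
```

Each worker sends 2 × (N − 1) chunks of size P/N, which approaches 2 × P as N grows. Dividing the summed result by N gives the mean gradient.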

Compute scaling: linear in workers, provided the global batch can be increased proportionally.
Memory scaling: zero — every worker still holds the full model, so DP does not help you fit larger models.
Communication: bandwidth-bound on slow fabrics, latency-bound at very large worker counts.
Sweet spot: models small enough that weights, gradients, and optimiser state fit on one GPU with room left for activations (<7B parameters in BF16 on 80 GB).
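The memory point in the list above can be put into numbers with a back-of-the-envelope estimate. The 16 bytes per parameter used here is an assumption: it corresponds to a common mixed-precision Adam layout (2 bytes for BF16 weights, 2 for BF16 gradients, 12 for FP32 master weights and the two Adam moments) and excludes activations. The function names are hypothetical, and the sharded figure is an idealised comparison, not a measurement of FSDP or ZeRO.

```python
def state_gb_pure_dp(num_params, num_workers, bytes_per_param=16):
    """Per-worker model state under pure DP, in GB.

    num_workers is deliberately unused: every replica holds the full state,
    so adding workers does not reduce per-worker memory.
    """
    return num_params * bytes_per_param / 1e9


def state_gb_fully_sharded(num_params, num_workers, bytes_per_param=16):
    """Idealised per-worker state when everything is sharded evenly (assumption)."""
    return num_params * bytes_per_param / 1e9 / num_workers


# A hypothetical 1B-parameter model on 1, 8 and 64 workers.
for n in (1, 8, 64):
    dp = state_gb_pure_dp(1e9, n)
    sharded = state_gb_fully_sharded(1e9, n)
    print(f"{n:>2} workers: pure DP {dp:.1f} GB, fully sharded {sharded:.2f} GB")
```

Under these assumptions the pure-DP figure stays flat as workers are added, while the sharded figure falls with the worker count.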

When to Use

Use pure DDP when the model and its activations comfortably fit on one GPU and you only need to throw more hardware at training to get through data faster. For anything larger — or any setting where optimiser-state memory dominates — move to FSDP or DeepSpeed ZeRO, which preserve the DP programming model while sharding state across workers.

Pitfalls

Batch-norm layers compute their statistics over each worker's local micro-batch, so they no longer match the global batch unless SyncBatchNorm is used; LayerNorm (the LLM default) is unaffected.
The linear-scaling rule breaks down past a critical batch size: too large a global batch hurts generalisation and forces learning-rate workarounds such as long warmups.
Gradient accumulation inside DDP needs the `no_sync()` context manager around every micro-step except the last; otherwise each backward pass triggers an AllReduce.
On InfiniBand fabrics, NCCL AllReduce is sensitive to topology; misconfigured rail-optimised routing can halve effective bandwidth.
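The gradient-accumulation pitfall above can be sketched as a small helper. It assumes `ddp_model` is a `torch.nn.parallel.DistributedDataParallel` instance (whose `no_sync()` context manager is the only DDP-specific call used), `optimizer` is any PyTorch optimiser, and `loss_fn` returns a scalar tensor. The helper name `accumulate_and_step` and its arguments are hypothetical.

```python
import contextlib


def accumulate_and_step(ddp_model, optimizer, loss_fn, micro_batches):
    """Accumulate gradients over micro_batches with one AllReduce at the end.

    micro_batches: a list of (inputs, targets) pairs for one optimiser step.
    """
    last = len(micro_batches) - 1
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(micro_batches):
        # Skip gradient synchronisation on all but the final micro-step.
        ctx = ddp_model.no_sync() if i < last else contextlib.nullcontext()
        with ctx:
            # The forward pass sits inside the context too, not just backward.
            loss = loss_fn(ddp_model(inputs), targets)
            # Scale so the accumulated gradient is the mean over micro-steps.
            (loss / len(micro_batches)).backward()
    # The last backward ran outside no_sync(), so gradients are now reduced.
    optimizer.step()
```

Gradients accumulated locally during the `no_sync()` micro-steps are included in the AllReduce triggered by the final backward pass, so each optimiser step costs one synchronisation regardless of how many micro-steps it spans.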

Software

PyTorch DistributedDataParallel (torch.nn.parallel.DistributedDataParallel).
Horovod: Uber's MPI-based DP framework for TensorFlow, Keras, PyTorch, and MXNet, still used in some shops.
JAX pmap / shard_map: the JAX-ecosystem equivalents.
NCCL provides the underlying ring/tree AllReduce kernels on NVIDIA hardware; RCCL is the AMD equivalent.

References

PyTorch Distributed: Experiences on Accelerating Data Parallel Training · arXiv (Li et al., 2020)
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour · arXiv (Goyal et al., 2017)
PyTorch DDP documentation · PyTorch
