TL;DR
- Runs forward + backward on multiple micro-batches, accumulating gradients in place, and calls the optimiser only once every N micro-batches, which is equivalent to training with an N× larger batch.
- The standard technique for hitting large global batch sizes when per-device memory is the binding constraint.
- Composes cleanly with DDP, FSDP, ZeRO, and pipeline parallelism — almost always the right knob to turn when you want a bigger effective batch.
Overview
Large-batch training generally improves hardware utilisation and can improve final quality. But VRAM is finite: a forward pass over 2,048-token sequences through a 70B-parameter model already strains most GPUs at small batch sizes. Gradient accumulation closes the gap: run forward + backward on a per-device micro-batch that fits, let the gradients accumulate in `.grad`, repeat N times, then call `optimizer.step()`. Provided each micro-batch loss is scaled by 1/N and no layer depends on batch statistics, the result is mathematically equivalent to training at N× the per-device batch size.
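The equivalence can be written out as a one-line derivation. The symbols are introduced here for illustration and are not from the original entry: $b$ is the micro-batch size, $N$ the number of accumulation steps, $\ell_i(\theta)$ the per-example loss, and $B_k$ the set of examples in the $k$-th micro-batch. The sketch assumes equal-sized micro-batches, a mean-reduced loss, and parameters $\theta$ held fixed across the $N$ micro-steps.

```latex
\nabla_\theta \left( \frac{1}{Nb} \sum_{i=1}^{Nb} \ell_i(\theta) \right)
  = \frac{1}{N} \sum_{k=1}^{N} \nabla_\theta \left( \frac{1}{b} \sum_{i \in B_k} \ell_i(\theta) \right)
```

The left side is the gradient of the large-batch loss; the right side is what `.grad` holds after N backward passes when each micro-batch loss is divided by N.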
Mechanism
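In PyTorch, the pattern is: call `loss.backward()` on every micro-batch without calling `optimizer.zero_grad()` in between, suppress DDP's per-backward AllReduce on all but the last micro-step by running them inside `model.no_sync()`, and call `optimizer.step()` followed by `optimizer.zero_grad()` only on the Nth micro-step. Frameworks such as HuggingFace Trainer, Accelerate, and Lightning expose this as a single configuration parameter (`gradient_accumulation_steps` in Trainer and Accelerate, `accumulate_grad_batches` in Lightning) and handle the `no_sync()` bookkeeping internally.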
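The pattern described above might look like the following in raw PyTorch. This is a minimal sketch, not a reference implementation: the function name `train_with_accumulation`, the MSE loss, and the `hasattr` fallback for models that are not DDP-wrapped are all assumptions made for illustration.

```python
import contextlib

import torch
from torch import nn


def train_with_accumulation(model, optimizer, batches, accum_steps):
    """Step the optimiser once per `accum_steps` micro-batches.

    Returns the number of optimiser steps taken. Micro-batches left over
    after the last full group are accumulated but never applied.
    """
    optimizer.zero_grad()
    steps_taken = 0
    for i, (x, y) in enumerate(batches):
        is_sync_step = (i + 1) % accum_steps == 0
        # DDP-wrapped models expose no_sync(); a plain nn.Module does not,
        # so fall back to a no-op context for single-process runs.
        if hasattr(model, "no_sync") and not is_sync_step:
            ctx = model.no_sync()
        else:
            ctx = contextlib.nullcontext()
        with ctx:  # forward and backward both run inside the context
            loss = nn.functional.mse_loss(model(x), y)
            # Scale so the accumulated gradient is the mean over micro-batches.
            (loss / accum_steps).backward()
        if is_sync_step:
            optimizer.step()
            optimizer.zero_grad()
            steps_taken += 1
    return steps_taken
```

With a plain `nn.Linear` and four micro-batches of equal size, one call with `accum_steps=4` should leave the weights where a single SGD step on the concatenated batch would, up to floating-point error.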
Inside pipeline parallelism, gradient accumulation is the same thing as the number of pipeline micro-batches (M in the pipeline-parallelism entry). Increasing M shrinks the pipeline bubble, with diminishing returns once M is much larger than the number of stages; it is the standard knob for trading wall-clock time against memory pressure.
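To make the effect of M on the bubble concrete, the sketch below uses the idle fraction commonly quoted for a GPipe-style schedule, (p - 1) / (M + p - 1) for p stages. Both that schedule and the function name `bubble_fraction` are assumptions here; the entry itself does not specify a schedule, and other schedules have different bubbles.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a GPipe-style pipeline with equal-cost micro-batches."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# Four stages, increasing numbers of micro-batches.
for m in (1, 4, 16, 64):
    print(m, round(bubble_fraction(4, m), 3))
# 1 0.75
# 4 0.429
# 16 0.158
# 64 0.045
```

Under this assumption the bubble falls quickly at first and then flattens, which is why M is usually set to a small multiple of the stage count rather than pushed as high as possible.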
Performance Characteristics
- Compute: N micro-steps cost ~N × one micro-batch forward+backward, so for a fixed micro-batch size the wall time per optimiser step grows roughly linearly with N; time per sample is roughly unchanged.
- Memory: peak memory governed by micro-batch size, not global batch size — the whole point.
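- Communication: with DDP under `no_sync()`, only the Nth micro-step triggers the gradient AllReduce, so gradient communication per optimiser step is fixed regardless of N. Sharded schemes such as FSDP still need to gather parameters on each micro-step.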
When to Use
Use gradient accumulation whenever the global batch you want is larger than what fits on the device. It is essentially free to enable and composes with every other parallelism strategy. The only real choice is N — small enough to not waste wall-clock, large enough to hit the target global batch.
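Choosing N usually comes down to arithmetic: the global batch is the micro-batch size times N times the number of data-parallel replicas. The helper below is a hypothetical sketch of that calculation; the name `accumulation_steps` and the `world_size` parameter are illustrative and not part of any framework.

```python
def accumulation_steps(global_batch: int, micro_batch: int, world_size: int = 1) -> int:
    """Return N such that micro_batch * world_size * N == global_batch."""
    samples_per_micro_step = micro_batch * world_size
    if global_batch % samples_per_micro_step != 0:
        raise ValueError(
            f"global_batch={global_batch} is not a multiple of "
            f"micro_batch * world_size = {samples_per_micro_step}"
        )
    return global_batch // samples_per_micro_step


# Target 1,024 samples per optimiser step, 4 per device, 32 devices.
print(accumulation_steps(1024, micro_batch=4, world_size=32))  # 8
```

Raising an error on a non-divisible target is one design choice; rounding N and accepting a slightly different global batch is another.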
Pitfalls
- The loss must be averaged across micro-steps, not summed: divide each micro-batch loss by N before `backward()`, or scale the learning rate to compensate.
- Batch-norm statistics are computed per micro-batch, not per global batch. SyncBatchNorm widens the statistics across devices but not across micro-steps; batch-independent layers such as LayerNorm avoid the issue.
- Forgetting `no_sync()` in DDP causes an AllReduce on every micro-step instead of every Nth, wasting interconnect bandwidth.
- Learning-rate schedules count optimiser steps, not micro-steps: size warmup and total step counts in optimiser steps (micro-steps divided by N).
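The first pitfall is easy to check numerically. The framework-free sketch below uses a hypothetical one-parameter least-squares model (`grad_mse` and the toy data are illustrative only) to compare the full-batch gradient with the summed and averaged micro-batch gradients.

```python
def grad_mse(w, xs, ys):
    """Derivative with respect to w of mean((w * x - y) ** 2)."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Two equal-sized micro-batches, so N = 2.
micro_batches = [(xs[0:2], ys[0:2]), (xs[2:4], ys[2:4])]
n = len(micro_batches)

full = grad_mse(w, xs, ys)
summed = sum(grad_mse(w, bx, by) for bx, by in micro_batches)
averaged = sum(grad_mse(w, bx, by) / n for bx, by in micro_batches)

print(full, summed, averaged)  # -22.5 -45.0 -22.5
```

Summing without the 1/N factor gives a gradient N times too large, which has the same effect on plain SGD as multiplying the learning rate by N.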
Software
- Built into every modern training framework — HuggingFace Trainer (`gradient_accumulation_steps`), Accelerate, Lightning, FSDP, DeepSpeed, Megatron-LM.
- PyTorch DDP `no_sync()` context for raw implementations.
- Pipeline-parallel frameworks treat accumulation as 'micro-batches per pipeline fill'.
References
- PyTorch gradient accumulation tutorial · PyTorch
- HuggingFace Accelerate gradient accumulation · HuggingFace