Pipeline Parallelism (PP)

TL;DR

Partitions a model by layer — each pipeline stage owns a contiguous block of layers on its own GPU(s). Micro-batches stream forwards through the stages then backwards; with enough micro-batches the pipeline bubble (idle time at fill and drain) is small.
Lineage: GPipe (Huang et al., 2018, arXiv:1811.06965) introduced synchronous PP; PipeDream (Narayanan et al., 2019) introduced 1F1B (one-forward-one-backward), which caps peak activation memory at roughly one micro-batch per pipeline stage instead of one per micro-batch in the step; Megatron-LM's interleaved 1F1B (Narayanan et al., 2021, arXiv:2104.04473) further cuts the bubble; ZB1P (zero-bubble pipeline parallelism, 2024) splits backward into B and W phases to nearly eliminate it.
Tolerates lower interconnect bandwidth than TP because communication is point-to-point activation passing, not block-wide AllReduce, so InfiniBand NDR (400 Gb/s) is comfortable. Standard across-node axis in 3D parallelism (TP=8 inside node x PP across nodes x DP for the rest), used in most published 175B+ training runs.
Yobitel NeoCloud multi-rack training topology pins PP boundaries to rack-level InfiniBand uplinks so micro-batch P2P traffic stays predictable; Yobibyte's managed fine-tune service handles topology mapping automatically when a customer recipe requires PP.

Overview

Pipeline parallelism, also called inter-layer model parallelism, splits a deep model into K stages. Stage 0 holds the first contiguous chunk of layers, stage 1 the next, and so on. A mini-batch is divided into M micro-batches; each micro-batch flows forward through the pipeline (stage 0 -> stage 1 -> ... -> stage K-1), then its backward pass flows in reverse. The crucial trick is overlap: while stage 0 processes micro-batch 2, stage 1 is processing micro-batch 1, stage 2 is processing micro-batch 0, and so on. With enough micro-batches in flight, every stage stays busy for most of the step.

GPipe (2018) introduced the synchronous variant: fill the pipeline with M forwards, then drain with M backwards, then take one optimiser step. PipeDream (2019) introduced 1F1B scheduling, which interleaves one forward and one backward per stage in steady state. This cuts peak activation memory from O(M) to O(K), because each stage holds at most ~K micro-batches of activations instead of M. Megatron-LM's interleaved 1F1B (2021) chops each stage into V virtual sub-stages, further reducing the bubble. ZB1P (zero-bubble pipeline parallelism, 2024) splits the backward into input-gradient (B) and weight-gradient (W) phases, allowing W to be deferred, which nearly eliminates the bubble at the cost of more bookkeeping.

Pipeline parallelism's defining property is that communication is point-to-point activation passing between adjacent stages, not block-wide AllReduce inside every transformer block. P2P traffic is proportional to activation size at the stage boundary: for Llama 3.1 70B with hidden=8192, seq=8192, BF16, and a micro-batch size of 1, that is 8192 x 8192 x 2 bytes, or ~134 MB per micro-batch boundary (~268 MB at a micro-batch size of 2). InfiniBand NDR (50 GB/s effective) moves 134 MB in ~2.7 ms, which can be overlapped with the next forward/backward. This bandwidth tolerance is why PP is the usual across-node axis in frontier-scale 3D parallelism recipes.
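As a rough check on this arithmetic, the boundary payload and its ideal transfer time can be computed directly. This is a minimal sketch: the function names are illustrative, the micro-batch sizes are assumptions, and 50 GB/s is the effective NDR bandwidth quoted in the text.

```python
def boundary_bytes(micro_batch_size: int, seq_length: int,
                   hidden_size: int, dtype_bytes: int) -> int:
    """Bytes sent across one stage boundary for one micro-batch."""
    return micro_batch_size * seq_length * hidden_size * dtype_bytes


def transfer_ms(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds, ignoring latency and protocol overhead."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# hidden=8192, seq=8192, BF16 (2 bytes); micro-batch sizes 1 and 2 are assumed.
for mbs in (1, 2):
    size = boundary_bytes(mbs, seq_length=8192, hidden_size=8192, dtype_bytes=2)
    print(f"micro-batch size {mbs}: {size / 1e6:.0f} MB, "
          f"{transfer_ms(size, 50.0):.1f} ms at 50 GB/s")
```

With these inputs the payload is about 134 MB at micro-batch size 1 and about 268 MB at micro-batch size 2, so the quoted figure depends on which micro-batch size is assumed.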

On Yobitel NeoCloud, the multi-rack training pod topology pins PP boundaries to rack-level InfiniBand NDR uplinks (8x 400 Gb/s per rack) so micro-batch P2P traffic stays predictable, and the pre-validated Megatron-LM recipes use TP=8 inside each HGX H100/H200/B200 baseboard with PP across rack-attached nodes. Yobibyte's managed fine-tune service handles the topology mapping automatically when a customer recipe requires PP — most customer fine-tunes do not, but customers who pre-train from scratch above 70B benefit from PP without having to specify the rack layout themselves.

This entry helps you decide when pipeline parallelism is the right choice, which schedule (GPipe / 1F1B / interleaved / ZB1P) fits your workload, how to size K and M, how to compose PP with TP / DP / SP / CP / EP, and how to manage the load-balance and activation-checkpointing constraints that determine whether your pipeline runs at 50 percent of peak or 30 percent.

How it works

The pipeline bubble is the GPU time wasted at pipeline fill and drain. Under synchronous GPipe scheduling it takes up (K-1) / (M + K-1) of the total step time, where K is the stage count and M is the micro-batches per pipeline fill. With K=8 stages and M=64 micro-batches, the bubble is ~10 percent; with M=256 it is ~3 percent. Pipeline efficiency therefore rises with the number of micro-batches, which in turn requires a large global batch size. Frontier training runs typically target M >= 8 x K, which holds the bubble near 10 percent under this formula; getting to 5 percent or below needs M >= 19 x (K-1), roughly 20 x K, unless an interleaved or zero-bubble schedule is used.
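The formula is easy to tabulate when sizing K and M. The helper below is a small sketch (the function name is illustrative) that evaluates (K-1)/(M+K-1) for a few micro-batch counts at K=8.

```python
def gpipe_bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Idle fraction (K-1)/(M+K-1) of a synchronous pipeline step."""
    if num_stages < 1 or num_micro_batches < 1:
        raise ValueError("K and M must be positive")
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


# Bubble at K=8 for a small, medium and large micro-batch count.
for m in (8, 64, 256):
    print(f"K=8, M={m}: bubble = {gpipe_bubble_fraction(8, m):.1%}")
```

The same function applies to plain 1F1B, which changes memory use but not the bubble size.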

Communication is point-to-point sends/receives between adjacent stages — activations forward, gradient-of-activation backward. The volume is proportional to (micro_batch x seq_length x hidden_size x dtype_bytes), not parameter count, so PP is far less bandwidth-hungry than TP. The NCCL send/recv primitive backs this and overlaps with the next forward/backward compute when prefetch is configured.

1F1B scheduling restructures the timeline. After the initial warmup (where stage k, counting from 0, issues K-k forwards before its first backward, so the first stage runs the most and the last stage just one), the steady state alternates one forward and one backward per micro-batch on every stage. Peak activation memory is bounded by K (number of stages) instead of M (number of micro-batches), which is the qualitative difference from GPipe: for K=8, M=128, 1F1B holds at most 8 micro-batches of activations per stage versus GPipe's 128.
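The memory bound can be seen by writing out the per-stage operation order and counting live activations. This is a sketch that assumes the common Megatron-style convention in which stage k (0-indexed) runs K-k-1 warmup forwards before entering steady state; the function names are illustrative.

```python
def one_f_one_b_order(stage: int, num_stages: int, num_micro_batches: int) -> list[str]:
    """Sequence of forward (F) and backward (B) ops on one stage under 1F1B."""
    warmup = min(num_stages - stage - 1, num_micro_batches)
    steady = num_micro_batches - warmup
    ops = ["F"] * warmup            # fill: forwards only
    for _ in range(steady):
        ops += ["F", "B"]           # steady state: one forward, one backward
    ops += ["B"] * warmup           # drain: remaining backwards
    return ops


def gpipe_order(num_micro_batches: int) -> list[str]:
    """GPipe runs all forwards, then all backwards."""
    return ["F"] * num_micro_batches + ["B"] * num_micro_batches


def peak_in_flight(ops: list[str]) -> int:
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for op in ops:
        live += 1 if op == "F" else -1
        peak = max(peak, live)
    return peak


K, M = 8, 128
print([peak_in_flight(one_f_one_b_order(s, K, M)) for s in range(K)])
print(peak_in_flight(gpipe_order(M)))
```

Under this convention the first stage holds K micro-batches at its peak and the last stage holds one, while GPipe holds all M on every stage.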

Interleaved 1F1B splits each stage into V virtual sub-stages — instead of stage 0 owning layers 0-9 contiguously, it owns layers 0-4 and layers 40-44 (two virtual sub-stages of 5 layers each). The pipeline cycles through the virtual sub-stages in round-robin, which exposes more micro-batches in flight at any moment and shrinks the bubble by a factor of V at the cost of V-times more P2P sends. The right V is usually 4-8 for frontier-scale runs.
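The layer-to-stage mapping in that example (80 layers, K=8, V=2) follows from dealing model chunks to stages round-robin. The sketch below assumes the layer count divides evenly by V x K; the function name is illustrative.

```python
def interleaved_layer_assignment(num_layers: int, num_stages: int,
                                 num_virtual: int) -> dict[int, list[tuple[int, int]]]:
    """Map each physical stage to its (first_layer, last_layer) chunks."""
    chunks = num_stages * num_virtual
    if num_layers % chunks != 0:
        raise ValueError("num_layers must be divisible by num_stages * num_virtual")
    per_chunk = num_layers // chunks
    assignment: dict[int, list[tuple[int, int]]] = {s: [] for s in range(num_stages)}
    for chunk in range(chunks):
        stage = chunk % num_stages          # chunks are dealt round-robin
        first = chunk * per_chunk
        assignment[stage].append((first, first + per_chunk - 1))
    return assignment


layout = interleaved_layer_assignment(num_layers=80, num_stages=8, num_virtual=2)
print(layout[0])   # chunks owned by stage 0
print(layout[7])   # chunks owned by the last stage
```

Stage 0 ends up with layers 0-4 and 40-44, matching the example in the text.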

Zero-bubble pipeline parallelism (ZB1P, 2024) further observes that the backward pass for a layer can be split into the input-gradient phase (B, needed to keep the pipeline flowing) and the weight-gradient phase (W, only needed before the optimiser step). By deferring W and prioritising B, the bubble can be reduced to near zero. The reference implementation is built on Megatron-LM; this entry refers to the enabling option as `--zero-bubble`, but the exact flag name depends on the fork and release, so check your version's arguments. ZB1P is a strong candidate at frontier scale on H100/H200/B200.
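The B/W split can be illustrated on a single linear layer y = xW. This is a toy sketch in plain Python with list-of-lists matrices and hypothetical helper names, not the ZB1P scheduler itself: it only shows that the input gradient can be produced and sent upstream while the weight gradient is queued for later.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def backward_input(grad_out, weight):
    """B phase: dL/dx = dL/dy @ W^T. The previous stage is waiting on this."""
    return matmul(grad_out, transpose(weight))


def backward_weight(saved_input, grad_out):
    """W phase: dL/dW = x^T @ dL/dy. Only needed before the optimiser step."""
    return matmul(transpose(saved_input), grad_out)


x = [[1.0, 2.0]]              # saved forward input (1 x 2)
weight = [[3.0], [4.0]]       # layer weight (2 x 1)
grad_y = [[1.0]]              # gradient arriving from the next stage

deferred_w = []                               # W work postponed to fill idle slots
grad_x = backward_input(grad_y, weight)       # B runs immediately
deferred_w.append((x, grad_y))                # W is queued, not computed yet

# Later, before the optimiser step, the queued W work is drained.
weight_grads = [backward_weight(inp, g) for inp, g in deferred_w]
print(grad_x, weight_grads)
```

The cost of deferral is that the saved input and the output gradient must stay in memory until W runs, which is part of the extra bookkeeping the schedule has to manage.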

Across-node 3D parallelism composes PP with TP and DP: TP=8 inside one HGX baseboard handles intra-block weight sharding; PP across racks handles layer-count sharding; DP across the remaining ranks handles batch sharding. The world size factors as world = TP x PP x DP. For Llama 3.1 405B training on 4096 H100s, the canonical recipe is TP=8, PP=16, DP=32, virtual-pipeline=5, ZB1P enabled — yielding ~92 percent weak-scaling efficiency vs single-node TP=8.
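The factorisation world = TP x PP x DP can be checked mechanically. The sketch below also maps a global rank to its coordinates under one possible layout, assumed here purely for illustration: TP varies fastest (so the 8 TP ranks share a node), then DP, then PP. Real launchers may order the axes differently.

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """DP degree implied by the world size and the TP and PP degrees."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // (tp * pp)


def rank_to_coords(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
    """Return (pp_index, dp_index, tp_index) with TP fastest, then DP, then PP."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank out of range")
    tp_index = rank % tp
    dp_index = (rank // tp) % dp
    pp_index = rank // (tp * dp)
    return pp_index, dp_index, tp_index


dp = data_parallel_size(world_size=4096, tp=8, pp=16)
print(dp)
print(rank_to_coords(4095, tp=8, pp=16, dp=dp))
```

For 4096 GPUs with TP=8 and PP=16 the remaining DP degree is 32, which matches the recipe quoted above.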

GPipe schedule: M forwards then M backwards then step; bubble = (K-1)/(M+K-1), peak activation = M.
1F1B (PipeDream): one forward + one backward per micro-batch in steady state; bubble same as GPipe but activation peak = K.
Interleaved 1F1B (Megatron): V virtual sub-stages per physical stage; bubble shrinks by V at V-fold P2P send cost.
ZB1P (zero-bubble, 2024): split backward into B and W phases; defer W to nearly eliminate the bubble.
P2P traffic: activation_size per micro-batch boundary; ~134 MB for Llama 70B at 8K seq, micro-batch size 1, BF16.
Composes with TP, DP, SP, CP, EP — TP=8 intra-node, PP across racks, DP for the rest is the standard 3D recipe.

Maximise M (micro-batches per pipeline fill) before adding stages. A pipeline with K=16, M=16 is about half bubble (15/31, ~48 percent); K=8, M=64 is mostly useful work (~10 percent bubble). On a fixed GPU count, halving K doubles DP, so the two shapes carry the same global batch only if the micro-batch size shrinks to compensate. Add stages only when memory forces it, not for throughput.

Variants and architectural choices

Several PP schedules and combining strategies are used in practice. The table below summarises the variants and where each fits.

Variant	Bubble formula	Peak activation	Use case
GPipe (synchronous)	(K-1)/(M+K-1)	O(M)	Simple, high-M only; rarely used at frontier scale.
1F1B (PipeDream-Flush)	(K-1)/(M+K-1)	O(K)	The pre-2021 production default; still the simplest 1F1B.
Interleaved 1F1B (Megatron 2021)	(K-1)/(V*M+K-1)	O(K)	Frontier-scale training; V=4-8 reduces bubble materially.
ZB1P (zero-bubble, 2024)	~0 percent	O(K)	Newest schedule; available in Megatron-LM-based implementations (flag name varies by fork and release).
PipeDream-2BW (async)	0 percent	O(K)	Async 2-buffered weight; introduces staleness, rarely used at scale.
Chimera (bi-directional)	~half of GPipe	O(K)	Two pipelines in opposite directions; specialist research, not production.
Hanayo / Wave PP	Sub-K bubble	O(K)	Recent research on irregular schedules; not yet mainstream.
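To compare rows of the table numerically, the interleaved formula can be evaluated for several values of V. This is a sketch with an illustrative function name; K=16 and M=64 are assumed values chosen only to show the trend, and V=1 reduces to the GPipe / 1F1B formula.

```python
def interleaved_bubble_fraction(num_stages: int, num_micro_batches: int,
                                num_virtual: int = 1) -> float:
    """Bubble fraction (K-1)/(V*M+K-1); V=1 gives the non-interleaved case."""
    if min(num_stages, num_micro_batches, num_virtual) < 1:
        raise ValueError("K, M and V must be positive")
    k, m, v = num_stages, num_micro_batches, num_virtual
    return (k - 1) / (v * m + k - 1)


K, M = 16, 64
for v in (1, 2, 4, 8):
    print(f"V={v}: bubble = {interleaved_bubble_fraction(K, M, v):.1%}")
```

Each doubling of V roughly halves the bubble here, at the price of proportionally more P2P sends per micro-batch.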

When to use vs alternatives

Use pipeline parallelism as the across-node axis of 3D parallelism for models too large to fit in one node's worth of TP. The standard frontier-model training recipe is TP=8 inside each HGX baseboard, PP across InfiniBand-connected racks, and DP for the remaining workers, optionally wrapped in ZeRO-1 / distributed Adam for optimiser-state sharding. The Yobitel NeoCloud reference for a 175B+ pretrain is exactly this shape: TP=8 within rack, PP across racks, DP for the rest.

PP is overkill below ~30B model size — FSDP or DeepSpeed ZeRO-3 fit easily on a single 8-GPU node and avoid the bubble entirely. At 70B, the trade-off is workload-dependent: HSDP across 32 GPUs beats PP=4 x DP=8 for SFT and continued pretraining because the bubble does not amortise; but for pretraining from random weights at 1T+ tokens, PP wins because it lets you scale to thousands of GPUs without the per-layer AllGather of FSDP.

For inference, pipeline parallelism is rare. Autoregressive decoding pushes every token through all K stages in sequence, so PP does not shorten per-token latency, and the bubble cost is paid per token rather than per batch. Standard inference at 70B is TP=4-8 inside one node; pipeline serving is reserved for niche very-large-model inference (DeepSeek-V3 in particular benefits from PP+EP at serving time on multi-node fabric).

If you can run on a single 8-GPU node with TP=8 plus FSDP / ZeRO-3, do that. PP is the right answer when you have run out of single-node parallelism axes and need to scale across the InfiniBand fabric. On Yobitel NeoCloud, training pods below 32 GPUs almost never need PP; pods above 64 GPUs almost always do for 70B+ pretraining.

Trade-offs and known limitations

Layer count must be evenly divisible by stage count (and by V*K under interleaving), or some stages straggle. Llama 3.1 405B has 126 layers — not divisible by 8 or 16 cleanly; in practice the first and last stages get fewer layers to absorb the imbalance.
The first and last stages have extra work — embedding lookup, position embeddings, and loss computation. Load balance matters; Megatron's `--standalone-embedding-stage` keeps the embedding on its own micro-stage when this is the bottleneck.
Activation checkpointing inside PP stages adds recompute cost. Selective checkpointing (recompute only the largest activations) is the right default; full per-block checkpointing reruns the forward pass during backward, which adds roughly a third more compute, and rarely pays back.
Asynchronous PP (PipeDream-2BW style without flushes) introduces gradient staleness and is rarely used at frontier scale — synchronous 1F1B (or ZB1P) is the production default.
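Pipeline-stage P2P sends must overlap with compute or the bubble grows. Megatron's P2P overlap is non-optional at multi-rack scale; this entry refers to it as `--overlap-p2p-communication`, while some releases enable it by default and expose `--no-overlap-p2p-communication` to turn it off, so check the arguments for your version.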
Combining PP with expert parallelism (MoE) requires careful scheduling — the AllToAll for expert routing can deadlock if not interleaved with the pipeline send/recv.
Checkpoint format: a PP checkpoint is sharded by stage. Conversion to HuggingFace format requires gathering the stages, which is a non-trivial step at 405B scale.
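The uneven split mentioned for 126 layers can be sketched as a small partitioning helper. This is one possible heuristic, assumed here for illustration rather than taken from any framework: every stage gets the floor share, and leftover layers go to the stages farthest from the ends, so the first and last stages (which also carry embedding and loss work) stay lightest.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[int]:
    """Layers per stage, keeping the first and last stages lightest."""
    if num_stages < 1 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    counts = [base] * num_stages
    # Stages farthest from either end receive the leftover layers first.
    by_distance_from_ends = sorted(
        range(num_stages),
        key=lambda s: min(s, num_stages - 1 - s),
        reverse=True,
    )
    for stage in by_distance_from_ends[:extra]:
        counts[stage] += 1
    return counts


print(partition_layers(126, 16))
print(partition_layers(126, 8))
```

For 126 layers over 16 stages this gives 7 layers on the first and last stages and 8 on the 14 stages between them.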

Practical implementation notes

Megatron-LM is the canonical PP implementation and the reference every other stack converged toward. Flags: `--pipeline-model-parallel-size K` and `--virtual-pipeline-model-parallel-size V` for interleaved 1F1B. M is not set directly; it is derived as global_batch_size / (DP * micro_batch_size) from `--global-batch-size` and `--micro-batch-size`. The P2P-overlap and zero-bubble options are referred to in this entry as `--overlap-p2p-communication` and `--zero-bubble`, but their names vary by release and fork. NeMo wraps these with Hydra configs. DeepSpeed's `PipelineModule` exposes a similar 1F1B path that integrates with ZeRO-1; ZeRO-2 and ZeRO-3 are not compatible with its pipeline engine.
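The relation M = global_batch_size / (DP * micro_batch_size) is worth checking before launch, since a non-integer result means the batch settings are inconsistent. The helper below is a sketch with an illustrative name and hypothetical example values.

```python
def num_micro_batches(global_batch_size: int, dp: int, micro_batch_size: int) -> int:
    """Micro-batches per pipeline per step implied by the batch settings."""
    per_step = dp * micro_batch_size
    if per_step < 1 or global_batch_size % per_step != 0:
        raise ValueError(
            "global_batch_size must be a positive multiple of dp * micro_batch_size"
        )
    return global_batch_size // per_step


# Hypothetical settings: 2048 sequences per step, DP=32, micro-batch size 1.
m = num_micro_batches(global_batch_size=2048, dp=32, micro_batch_size=1)
print(m)
```

With these assumed values each pipeline sees 64 micro-batches per step, which can then be fed into the bubble formula for the chosen K.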

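PyTorch's native `torch.distributed.pipelining` (PT 2.4+) implements GPipe and 1F1B schedules, along with interleaved variants, and is the recommended path for PyTorch-native projects that do not already use Megatron. The typical flow is to split the model with `pipeline()` (or build a `PipelineStage` by hand), wrap the stage in a schedule such as `ScheduleGPipe` or `Schedule1F1B`, and call `schedule.step()` once per mini-batch; see the `torch.distributed.pipelining` documentation.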

On Yobitel NeoCloud, the multi-rack training pod topology is documented in the training-pod runbook: 8 GPUs per HGX baseboard, 4-8 baseboards per rack, rack-level InfiniBand NDR uplinks. PP boundaries are pinned to rack-level uplinks by the pre-validated Megatron-LM launch templates. Yobibyte's managed fine-tune service uses Megatron primitives under the hood and handles topology mapping automatically when a customer recipe (e.g. 100B+ from-scratch pretrain) requires PP; most customer fine-tunes stay below the PP threshold and use FSDP2 or DeepSpeed ZeRO-3 inside a single NeoCloud HSDP pod.

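Megatron-LM: `--pipeline-model-parallel-size 16 --virtual-pipeline-model-parallel-size 5`, plus the zero-bubble and P2P-overlap options for your release (referred to here as `--zero-bubble` and `--overlap-p2p-communication`; names vary by version and fork).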
DeepSpeed: `PipelineModule(layers=..., num_stages=K, partition_method='uniform')`; pairs with ZeRO-1 for optimiser state.
PyTorch native: `from torch.distributed.pipelining import pipeline, ScheduleGPipe, Schedule1F1B`; composes with a DeviceMesh for the DP and TP axes.
NeMo: `model.pipeline_model_parallel_size: 16` in Hydra config; same engine as Megatron Core.
NCCL: NCCL >= 2.20 for P2P over IB; NCCL_BUFFSIZE and NCCL_NET_GDR_LEVEL tuned for the fabric.
Observability: track per-stage iteration time, P2P p99 latency, bubble fraction (1 - useful_work_time / iteration_time).
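The bubble-fraction metric in the last item can be computed from per-stage timings. This is a minimal sketch: the function names are illustrative and the timing values are hypothetical, standing in for whatever your profiler reports.

```python
def bubble_fraction(useful_work_time: float, iteration_time: float) -> float:
    """Share of the iteration a stage spent idle: 1 - useful / total."""
    if iteration_time <= 0:
        raise ValueError("iteration_time must be positive")
    return 1.0 - useful_work_time / iteration_time


def busiest_stage(useful_times: dict[int, float]) -> int:
    """Stage with the most compute time, i.e. the likely load-balance bottleneck."""
    return max(useful_times, key=useful_times.get)


# Hypothetical per-stage compute seconds within a 1.0 s iteration.
useful = {0: 0.91, 1: 0.88, 2: 0.93, 3: 0.90}
iteration_time = 1.0

for stage, t in useful.items():
    print(f"stage {stage}: bubble = {bubble_fraction(t, iteration_time):.1%}")
print("busiest stage:", busiest_stage(useful))
```

A stage whose compute time stands well above the others suggests a layer-count or embedding imbalance rather than a scheduling problem.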

Where pipeline parallelism sits in the Yobitel stack

Pipeline parallelism is the across-node axis on Yobitel NeoCloud multi-rack training pods. The reference recipe for any 175B+ pretrain is TP=8 within rack x PP across racks x DP for the rest, all running on Megatron-LM or NeMo inside pre-validated containers. NeoCloud's rack-level InfiniBand NDR fabric (8x 400 Gb/s per rack) is sized for the activation P2P traffic this recipe generates, and the launch templates pin PP boundaries to the rack uplinks so cross-rack send/recv stays predictable.

For Yobibyte managed fine-tune and inference customers, PP is handled automatically — the platform selects the parallelism shape per model and per region and customers do not specify K, V, or M themselves. NeoCloud self-operating customers running their own Megatron-LM or NeMo launches keep full control of the PP shape and have access to the same fabric-tuned NCCL configuration and the documented topology map. For pretraining runs at the frontier-model scale that drive much of NeoCloud's design, PP is what allows the cluster to scale past a single rack without quadratic communication blow-up.

References

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism · arXiv (Huang et al., 2018)
PipeDream: Generalized Pipeline Parallelism for DNN Training · arXiv (Narayanan et al., 2019)
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM · arXiv (Narayanan et al., 2021)
Zero Bubble Pipeline Parallelism · arXiv (Qi et al., 2024)
PyTorch torch.distributed.pipelining documentation · PyTorch
