FSDP — Fully Sharded Data Parallel

TL;DR

Built into PyTorch (`torch.distributed.fsdp`), FSDP shards model parameters, gradients, and optimiser state across data-parallel workers — architecturally equivalent to DeepSpeed ZeRO-3 at the FULL_SHARD setting. BSD-licensed and shipped with every PyTorch wheel.
Originated from Meta's FairScale FullyShardedDataParallel (2021), upstreamed to core PyTorch 1.11 (2022); FSDP2 (PyTorch 2.4, 2024) rewrote it on top of DTensor with per-parameter sharding instead of flat parameters, making composition with TP, PP, and EP first-class.
Default choice for the PyTorch ecosystem when 3D parallelism is overkill — wired into HuggingFace Trainer, Accelerate, torchtune, Lightning, and TRL. The standard 8-128 GPU training and fine-tuning surface on Yobitel NeoCloud.
Tuning surface is small but meaningful: `auto_wrap_policy`, `sharding_strategy` (FULL_SHARD / SHARD_GRAD_OP / HYBRID_SHARD / NO_SHARD), `mixed_precision` (BF16 by default), `backward_prefetch`, `cpu_offload`, `use_orig_params`. Getting these right is most of the FSDP ops effort.

Overview#

FSDP is PyTorch's answer to ZeRO-3. Each worker owns 1/N of the parameters; before each forward layer (the 'wrapping unit'), the worker AllGathers the full layer parameters, runs the layer, then discards the gathered weights. Backward proceeds symmetrically with a ReduceScatter on gradients. The optimiser step is local on each worker's parameter shard. Memory per rank drops from O(P) for vanilla DDP to O(P/N) — the same factor as ZeRO-3 — at the cost of one extra AllGather per wrapping unit per pass.

FSDP first appeared inside Meta's FairScale library in 2021, was rewritten and upstreamed into `torch.distributed.fsdp` for PyTorch 1.11 (March 2022), and was rewritten again as FSDP2 in PyTorch 2.4 (2024). The FSDP2 rewrite replaced the FlatParameter abstraction with per-parameter DTensor sharding, which made FSDP composable with tensor parallelism (HSDP — Hybrid Sharded Data Parallel), pipeline parallelism, and expert parallelism without the special-case glue the FlatParameter design required.

By 2026 FSDP is the dominant sharded-data-parallel strategy in the PyTorch ecosystem. HuggingFace Trainer wires it up via `accelerate launch --use-fsdp`; HuggingFace Accelerate exposes it through `FullyShardedDataParallelPlugin`; torchtune (Meta's lightweight LLM fine-tuning library) is FSDP2-native; Lightning Fabric exposes it as a one-flag strategy; and TRL's SFT / DPO / PPO trainers all support FSDP. Yobitel NeoCloud customers fine-tuning Llama 3.1, Qwen 2.5, Mistral, and DeepSeek-V3 derivatives on 8-128 GPU H100/H200 pods commonly use FSDP2 as the default engine, and the Yobibyte managed fine-tune service wraps FSDP (under the hood) for QLoRA and full-parameter recipes on 7B-70B targets.

This entry documents the production surface: the FSDP2 API and its predecessor FSDP1, the four sharding strategies, mixed-precision and CPU-offload configuration, the auto_wrap_policy for transformer blocks, sizing tables for common Llama-class workloads, observability hooks, and the migration paths to and from DeepSpeed ZeRO. This entry helps you stand up FSDP for sharded training and fine-tuning on Yobitel NeoCloud or your own multi-GPU PyTorch cluster, with the right wrapping, sharding, and precision choices.

Quick start#

The example below fine-tunes Llama 3.1 8B on 8x H100 SXM5 using FSDP2 with FULL_SHARD, BF16 mixed precision, transformer-block-granularity auto-wrap, and `backward_prefetch=BACKWARD_PRE`. The first block installs PyTorch 2.4+ and writes a minimal training script that uses the FSDP2 API. The second block launches with `torchrun`. The third block shows the equivalent HuggingFace Accelerate path that picks up the same shape via YAML.

bash

# 1. Install PyTorch 2.4+ (FSDP2) and the model libraries
pip install "torch>=2.4" transformers accelerate datasets

cat > train_fsdp2.py <<'PY'
import os
import functools
import torch
import torch.distributed as dist
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy
from torch.distributed.device_mesh import init_device_mesh
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

dist.init_process_group(backend="nccl")
rank = dist.get_rank(); world = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

mesh = init_device_mesh("cuda", (world,), mesh_dim_names=("dp",))
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,   # keep gradient reduction in FP32 for stability
)

# Shard every transformer block, then shard the root model
for layer in model.model.layers:
    fully_shard(layer, mesh=mesh, mp_policy=mp_policy)
fully_shard(model, mesh=mesh, mp_policy=mp_policy)

optim = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0,
                          fused=True)
# ... training loop: loss.backward() and optim.step() use the sharded params.
PY

# 2. Launch on 8x H100 SXM5
torchrun --standalone --nproc_per_node=8 train_fsdp2.py

# 3. Equivalent HuggingFace Accelerate config (fsdp.yaml)
cat > fsdp.yaml <<'YAML'
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_machines: 1
num_processes: 8
fsdp_config:
  fsdp_version: 2
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_use_orig_params: true
  fsdp_offload_params: false
YAML
accelerate launch --config_file fsdp.yaml sft.py

On FSDP1, set `use_orig_params=True` so per-parameter optimisers (LoRA, layer-wise LR groups, parameter-specific weight decay) work as written. FSDP2 always uses original parameters and exposes this implicitly.

How it works#

FSDP shards three pieces of training state across the data-parallel group: parameters, gradients, and optimiser state. The compute graph is unchanged from DDP; what changes is which rank owns which bytes at which moment and the collective operations that move bytes in and out of GPU memory just in time. FSDP organises parameters into 'wrapping units' (usually transformer blocks) — the AllGather / ReduceScatter granularity is per-unit.

Forward pass: before unit U executes, FSDP issues `all_gather` on U's sharded parameters to reconstruct the full layer weights on every rank. The unit runs the forward, producing activations as usual. Immediately after the forward of U, the gathered copies are freed; only the local shard plus the activations remain in memory. If `backward_prefetch=BACKWARD_PRE` is set (the recommended default), the AllGather for unit U+1 is launched asynchronously while U is still computing, so most of the communication overlaps with compute.

Backward pass: each unit's full parameters are AllGathered again (FSDP2 can cache them when activations are still live, FSDP1 always re-gathers). The backward produces gradients of the full unit; FSDP then ReduceScatters the gradients across the FSDP group so each rank ends up with 1/N of the gradient. Gathered parameters are freed again. Once the entire backward completes, each rank has the gradient slice corresponding to its parameter slice.

Optimiser step: AdamW (or the chosen optimiser) runs locally on each rank's slice. There is no AllReduce; the per-rank update is mathematically equivalent to the global step because the work has been partitioned. After the step, FSDP holds only the updated parameter shards — no AllGather is required until the next forward.

Four sharding strategies are exposed: FULL_SHARD (true ZeRO-3, the default at FSDP2), SHARD_GRAD_OP (shard gradients and optimiser state but replicate parameters — ZeRO-2 equivalent, lower memory savings but no per-forward AllGather), NO_SHARD (degenerates to DDP, useful as a sanity baseline), and HYBRID_SHARD (FULL_SHARD inside a node-shaped sub-group, DDP-style replication across that sub-group — gives intra-node ZeRO-3 plus inter-node DDP, the canonical multi-node-pod choice).

Mixed precision is handled by `MixedPrecisionPolicy` (FSDP2) or `MixedPrecision` (FSDP1). Three dtypes are configurable: `param_dtype` (the working precision of gathered weights during forward/backward, usually BF16), `reduce_dtype` (the precision of the gradient ReduceScatter, often FP32 for numerical stability), and `output_dtype` (the dtype the model returns to user code). The Adam master weights remain in FP32 on the sharded slice regardless.

FULL_SHARD: params + grads + optimiser state sharded; ~N-fold memory cut; AllGather per forward + per backward + ReduceScatter per backward.
SHARD_GRAD_OP: grads + optimiser state sharded, params replicated; ~50 percent memory cut vs DDP; no per-forward AllGather.
HYBRID_SHARD: FULL_SHARD within node-shaped subgroup, replicated across subgroups; the recommended default for multi-node pods.
NO_SHARD: equivalent to DDP — a sanity-check baseline, not a production choice.
MixedPrecisionPolicy: BF16 working dtype + FP32 reduce dtype is the safe default; FP16 with loss scaling is only for Volta/Turing hardware.
Auto-wrap policy: wrap at transformer-block granularity (`LlamaDecoderLayer`, `MistralDecoderLayer`); module-by-module wrapping makes AllGather overhead dominate.
CPU offload: `CPUOffload(offload_params=True)` moves parameter shards to CPU when not in use — frees GPU memory at the cost of PCIe traffic.

FSDP2's per-parameter DTensor sharding is what makes composition with tensor parallelism work cleanly. In FSDP1, combining FSDP with Megatron-style TP required hand-written glue; in FSDP2, you build a 2D device mesh (`dp x tp`) and shard naturally — the recommended path for 70B+ models on PyTorch.

Reference and specifications#

FSDP is configured by the call to `fully_shard()` (FSDP2) or the `FullyShardedDataParallel` constructor (FSDP1) and through the surrounding mixed-precision / wrapping / offload policy objects. The table below documents the surface that matters for production runs as of PyTorch 2.4 (June 2026). Fields marked FSDP2-only do not exist on FSDP1; FSDP1 equivalents (`MixedPrecision`, `sharding_strategy=ShardingStrategy.FULL_SHARD`) are noted where the migration is non-obvious.

Field	API	Default	Description
mesh	fully_shard(mesh=...)	(required)	FSDP2 DeviceMesh defining the DP group; build with init_device_mesh.
sharding_strategy	FSDP1 ctor	FULL_SHARD	FULL_SHARD \| SHARD_GRAD_OP \| HYBRID_SHARD \| NO_SHARD.
mp_policy.param_dtype	MixedPrecisionPolicy	(unset)	Working dtype for gathered weights; torch.bfloat16 recommended.
mp_policy.reduce_dtype	MixedPrecisionPolicy	= param_dtype	Gradient ReduceScatter dtype; torch.float32 for stability.
mp_policy.output_dtype	MixedPrecisionPolicy	(unset)	Dtype returned to user code; usually omitted.
mp_policy.cast_forward_inputs	MixedPrecisionPolicy	true	Cast forward inputs to param_dtype automatically.
auto_wrap_policy	FSDP1 ctor	(off)	Policy that selects which submodules become wrapping units.
transformer_auto_wrap_policy	torch.distributed.fsdp.wrap	(helper)	Wrap at transformer-block granularity; supply `transformer_layer_cls`.
size_based_auto_wrap_policy	torch.distributed.fsdp.wrap	(helper)	Wrap units when parameter count exceeds threshold.
backward_prefetch	FSDP1 ctor / FSDP2 set_modules_to_forward_prefetch	BACKWARD_PRE	BACKWARD_PRE \| BACKWARD_POST \| NO_PREFETCH; PRE overlaps best.
forward_prefetch	FSDP1 ctor	false	Issue next unit's AllGather during current unit's forward.
use_orig_params	FSDP1 ctor	true (FSDP2)	Preserve original parameter objects (required for LoRA, per-param LR groups).
cpu_offload	CPUOffload(offload_params=True)	(off)	Spill parameter shards to CPU RAM; pairs with pinned memory.
limit_all_gathers	FSDP1 ctor	true	Throttle outstanding AllGathers to one at a time (memory cap).
state_dict_type	FSDP.set_state_dict_type	FULL_STATE_DICT	FULL_STATE_DICT \| SHARDED_STATE_DICT \| LOCAL_STATE_DICT.
sharded checkpoint format	torch.distributed.checkpoint	torch_dist	Parallel sharded save/load; required for 30B+ checkpoints.
fsdp_state_dict_type (Accelerate)	fsdp.yaml	SHARDED_STATE_DICT	Use SHARDED for checkpoints; FULL only for HF export.
fsdp_offload_params (Accelerate)	fsdp.yaml	false	Equivalent to CPUOffload(offload_params=True).
fsdp_sync_module_states	Accelerate	true	Broadcast rank-0 weights to all ranks at init.
fsdp_use_orig_params	Accelerate	true	Required for LoRA / DPO / partial-param training.
fsdp_cpu_ram_efficient_loading	Accelerate	true	Load weights on rank 0 only, then shard — avoids OOM on huge models.
fsdp_activation_checkpointing	Accelerate	false	Apply torch.utils.checkpoint to wrapped units.
mesh_dim for HSDP	init_device_mesh("cuda", (replica, shard), ...)	(custom)	2D mesh: shard inside each replica group, replicate across replicas.
NCCL_BUFFSIZE	env var	default	Raise to 8388608 (8MB) for FSDP at high world size.
TORCH_DISTRIBUTED_DEBUG	env var	OFF	DETAIL emits per-collective timing; INFO emits collective lifecycle.

`state_dict_type=FULL_STATE_DICT` materialises the entire model on rank 0 at save time — at 70B BF16 that is 140 GB of GPU memory just to save. Use SHARDED_STATE_DICT for routine checkpoints, then convert to FULL only when exporting to HuggingFace (via `torch.distributed.checkpoint.load` followed by an offline gather).

Workload patterns#

Three workload shapes cover the bulk of FSDP production usage: full-parameter SFT of 7B-13B models on a single 8-GPU node, full-parameter SFT or continued pretraining of 70B on a 4-node HSDP pod, and QLoRA fine-tuning of a 70B model on a single 8x H100 node with CPU offload. Each maps to a different config emphasis, and each maps to a different Yobitel NeoCloud training-pod shape — Pattern A on a single 8x H100 SXM5 node, Pattern B on a 32-GPU (4-node) NeoCloud HSDP pod, Pattern C on a single 8x H100 node with the Yobibyte managed fine-tune service.

Pattern A — FULL_SHARD SFT of Llama 3.1 8B on a single 8x H100 node, the standard small-model fine-tuning shape. Pattern B — HYBRID_SHARD continued-pretrain of Llama 3.1 70B on a 32-GPU NeoCloud pod, with FULL_SHARD inside each node and replicated AllReduce across nodes. Pattern C — QLoRA on Llama 3.1 70B with CPU offload, the cheapest 70B fine-tune shape on a single 8x H100 box; this is the shape Yobibyte's managed fine-tune service offers as a turnkey recipe for customers who do not want to operate FSDP themselves.

python

# A — FULL_SHARD SFT of Llama 3.1 8B on a single 8x H100 node
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp",))
mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16,
                          reduce_dtype=torch.float32)
for layer in model.model.layers:
    fully_shard(layer, mesh=mesh, mp_policy=mp)
fully_shard(model, mesh=mesh, mp_policy=mp)
# Optimiser sees sharded params; AdamW(fused=True) is the standard choice.


# B — HYBRID_SHARD continued-pretrain of Llama 3.1 70B on 32 GPUs (4 nodes)
#     FULL_SHARD inside each 8-GPU node, replicated across the 4 nodes.
mesh = init_device_mesh("cuda", (4, 8), mesh_dim_names=("replicate", "shard"))
for layer in model.model.layers:
    fully_shard(layer, mesh=mesh["replicate", "shard"], mp_policy=mp,
                reshard_after_forward=True)
fully_shard(model, mesh=mesh["replicate", "shard"], mp_policy=mp)
# Effective layout: each node holds the full model in sharded form;
# inter-node AllReduce only runs on gradients (DDP-style).


# C — QLoRA on Llama 3.1 70B with CPU offload (Yobibyte managed shape)
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
import functools

# Apply 4-bit quantisation + LoRA adapters first, then wrap with FSDP1.
auto_wrap = functools.partial(transformer_auto_wrap_policy,
                              transformer_layer_cls={LlamaDecoderLayer})
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    auto_wrap_policy=auto_wrap,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.float32),
    cpu_offload=CPUOffload(offload_params=True),
    use_orig_params=True,           # required for PEFT/LoRA
    limit_all_gathers=True,
)

Sizing and capacity planning#

The questions that drive FSDP sizing are: which sharding strategy fits the model, what per-GPU throughput will I get, and where does CPU offload help versus hurt? The table below gives reference configs and observed throughput for BF16 AdamW-trained Llama-family architectures at 4K sequence length, with `mp_policy=BF16/FP32`, `backward_prefetch=BACKWARD_PRE`, FlashAttention-2 on, and activation checkpointing on. Throughput is per-GPU sustained tokens-per-second from Yobitel-operated NeoCloud fleets and published Meta / HF benchmarks.

FULL_SHARD is the right default through 70B on 8 GPUs; above 70B, compose with TP via FSDP2 + DTensor.
HYBRID_SHARD beats FULL_SHARD on multi-node pods because the cross-node AllGather is the main cost — keep AllGather inside the NVLink island.
CPU offload costs 2-3x throughput; only worth it when no other capacity is available or when running QLoRA at the limit.
Activation checkpointing roughly halves activation memory at ~30 percent FLOP cost — required for 70B at long context on 80 GB cards.
FlashAttention-2/3 is mandatory at scale; without it, attention activations dominate before FSDP can help.
Effective batch = micro_batch x DP x grad_accum; aim for 1-4M tokens per step for stable AdamW.

Model	Cluster	Strategy	Per-rank GPU mem	Per-GPU tok/s
7B (Llama 3.1)	8x A100 80GB	FULL_SHARD	~36 GB	8,200
7B	8x H100 80GB	FULL_SHARD	~32 GB	14,800
8B	8x H100	FULL_SHARD	~38 GB	13,500
13B	8x H100	FULL_SHARD	~52 GB	8,400
70B	8x H100	FULL_SHARD	~74 GB	2,400
70B	8x H100 + CPU offload	FULL_SHARD + CPUOffload	~58 GB	950
70B (QLoRA)	8x H100 + CPU offload	FULL_SHARD + CPUOffload + 4-bit	~44 GB	1,800
70B	32x H100 (4 nodes)	HYBRID_SHARD	~52 GB	2,650
70B	64x H100 (8 nodes)	HYBRID_SHARD	~46 GB	2,700
70B + TP=2	32x H100, FSDP2 + TP	HSDP x TP=2	~38 GB	2,900
8x7B MoE (Mixtral)	16x H100	FULL_SHARD + EP=4	~62 GB	2,200

Limits and quotas#

FSDP itself imposes few hard limits; the constraints come from GPU memory, NCCL configuration, the auto-wrap policy, and the checkpoint pipeline. The table below summarises the operational ceilings worth knowing before designing an FSDP run.

Constraint	Default / ceiling	How to manage
DP group size	= world_size (FULL_SHARD)	No hard upper bound; AllGather amortises poorly past ~512.
HSDP replica x shard product	= world_size	Replica size = nodes, shard size = GPUs per node typical.
Wrapping unit count	Policy-dependent	Transformer-block granularity is the production default.
Per-unit AllGather memory	Working dtype x unit params	BF16 70B block ~280 MB — comfortable at limit_all_gathers=True.
CPU RAM (offload)	Host-bounded	Need ~16 bytes/param of CPU RAM for offload + AdamW state.
Checkpoint save (FULL)	2x model size on rank 0	Switch to SHARDED_STATE_DICT for routine saves.
NCCL communicator count	World-size dependent	Set NCCL_BUFFSIZE=8388608, NCCL_NTHREADS=2; use NCCL >= 2.20.
use_orig_params	true (FSDP2)	Required for LoRA, partial-freeze, per-param LR groups.
FSDP + TP composition	FSDP2 only	Use 2D device mesh; FSDP1 path is brittle.
FSDP + PP composition	FSDP2 + torch.distributed.pipelining	Stage boundaries cannot cross FSDP wrapping units.
BF16 hardware	Ampere+	Pre-Ampere needs FP16 + dynamic loss scaling (use torch.cuda.amp).
Single-step wallclock	1-15s typical	Iteration time should hold within +-5 percent in steady state.

Observability#

FSDP emits diagnostic information through standard PyTorch hooks plus a few FSDP-specific ones: `FSDP.summon_full_params()` for inspection, `set_state_dict_type` for checkpoint shape, and the optional `TORCH_DISTRIBUTED_DEBUG=DETAIL` env var that traces per-collective timing. In production you pair this with TensorBoard / W&B for loss curves, DCGM exporter for GPU-level metrics, and NCCL profiling for collective-level diagnostics. The signals that detect 90 percent of FSDP production issues are throughput drift, AllGather p99 latency, and grad-norm sanity.

samples_per_second / tokens_per_second: holds within +-5 percent in steady state; drops correlate with AllGather under-overlap.
AllGather p99 / ReduceScatter p99: visible under TORCH_DISTRIBUTED_DEBUG=DETAIL; >100ms inside an NVLink island means topology regression.
grad_norm: spikes above 10x baseline indicate divergence — almost always LR or data, not FSDP.
GPU memory peak per step: log with torch.cuda.max_memory_allocated() to validate wrap policy.
DCGM_FI_PROF_NVLINK_TX_BYTES vs DCGM_FI_PROF_PCIE_TX_BYTES: separates FSDP collectives (NVLink) from CPU offload traffic (PCIe).
torch.profiler with NCCL traces: the right tool for diagnosing 'why is AllGather not overlapping'.
Checkpoint save duration: a sudden jump usually means FULL_STATE_DICT accidentally got re-enabled.

yaml

# Prometheus rules for an FSDP training job
groups:
  - name: fsdp-training
    interval: 60s
    rules:
      - alert: FSDPStepTimeRegression
        expr: |
          avg_over_time(fsdp:step_time_seconds[5m]) >
          1.10 * avg_over_time(fsdp:step_time_seconds[1h] offset 30m)
        for: 10m
        labels: { severity: warning, team: training }
        annotations:
          summary: "FSDP step time +10% vs 1h baseline on {{ $labels.job_name }}"

      - alert: FSDPAllGatherP99
        expr: histogram_quantile(0.99,
                rate(fsdp:allgather_seconds_bucket[5m])) > 0.2
        for: 10m
        annotations:
          summary: "FSDP AllGather p99 > 200ms — fabric or wrap-policy regression"

      - alert: FSDPGradNormSpike
        expr: fsdp:grad_norm > 10 * avg_over_time(fsdp:grad_norm[1h] offset 30m)
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "FSDP grad-norm spike — investigate LR or data corruption"

Cost and FinOps#

FSDP cost is a function of throughput tax for the chosen sharding strategy and the hardware mix. FULL_SHARD on a single 8-GPU node is within ~5 percent of DDP throughput where DDP fits; CPU offload typically costs 2-3x in wall-clock; HSDP across nodes preserves most of FULL_SHARD's per-GPU throughput. The table below uses Yobitel NeoCloud list pricing (June 2026) for the canonical 70B fine-tuning shape (Llama 3.1 70B SFT on 8x H100 SXM5) and the alternative shapes.

Single-node FSDP with FULL_SHARD beats DeepSpeed ZeRO-3 on the same hardware by ~5-10 percent on routine 7B-13B fine-tunes — same math, tighter PyTorch integration.
QLoRA is the cheapest 70B fine-tune path on a single 8x H100 node; the Yobibyte managed fine-tune service exposes this as a turnkey recipe at the same NeoCloud node rate.
HSDP across 4 nodes finishes the same job ~5x faster than single-node FULL_SHARD at the same USD-per-token — buy more nodes for time-to-result, not for cost.
Activation checkpointing trades ~30 percent FLOPs for ~5x activation memory — usually still cheaper than scaling out.
Reserved capacity (Yobitel NeoCloud 1-3 year terms) drops the rate 35-40 percent for predictable multi-month runs.

Config	Node shape	Hours / 1B tokens	Hourly rate	USD / 1B tokens
FULL_SHARD, no offload	8x H100 80GB	50	$24.80	$1,240
FULL_SHARD + CPU offload	8x H100 + 1TB DDR5	125	$28.40	$3,550
QLoRA + CPU offload	8x H100 + 1TB DDR5	65	$28.40	$1,846
HYBRID_SHARD (32 GPU pod)	32x H100 (4 nodes)	11	$99.20	$1,091
HSDP x TP=2 (32 GPU pod)	32x H100 (4 nodes)	10	$99.20	$992
DeepSpeed ZeRO-3 equivalent	8x H100 80GB	55	$24.80	$1,364

Security and compliance#

FSDP is a training library; security and compliance apply at the cluster boundary the same way they do for DeepSpeed or Megatron. The library itself is BSD-licensed and ships inside PyTorch with no telemetry call-home. For UK and EU sovereign deployments, FSDP runs inside the same NCSC Cloud Security Principles- and G-Cloud 14-aligned Yobitel NeoCloud tenancies as the other training engines — the framework choice does not change the sovereignty posture.

Two operational notes for compliance audits. First, FSDP sharded checkpoints (`__N_N.distcp` files in the torch_dist format) are not portable across world sizes without an explicit re-shard step; use `torch.distributed.checkpoint.load` or `torch.distributed.checkpoint.format_utils.dcp_to_torch_save` to materialise a single state dict for audit retention. Second, CPU-offload spills sensitive weights to host RAM and (when pin_memory is on) keeps them in a pinned region that survives container restart; ensure the host environment is dedicated and that ephemeral storage is encrypted on Yobitel NeoCloud sovereign tenancies.

Migration and alternatives#

FSDP and DeepSpeed ZeRO have converged functionally at the math layer — FSDP FULL_SHARD and ZeRO-3 give the same memory and throughput on the same workload. The meaningful differences are ergonomic and ecosystem. FSDP has tighter PyTorch core integration, native DTensor composition with TP, and a cleaner extension story via FSDP2 (`torch.distributed._composable.fsdp.fully_shard`). DeepSpeed has the richer offload story (CPU and NVMe via ZeRO-Infinity), the Megatron-DeepSpeed hybrid, and an established HuggingFace Trainer config surface. For frontier 100B+ training, Megatron-LM with 3D parallelism is the production default; FSDP composes with TP via DTensor but the recipe ecosystem is thinner.

From / To	Effort	When to choose	What you keep / lose
DDP -> FSDP FULL_SHARD	Low — wrap + config	Model > 1 GPU memory.	Keep DDP semantics; gain N-fold memory headroom.
FSDP1 -> FSDP2	Medium — code rewrite	Need TP composition or per-param introspection.	Lose FlatParameter; gain DTensor + clean PEFT integration.
DeepSpeed ZeRO-3 -> FSDP	Medium — config + code	PyTorch-native stack, no NVMe offload needed.	Lose ZeRO-Infinity; gain FSDP2 + torchtune ergonomics.
FSDP -> DeepSpeed	Medium — config + launcher	Need NVMe offload or Megatron-DeepSpeed hybrid.	Gain offload tier; lose FSDP2 ergonomics.
FSDP -> Megatron-LM	High — different paradigm	True 3D parallelism at frontier scale.	Gain TP+PP+SP+CP; lose PyTorch-native ergonomics.
FSDP -> Yobibyte managed fine-tune	Trivial — submit job	Don't want to operate FSDP at all.	Lose direct flag access; gain pre-validated QLoRA / SFT recipes on Yobitel NeoCloud.

If you are starting fresh in 2026 on PyTorch and don't need NVMe offload, prefer FSDP2 over DeepSpeed ZeRO. The two are equivalent at the math layer, but FSDP2 composes more cleanly with the rest of the torch.distributed stack. If you want zero ops burden, Yobibyte's managed fine-tune service wraps FSDP under the hood and exposes a Llama / Qwen QLoRA recipe surface — same engine, no cluster operation.

Troubleshooting#

The error patterns below cover the common FSDP failure modes observed on Yobitel-operated training fleets and the public PyTorch issue tracker. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix.

Symptom	Cause	Fix
OOM during model load before training	Full model materialised on every rank before sharding.	Set fsdp_cpu_ram_efficient_loading=True or use torch.distributed.checkpoint.load.
LoRA / PEFT parameters do not update	use_orig_params=False flattens params; PEFT cannot find adapters.	Set use_orig_params=True (default in FSDP2).
AllGather not overlapping with compute	backward_prefetch=NO_PREFETCH or wrap policy too fine.	Set backward_prefetch=BACKWARD_PRE; wrap at transformer-block granularity.
Throughput dies past one node	FULL_SHARD AllGather over IB dominates.	Switch to HYBRID_SHARD with replicate dim = nodes.
from_pretrained fails on FSDP checkpoint	SHARDED_STATE_DICT cannot load into HF directly.	Convert with torch.distributed.checkpoint.format_utils.dcp_to_torch_save.
CPU offload step >10s per micro-batch	fused AdamW not used or PCIe Gen3 link.	Set fused=True on optimiser; verify PCIe Gen4 or Gen5.
NaN loss with FP16 mixed precision	Dynamic loss scale collapsed.	Switch to BF16 (mp_policy.param_dtype=torch.bfloat16).
Grad-norm spikes only on some ranks	Per-rank data divergence (bad shard or stale dataloader seed).	Verify deterministic dataloader sharding; check seed.
Checkpoint save takes >10 minutes	FULL_STATE_DICT gather on rank 0.	Switch to SHARDED_STATE_DICT + torch_dist async writes.
FSDP + TP shape error	Using FSDP1 instead of FSDP2 for 2D mesh.	Migrate to fully_shard() with init_device_mesh 2D mesh.
Hanging at init_process_group on multi-node	NCCL not finding the right NIC.	Set NCCL_SOCKET_IFNAME and NCCL_IB_HCA explicitly.
grad-norm zero on some steps	Activation checkpointing recomputing into FP16 with no autocast.	Wrap checkpoint function with torch.cuda.amp.autocast(dtype=torch.bfloat16).

Where this fits in the Yobitel stack#

FSDP is one of the supported training engines on Yobitel NeoCloud sovereign tenancies. For PyTorch-native customers running 7B-70B fine-tunes on 8-128 GPU pods, the canonical path is FSDP2 with FULL_SHARD inside an 8-GPU node and HYBRID_SHARD across multi-node pods, BF16 mixed precision, FlashAttention-2/3, and SHARDED_STATE_DICT checkpoints landing on the pod's parallel filesystem. Pre-built NGC-derived containers ship PyTorch 2.4+, the NCCL and Transformer Engine versions tested against the Yobitel NeoCloud H100 / H200 / B200 fleets, and DCGM observability dashboards pre-applied.

For customers who want the QLoRA / SFT recipe without operating FSDP directly, Yobibyte's managed fine-tune service wraps FSDP under the hood and exposes a higher-level Llama / Qwen / Mistral recipe surface backed by the same NeoCloud capacity — same engine, no cluster operation. Trained checkpoints flow into Yobibyte's inference path (industry-standard runtimes) after the standard HuggingFace conversion, and customer-tuned weights stay in the customer's sovereign region throughout the lifecycle.

References

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel · arXiv (Zhao et al., 2023)
PyTorch FSDP documentation · PyTorch
FSDP2: per-parameter sharding with DTensor · PyTorch 2.4
Introducing PyTorch Fully Sharded Data Parallel API · PyTorch Blog (Meta)
HuggingFace Accelerate FSDP integration · HuggingFace
torchtune · GitHub (Meta)

TL;DR

Built into PyTorch (`torch.distributed.fsdp`), FSDP shards model parameters, gradients, and optimiser state across data-parallel workers — architecturally equivalent to DeepSpeed ZeRO-3 at the FULL_SHARD setting. BSD-licensed and shipped with every PyTorch wheel.
Originated from Meta's FairScale FullyShardedDataParallel (2021), upstreamed to core PyTorch 1.11 (2022); FSDP2 (PyTorch 2.4, 2024) rewrote it on top of DTensor with per-parameter sharding instead of flat parameters, making composition with TP, PP, and EP first-class.
Default choice for the PyTorch ecosystem when 3D parallelism is overkill — wired into HuggingFace Trainer, Accelerate, torchtune, Lightning, and TRL. The standard 8-128 GPU training and fine-tuning surface on Yobitel NeoCloud.
Tuning surface is small but meaningful: `auto_wrap_policy`, `sharding_strategy` (FULL_SHARD / SHARD_GRAD_OP / HYBRID_SHARD / NO_SHARD), `mixed_precision` (BF16 by default), `backward_prefetch`, `cpu_offload`, `use_orig_params`. Getting these right is most of the FSDP ops effort.

Overview#

Quick start#

bash

# 1. Install PyTorch 2.4+ (FSDP2) and the model libraries
pip install "torch>=2.4" transformers accelerate datasets

cat > train_fsdp2.py <<'PY'
import os
import functools
import torch
import torch.distributed as dist
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy
from torch.distributed.device_mesh import init_device_mesh
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

dist.init_process_group(backend="nccl")
rank = dist.get_rank(); world = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

mesh = init_device_mesh("cuda", (world,), mesh_dim_names=("dp",))
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,   # keep gradient reduction in FP32 for stability
)

# Shard every transformer block, then shard the root model
for layer in model.model.layers:
    fully_shard(layer, mesh=mesh, mp_policy=mp_policy)
fully_shard(model, mesh=mesh, mp_policy=mp_policy)

optim = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0,
                          fused=True)
# ... training loop: loss.backward() and optim.step() use the sharded params.
PY

# 2. Launch on 8x H100 SXM5
torchrun --standalone --nproc_per_node=8 train_fsdp2.py

# 3. Equivalent HuggingFace Accelerate config (fsdp.yaml)
cat > fsdp.yaml <<'YAML'
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_machines: 1
num_processes: 8
fsdp_config:
  fsdp_version: 2
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_use_orig_params: true
  fsdp_offload_params: false
YAML
accelerate launch --config_file fsdp.yaml sft.py

How it works#

FULL_SHARD: params + grads + optimiser state sharded; ~N-fold memory cut; AllGather per forward + per backward + ReduceScatter per backward.
SHARD_GRAD_OP: grads + optimiser state sharded, params replicated; ~50 percent memory cut vs DDP; no per-forward AllGather.
HYBRID_SHARD: FULL_SHARD within node-shaped subgroup, replicated across subgroups; the recommended default for multi-node pods.
NO_SHARD: equivalent to DDP — a sanity-check baseline, not a production choice.
MixedPrecisionPolicy: BF16 working dtype + FP32 reduce dtype is the safe default; FP16 with loss scaling is only for Volta/Turing hardware.
Auto-wrap policy: wrap at transformer-block granularity (`LlamaDecoderLayer`, `MistralDecoderLayer`); module-by-module wrapping makes AllGather overhead dominate.
CPU offload: `CPUOffload(offload_params=True)` moves parameter shards to CPU when not in use — frees GPU memory at the cost of PCIe traffic.

Reference and specifications#

Field	API	Default	Description
mesh	fully_shard(mesh=...)	(required)	FSDP2 DeviceMesh defining the DP group; build with init_device_mesh.
sharding_strategy	FSDP1 ctor	FULL_SHARD	FULL_SHARD \| SHARD_GRAD_OP \| HYBRID_SHARD \| NO_SHARD.
mp_policy.param_dtype	MixedPrecisionPolicy	(unset)	Working dtype for gathered weights; torch.bfloat16 recommended.
mp_policy.reduce_dtype	MixedPrecisionPolicy	= param_dtype	Gradient ReduceScatter dtype; torch.float32 for stability.
mp_policy.output_dtype	MixedPrecisionPolicy	(unset)	Dtype returned to user code; usually omitted.
mp_policy.cast_forward_inputs	MixedPrecisionPolicy	true	Cast forward inputs to param_dtype automatically.
auto_wrap_policy	FSDP1 ctor	(off)	Policy that selects which submodules become wrapping units.
transformer_auto_wrap_policy	torch.distributed.fsdp.wrap	(helper)	Wrap at transformer-block granularity; supply `transformer_layer_cls`.
size_based_auto_wrap_policy	torch.distributed.fsdp.wrap	(helper)	Wrap units when parameter count exceeds threshold.
backward_prefetch	FSDP1 ctor / FSDP2 set_modules_to_forward_prefetch	BACKWARD_PRE	BACKWARD_PRE \| BACKWARD_POST \| NO_PREFETCH; PRE overlaps best.
forward_prefetch	FSDP1 ctor	false	Issue next unit's AllGather during current unit's forward.
use_orig_params	FSDP1 ctor	true (FSDP2)	Preserve original parameter objects (required for LoRA, per-param LR groups).
cpu_offload	CPUOffload(offload_params=True)	(off)	Spill parameter shards to CPU RAM; pairs with pinned memory.
limit_all_gathers	FSDP1 ctor	true	Throttle outstanding AllGathers to one at a time (memory cap).
state_dict_type	FSDP.set_state_dict_type	FULL_STATE_DICT	FULL_STATE_DICT \| SHARDED_STATE_DICT \| LOCAL_STATE_DICT.
sharded checkpoint format	torch.distributed.checkpoint	torch_dist	Parallel sharded save/load; required for 30B+ checkpoints.
fsdp_state_dict_type (Accelerate)	fsdp.yaml	SHARDED_STATE_DICT	Use SHARDED for checkpoints; FULL only for HF export.
fsdp_offload_params (Accelerate)	fsdp.yaml	false	Equivalent to CPUOffload(offload_params=True).
fsdp_sync_module_states	Accelerate	true	Broadcast rank-0 weights to all ranks at init.
fsdp_use_orig_params	Accelerate	true	Required for LoRA / DPO / partial-param training.
fsdp_cpu_ram_efficient_loading	Accelerate	true	Load weights on rank 0 only, then shard — avoids OOM on huge models.
fsdp_activation_checkpointing	Accelerate	false	Apply torch.utils.checkpoint to wrapped units.
mesh_dim for HSDP	init_device_mesh("cuda", (replica, shard), ...)	(custom)	2D mesh: shard inside each replica group, replicate across replicas.
NCCL_BUFFSIZE	env var	default	Raise to 8388608 (8MB) for FSDP at high world size.
TORCH_DISTRIBUTED_DEBUG	env var	OFF	DETAIL emits per-collective timing; INFO emits collective lifecycle.

Workload patterns#

python

# A — FULL_SHARD SFT of Llama 3.1 8B on a single 8x H100 node
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp",))
mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16,
                          reduce_dtype=torch.float32)
for layer in model.model.layers:
    fully_shard(layer, mesh=mesh, mp_policy=mp)
fully_shard(model, mesh=mesh, mp_policy=mp)
# Optimiser sees sharded params; AdamW(fused=True) is the standard choice.


# B — HYBRID_SHARD continued-pretrain of Llama 3.1 70B on 32 GPUs (4 nodes)
#     FULL_SHARD inside each 8-GPU node, replicated across the 4 nodes.
mesh = init_device_mesh("cuda", (4, 8), mesh_dim_names=("replicate", "shard"))
for layer in model.model.layers:
    fully_shard(layer, mesh=mesh["replicate", "shard"], mp_policy=mp,
                reshard_after_forward=True)
fully_shard(model, mesh=mesh["replicate", "shard"], mp_policy=mp)
# Effective layout: each node holds the full model in sharded form;
# inter-node AllReduce only runs on gradients (DDP-style).


# C — QLoRA on Llama 3.1 70B with CPU offload (Yobibyte managed shape)
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
import functools

# Apply 4-bit quantisation + LoRA adapters first, then wrap with FSDP1.
auto_wrap = functools.partial(transformer_auto_wrap_policy,
                              transformer_layer_cls={LlamaDecoderLayer})
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    auto_wrap_policy=auto_wrap,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.float32),
    cpu_offload=CPUOffload(offload_params=True),
    use_orig_params=True,           # required for PEFT/LoRA
    limit_all_gathers=True,
)

Sizing and capacity planning#

FULL_SHARD is the right default through 70B on 8 GPUs; above 70B, compose with TP via FSDP2 + DTensor.
HYBRID_SHARD beats FULL_SHARD on multi-node pods because the cross-node AllGather is the main cost — keep AllGather inside the NVLink island.
CPU offload costs 2-3x throughput; only worth it when no other capacity is available or when running QLoRA at the limit.
Activation checkpointing roughly halves activation memory at ~30 percent FLOP cost — required for 70B at long context on 80 GB cards.
FlashAttention-2/3 is mandatory at scale; without it, attention activations dominate before FSDP can help.
Effective batch = micro_batch x DP x grad_accum; aim for 1-4M tokens per step for stable AdamW.

Model	Cluster	Strategy	Per-rank GPU mem	Per-GPU tok/s
7B (Llama 3.1)	8x A100 80GB	FULL_SHARD	~36 GB	8,200
7B	8x H100 80GB	FULL_SHARD	~32 GB	14,800
8B	8x H100	FULL_SHARD	~38 GB	13,500
13B	8x H100	FULL_SHARD	~52 GB	8,400
70B	8x H100	FULL_SHARD	~74 GB	2,400
70B	8x H100 + CPU offload	FULL_SHARD + CPUOffload	~58 GB	950
70B (QLoRA)	8x H100 + CPU offload	FULL_SHARD + CPUOffload + 4-bit	~44 GB	1,800
70B	32x H100 (4 nodes)	HYBRID_SHARD	~52 GB	2,650
70B	64x H100 (8 nodes)	HYBRID_SHARD	~46 GB	2,700
70B + TP=2	32x H100, FSDP2 + TP	HSDP x TP=2	~38 GB	2,900
8x7B MoE (Mixtral)	16x H100	FULL_SHARD + EP=4	~62 GB	2,200

Limits and quotas#

Constraint	Default / ceiling	How to manage
DP group size	= world_size (FULL_SHARD)	No hard upper bound; AllGather amortises poorly past ~512.
HSDP replica x shard product	= world_size	Replica size = nodes, shard size = GPUs per node typical.
Wrapping unit count	Policy-dependent	Transformer-block granularity is the production default.
Per-unit AllGather memory	Working dtype x unit params	BF16 70B block ~280 MB — comfortable at limit_all_gathers=True.
CPU RAM (offload)	Host-bounded	Need ~16 bytes/param of CPU RAM for offload + AdamW state.
Checkpoint save (FULL)	2x model size on rank 0	Switch to SHARDED_STATE_DICT for routine saves.
NCCL communicator count	World-size dependent	Set NCCL_BUFFSIZE=8388608, NCCL_NTHREADS=2; use NCCL >= 2.20.
use_orig_params	true (FSDP2)	Required for LoRA, partial-freeze, per-param LR groups.
FSDP + TP composition	FSDP2 only	Use 2D device mesh; FSDP1 path is brittle.
FSDP + PP composition	FSDP2 + torch.distributed.pipelining	Stage boundaries cannot cross FSDP wrapping units.
BF16 hardware	Ampere+	Pre-Ampere needs FP16 + dynamic loss scaling (use torch.cuda.amp).
Single-step wallclock	1-15s typical	Iteration time should hold within +-5 percent in steady state.

Observability#

samples_per_second / tokens_per_second: holds within +-5 percent in steady state; drops correlate with AllGather under-overlap.
AllGather p99 / ReduceScatter p99: visible under TORCH_DISTRIBUTED_DEBUG=DETAIL; >100ms inside an NVLink island means topology regression.
grad_norm: spikes above 10x baseline indicate divergence — almost always LR or data, not FSDP.
GPU memory peak per step: log with torch.cuda.max_memory_allocated() to validate wrap policy.
DCGM_FI_PROF_NVLINK_TX_BYTES vs DCGM_FI_PROF_PCIE_TX_BYTES: separates FSDP collectives (NVLink) from CPU offload traffic (PCIe).
torch.profiler with NCCL traces: the right tool for diagnosing 'why is AllGather not overlapping'.
Checkpoint save duration: a sudden jump usually means FULL_STATE_DICT accidentally got re-enabled.

yaml

# Prometheus rules for an FSDP training job
groups:
  - name: fsdp-training
    interval: 60s
    rules:
      - alert: FSDPStepTimeRegression
        expr: |
          avg_over_time(fsdp:step_time_seconds[5m]) >
          1.10 * avg_over_time(fsdp:step_time_seconds[1h] offset 30m)
        for: 10m
        labels: { severity: warning, team: training }
        annotations:
          summary: "FSDP step time +10% vs 1h baseline on {{ $labels.job_name }}"

      - alert: FSDPAllGatherP99
        expr: histogram_quantile(0.99,
                rate(fsdp:allgather_seconds_bucket[5m])) > 0.2
        for: 10m
        annotations:
          summary: "FSDP AllGather p99 > 200ms — fabric or wrap-policy regression"

      - alert: FSDPGradNormSpike
        expr: fsdp:grad_norm > 10 * avg_over_time(fsdp:grad_norm[1h] offset 30m)
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "FSDP grad-norm spike — investigate LR or data corruption"

Cost and FinOps#

Single-node FSDP with FULL_SHARD beats DeepSpeed ZeRO-3 on the same hardware by ~5-10 percent on routine 7B-13B fine-tunes — same math, tighter PyTorch integration.
QLoRA is the cheapest 70B fine-tune path on a single 8x H100 node; the Yobibyte managed fine-tune service exposes this as a turnkey recipe at the same NeoCloud node rate.
HSDP across 4 nodes finishes the same job ~5x faster than single-node FULL_SHARD at the same USD-per-token — buy more nodes for time-to-result, not for cost.
Activation checkpointing trades ~30 percent FLOPs for ~5x activation memory — usually still cheaper than scaling out.
Reserved capacity (Yobitel NeoCloud 1-3 year terms) drops the rate 35-40 percent for predictable multi-month runs.

Config	Node shape	Hours / 1B tokens	Hourly rate	USD / 1B tokens
FULL_SHARD, no offload	8x H100 80GB	50	$24.80	$1,240
FULL_SHARD + CPU offload	8x H100 + 1TB DDR5	125	$28.40	$3,550
QLoRA + CPU offload	8x H100 + 1TB DDR5	65	$28.40	$1,846
HYBRID_SHARD (32 GPU pod)	32x H100 (4 nodes)	11	$99.20	$1,091
HSDP x TP=2 (32 GPU pod)	32x H100 (4 nodes)	10	$99.20	$992
DeepSpeed ZeRO-3 equivalent	8x H100 80GB	55	$24.80	$1,364

Security and compliance#

Migration and alternatives#

From / To	Effort	When to choose	What you keep / lose
DDP -> FSDP FULL_SHARD	Low — wrap + config	Model > 1 GPU memory.	Keep DDP semantics; gain N-fold memory headroom.
FSDP1 -> FSDP2	Medium — code rewrite	Need TP composition or per-param introspection.	Lose FlatParameter; gain DTensor + clean PEFT integration.
DeepSpeed ZeRO-3 -> FSDP	Medium — config + code	PyTorch-native stack, no NVMe offload needed.	Lose ZeRO-Infinity; gain FSDP2 + torchtune ergonomics.
FSDP -> DeepSpeed	Medium — config + launcher	Need NVMe offload or Megatron-DeepSpeed hybrid.	Gain offload tier; lose FSDP2 ergonomics.
FSDP -> Megatron-LM	High — different paradigm	True 3D parallelism at frontier scale.	Gain TP+PP+SP+CP; lose PyTorch-native ergonomics.
FSDP -> Yobibyte managed fine-tune	Trivial — submit job	Don't want to operate FSDP at all.	Lose direct flag access; gain pre-validated QLoRA / SFT recipes on Yobitel NeoCloud.

Troubleshooting#

Symptom	Cause	Fix
OOM during model load before training	Full model materialised on every rank before sharding.	Set fsdp_cpu_ram_efficient_loading=True or use torch.distributed.checkpoint.load.
LoRA / PEFT parameters do not update	use_orig_params=False flattens params; PEFT cannot find adapters.	Set use_orig_params=True (default in FSDP2).
AllGather not overlapping with compute	backward_prefetch=NO_PREFETCH or wrap policy too fine.	Set backward_prefetch=BACKWARD_PRE; wrap at transformer-block granularity.
Throughput dies past one node	FULL_SHARD AllGather over IB dominates.	Switch to HYBRID_SHARD with replicate dim = nodes.
from_pretrained fails on FSDP checkpoint	SHARDED_STATE_DICT cannot load into HF directly.	Convert with torch.distributed.checkpoint.format_utils.dcp_to_torch_save.
CPU offload step >10s per micro-batch	fused AdamW not used or PCIe Gen3 link.	Set fused=True on optimiser; verify PCIe Gen4 or Gen5.
NaN loss with FP16 mixed precision	Dynamic loss scale collapsed.	Switch to BF16 (mp_policy.param_dtype=torch.bfloat16).
Grad-norm spikes only on some ranks	Per-rank data divergence (bad shard or stale dataloader seed).	Verify deterministic dataloader sharding; check seed.
Checkpoint save takes >10 minutes	FULL_STATE_DICT gather on rank 0.	Switch to SHARDED_STATE_DICT + torch_dist async writes.
FSDP + TP shape error	Using FSDP1 instead of FSDP2 for 2D mesh.	Migrate to fully_shard() with init_device_mesh 2D mesh.
Hanging at init_process_group on multi-node	NCCL not finding the right NIC.	Set NCCL_SOCKET_IFNAME and NCCL_IB_HCA explicitly.
grad-norm zero on some steps	Activation checkpointing recomputing into FP16 with no autocast.	Wrap checkpoint function with torch.cuda.amp.autocast(dtype=torch.bfloat16).

Where this fits in the Yobitel stack#

References

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel · arXiv (Zhao et al., 2023)
PyTorch FSDP documentation · PyTorch
FSDP2: per-parameter sharding with DTensor · PyTorch 2.4
Introducing PyTorch Fully Sharded Data Parallel API · PyTorch Blog (Meta)
HuggingFace Accelerate FSDP integration · HuggingFace
torchtune · GitHub (Meta)

FSDP — Fully Sharded Data Parallel

Overview#

Quick start#

How it works#

Reference and specifications#

Workload patterns#

Sizing and capacity planning#

Limits and quotas#

Observability#

Cost and FinOps#

Security and compliance#

Migration and alternatives#

Troubleshooting#

Where this fits in the Yobitel stack#

References

Browse all entries

Deploy on Yobitel

FSDP — Fully Sharded Data Parallel

Overview#

Quick start#

How it works#

Reference and specifications#

Workload patterns#

Sizing and capacity planning#

Limits and quotas#

Observability#

Cost and FinOps#

Security and compliance#

Migration and alternatives#

Troubleshooting#

Where this fits in the Yobitel stack#

References

Browse all entries

Deploy on Yobitel