NCCL (NVIDIA Collective Communications Library)

TL;DR

NCCL (pronounced 'nickel') is NVIDIA's open-source library of GPU-aware collective operations — AllReduce, AllGather, ReduceScatter, Broadcast, AllToAll, plus point-to-point Send/Recv — released under BSD 3-Clause and the default multi-GPU comms backend for every major training and inference framework.
Auto-discovers system topology (NVLink/NVSwitch, PCIe trees, NIC affinity, InfiniBand, RoCEv2) and selects ring / tree / NVLS / CollNet (SHARP) / PAT algorithms per collective, per message size, per cluster shape.
Used by PyTorch DDP and FSDP, DeepSpeed ZeRO, Megatron-LM, JAX/XLA, vLLM, TensorRT-LLM, SGLang, and almost every other multi-GPU runtime — the lingua franca of NVIDIA-GPU collectives.
Tuned via ~60 environment variables; the half-dozen that matter most are `NCCL_ALGO`, `NCCL_PROTO`, `NCCL_IB_HCA`, `NCCL_IB_GID_INDEX`, `NCCL_COLLNET_ENABLE`, `NCCL_P2P_LEVEL`, plus `NCCL_DEBUG=INFO` for diagnosis.
Performance rule of thumb: ring AllReduce on N GPUs reaches (N-1)/N x link bandwidth at large message sizes; tree/SHARP outperforms ring below ~16 MB and above ~512 GPUs; AllToAll for MoE all-to-all stages scales worst and is usually the first collective to bottleneck.

Overview

The NVIDIA Collective Communications Library is the layer that turns a collection of GPUs into a distributed training fabric. It implements the standard MPI-style collective operations — AllReduce, AllGather, ReduceScatter, Broadcast, Reduce, AllToAll, plus point-to-point Send/Recv — but with GPU-direct memory access, hardware-aware topology discovery, and per-fabric algorithm selection. Where MPI was the lingua franca of CPU-centric HPC, NCCL is the lingua franca of GPU-centric AI.

Almost every distributed deep-learning framework calls into NCCL for multi-GPU communication: PyTorch's DistributedDataParallel and FSDP, DeepSpeed ZeRO-1/2/3, Megatron-LM, JAX/XLA via the XLA NCCL plugin, vLLM for tensor-parallel inference, TensorRT-LLM for the same, SGLang, NVIDIA NIM. Understanding NCCL is non-negotiable for anyone operating GPU clusters at scale: many silent performance regressions and hangs trace back to it.

Released in 2016 by NVIDIA, NCCL has had a steady minor-release cadence since: NCCL 2.17 introduced NVLS (NVLink SHARP) for Hopper, NCCL 2.23 added PAT (Parallel Aggregated Trees) for AllGather/ReduceScatter at scale, and NCCL 2.24 added FP8 reductions for Transformer Engine. As of 2026 the actively maintained line is NCCL 2.23+; older 2.x releases still ship in older CUDA images but should not be used for new builds.

NCCL is the collective layer Yobibyte uses on every multi-GPU workload — fine-tunes, batched inference, tensor-parallel serving — and the default comms backend on every Yobitel NeoCloud cluster image. This entry helps you operate NCCL in production and reach the line-rate behaviour you paid for, especially when a silent algorithm or NIC-affinity regression is the difference between a 10-day and a 14-day training run.

Quick start: minimal multi-GPU AllReduce in PyTorch

The shortest path from zero to a working NCCL collective. The script below initialises a PyTorch process group with the NCCL backend across the visible GPUs on a single node, runs an AllReduce of a small tensor, and prints the result on every rank. Launch with `torchrun --nproc_per_node=auto allreduce.py`. Verify the NCCL log shows the expected NIC and algorithm — then move to the `nccl-tests` benchmarks for a real measurement.

Multi-node: replace `torchrun --nproc_per_node` with `torchrun --nnodes N --node_rank R --rdzv_endpoint <head>:29500`. SLURM users wrap in `srun` with the launcher of their choice.
First-time validation should always be `nccl-tests` AllReduce, not a real training job. See the `Sizing and capacity planning` section for expected throughput.
Set `NCCL_DEBUG=INFO` for the first run on any new cluster. The log shows the discovered topology, chosen algorithm and NIC binding — if anything looks wrong, fix it before launching production training.

python

# allreduce.py — minimal NCCL AllReduce smoke test
# Launch: NCCL_DEBUG=INFO torchrun --nproc_per_node=auto allreduce.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    # torchrun exports LOCAL_RANK; bind this process to its own GPU before any collective.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()

    x = torch.full((1 << 20,), float(rank), device="cuda")  # 4 MiB float32
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    expected = sum(range(world))
    print(f"[rank {rank}/{world}] AllReduce sum head={x[0].item():.0f} expected={expected}")
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Pin `NCCL_IB_HCA` and `NCCL_P2P_LEVEL=NVL` from day one on multi-NIC nodes. Letting NCCL guess works most of the time, but the 10% of the time it picks the wrong NIC affinity costs 30-50% AllReduce throughput silently.

How it works: topology discovery and algorithm selection

On initialisation NCCL probes the system topology — NVLink links between GPUs, PCIe roots, NIC-to-NUMA affinity, HCA-to-GPU PCIe path, switch hierarchy — and builds an internal graph of available paths and their bandwidths. It then chooses, per collective and per message size, which algorithm to use and which physical paths the data should traverse.

Within a node, NCCL strongly prefers NVLink/NVSwitch where available and falls back to PCIe peer-to-peer, then to host-staging (NUMA-aware bounce buffers in pinned host memory) as the last resort. Across nodes, it uses one or more NICs per GPU; on InfiniBand fabrics it discovers Mellanox HCAs via the `mlx5` driver and uses GPUDirect RDMA to source data directly from GPU HBM; on RoCEv2 fabrics the same path applies but the GID selection is operator-controlled via `NCCL_IB_GID_INDEX`.

Algorithm selection is a runtime decision driven by NCCL's internal cost model. The model knows the topology graph, the message size, the GPU count and the user-provided algorithm/protocol hints, and picks the algorithm that minimises predicted time-to-completion. The five families are listed below.
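As a rough illustration of what such a cost model weighs, the sketch below uses the textbook latency-bandwidth (alpha-beta) formulas for ring and pipelined-tree AllReduce. It is not NCCL's internal model: the function names, the 5 microsecond per-hop latency and the 50 GB/s link bandwidth are all assumptions chosen for illustration.

```python
import math


def ring_allreduce_time(n, size_bytes, alpha_s, bw_bytes_per_s):
    """Textbook ring: 2(n-1) steps, each moving size/n bytes per rank."""
    steps = 2 * (n - 1)
    return steps * alpha_s + steps * (size_bytes / n) / bw_bytes_per_s


def tree_allreduce_time(n, size_bytes, alpha_s, bw_bytes_per_s):
    """Simplified pipelined tree: latency grows with depth, and the
    message crosses a link once going up and once coming down."""
    depth = math.ceil(math.log2(n))
    return 2 * depth * alpha_s + 2 * size_bytes / bw_bytes_per_s


def pick_algorithm(n, size_bytes, alpha_s=5e-6, bw_bytes_per_s=50e9):
    """Return the candidate with the lowest predicted completion time."""
    predicted = {
        "ring": ring_allreduce_time(n, size_bytes, alpha_s, bw_bytes_per_s),
        "tree": tree_allreduce_time(n, size_bytes, alpha_s, bw_bytes_per_s),
    }
    return min(predicted, key=predicted.get)


if __name__ == "__main__":
    for size in (1024, 1 << 20, 8 << 30):
        print(size, pick_algorithm(64, size))
```

With these assumed constants the sketch prefers tree for a 1 KiB message and ring for an 8 GiB message on 64 ranks. Real crossover points depend on the fabric and are best measured with `nccl-tests`.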

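Protocol selection (`NCCL_PROTO`) is orthogonal to algorithm choice. Simple moves large chunks and synchronises with memory fences, giving the highest bandwidth at large messages; LL (low-latency) sends 8-byte words that pair 4 bytes of data with a 4-byte flag, trading roughly half the bandwidth for much lower latency; LL128 packs 120 bytes of data and an 8-byte flag into each 128-byte line, keeping most of the bandwidth at low latency on NVLink-class links. Defaults are usually correct; force only when diagnosing.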

Ring AllReduce — every GPU forms a logical ring; reduce-scatter then all-gather around it. Optimal bandwidth at large messages on uniform topologies (achieves (N-1)/N x link bandwidth). The default when in doubt.
Tree AllReduce — hierarchical reduction up a binary tree, multicast back down. Lower latency for small messages (< ~16 MB) and better at very large GPU counts where ring latency grows linearly.
NVLS (NVLink SHARP) — uses the in-NVSwitch reduction engine on Hopper+/Blackwell to perform sums inside the NVSwitch ASIC rather than at endpoints. Only available within an NVLink domain.
CollNet / SHARP — in-network reduction on InfiniBand switches (Quantum/Quantum-2/Quantum-3 with SHARPv2/v3/v4). Enabled via `NCCL_COLLNET_ENABLE=1`. Halves bytes on the wire for AllReduce.
PAT (Parallel Aggregated Trees) — introduced in NCCL 2.23 for AllGather and ReduceScatter at scale. Trees of trees that hide latency by overlapping multiple stages.
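To make the reduce-scatter-then-all-gather structure of the ring concrete, here is a small pure-Python simulation. It models ranks as lists inside one process and says nothing about NCCL's actual kernels; `ring_allreduce` and its return shape are invented for this sketch.

```python
def ring_allreduce(inputs):
    """Simulate a sum AllReduce over a logical ring.

    inputs: one list of numbers per rank, all the same length, and that
    length must be divisible by the number of ranks.
    Returns (per-rank result buffers, elements sent by each rank).
    """
    n = len(inputs)
    length = len(inputs[0])
    if any(len(buf) != length for buf in inputs) or length % n:
        raise ValueError("need equal-length buffers divisible by rank count")
    chunk = length // n
    bufs = [list(buf) for buf in inputs]
    sent_per_rank = 0

    def snapshot(rank, c):
        # Copy chunk c of `rank` so every send in a step sees pre-step data.
        return bufs[rank][c * chunk:(c + 1) * chunk]

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod n
    # to its right neighbour, which adds it to its own copy.
    for step in range(n - 1):
        msgs = [((r + 1) % n, (r - step) % n, snapshot(r, (r - step) % n))
                for r in range(n)]
        for dst, c, payload in msgs:
            for offset, value in enumerate(payload):
                bufs[dst][c * chunk + offset] += value
        sent_per_rank += chunk

    # Phase 2, all-gather: rank r now owns the fully reduced chunk
    # (r + 1) mod n and forwards completed chunks around the ring.
    for step in range(n - 1):
        msgs = [((r + 1) % n, (r + 1 - step) % n,
                 snapshot(r, (r + 1 - step) % n))
                for r in range(n)]
        for dst, c, payload in msgs:
            bufs[dst][c * chunk:(c + 1) * chunk] = payload
        sent_per_rank += chunk

    return bufs, sent_per_rank


if __name__ == "__main__":
    data = [[rank * 10 + i for i in range(8)] for rank in range(4)]
    result, sent = ring_allreduce(data)
    print(result[0], sent)
```

Each rank sends 2(N-1) chunks of size S/N, that is 2(N-1)/N of the message, which is why the ring stays bandwidth-efficient as N grows while its step count, and therefore its latency, grows linearly.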

Reference: environment variables

The complete NCCL env-var surface is large (~60 variables); the table below is the operationally relevant subset. Pin the ones you care about per cluster via a sourced `nccl.env` file rather than passing them ad hoc; undocumented env-var differences between launcher and worker contexts are a regular cause of mystery regressions.

Variable	Purpose	Typical value
NCCL_DEBUG	Log verbosity	INFO for first run, WARN in steady state
NCCL_DEBUG_SUBSYS	Filter debug to subsystems	INIT,GRAPH,TUNING,COLL,P2P,NET
NCCL_DEBUG_FILE	Per-rank log file path template	/var/log/nccl/rank-%h-%p.log
NCCL_ALGO	Allowed algorithm set	Tree,Ring,NVLS,CollnetChain
NCCL_PROTO	Allowed protocol set	Simple,LL,LL128
NCCL_P2P_LEVEL	Maximum topological distance at which intra-node P2P is used	NVL (P2P only over NVLink) on H100/B200
NCCL_P2P_DISABLE	Disable peer-to-peer entirely	0 (do not set in production)
NCCL_IB_HCA	Pin specific HCAs by name	mlx5_0,mlx5_1,mlx5_2,mlx5_3
NCCL_IB_GID_INDEX	Which RoCEv2 GID to use	3 (typical RoCE v2 over IPv4)
NCCL_IB_TC	Traffic class byte for RoCEv2 traffic	106 (DSCP 26 << 2, plus ECN bits 0b10)
NCCL_IB_SL	InfiniBand Service Level	3
NCCL_IB_TIMEOUT	QP timeout exponent	22 (above the library default; raise to 23-24 on lossy paths)
NCCL_IB_RETRY_CNT	RDMA retry count	7
NCCL_COLLNET_ENABLE	Enable SHARP CollNet plugin	1 on Quantum-2/Quantum-3 fabrics
NCCL_SOCKET_IFNAME	TCP interface to use (fallback path)	^lo,docker
NCCL_SOCKET_FAMILY	IPv4 or IPv6 socket family	AF_INET
NCCL_TOPO_FILE	Override auto-discovered topology	Path to validated XML topology
NCCL_TOPO_DUMP_FILE	Dump discovered topology	/tmp/nccl-topo.xml (run once, inspect)
NCCL_NTHREADS	CUDA threads per block	Auto (rarely tune)
NCCL_BUFFSIZE	Channel buffer size in bytes	4194304 (4 MiB default)
NCCL_MIN_NCHANNELS	Floor on channels per ring	Auto
NCCL_MAX_NCHANNELS	Cap on channels per ring	32
NCCL_LAUNCH_MODE	Kernel launch strategy	GROUP
NCCL_GRAPH_REGISTER	Register buffers with CUDA graphs	1 (Hopper+)
NCCL_NVLS_ENABLE	Enable NVLink SHARP	1 on H100/H200/B200/GB200
NCCL_ASYNC_ERROR_HANDLING	Crash on collective error vs hang (read by PyTorch, not by NCCL)	1 (always set; named TORCH_NCCL_ASYNC_ERROR_HANDLING in recent PyTorch releases, where it is on by default)
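The typical `NCCL_IB_TC` value of 106 in the table is easier to audit once split into its parts. The sketch below assumes the value is interpreted as the standard IP traffic-class byte (6-bit DSCP in the upper bits, 2-bit ECN in the lower bits); the helper names are made up for illustration.

```python
def traffic_class(dscp: int, ecn: int = 0) -> int:
    """Pack a 6-bit DSCP and a 2-bit ECN field into one traffic-class byte."""
    if not 0 <= dscp < 64:
        raise ValueError("DSCP must fit in 6 bits")
    if not 0 <= ecn < 4:
        raise ValueError("ECN must fit in 2 bits")
    return (dscp << 2) | ecn


def split_traffic_class(tc: int) -> tuple:
    """Inverse of traffic_class(): return (dscp, ecn)."""
    if not 0 <= tc < 256:
        raise ValueError("traffic class must fit in one byte")
    return tc >> 2, tc & 0b11


if __name__ == "__main__":
    print(traffic_class(26, 2))       # value to compare with NCCL_IB_TC
    print(split_traffic_class(106))   # DSCP to compare with the switch QoS map
```

Comparing the decoded DSCP against the switch QoS configuration is a quick way to catch a mismatch before it shows up as PFC or ECN misbehaviour under load.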

bash

# /etc/profile.d/nccl.sh — canonical NCCL env for an H100 + NDR fabric
# Source this from every job launcher; do NOT scatter ad-hoc exports.

# Logging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,COLL,NET
export NCCL_DEBUG_FILE=/var/log/nccl/rank-%h-%p.log

# Algorithm and protocol — let NCCL pick within this set
export NCCL_ALGO=Tree,Ring,NVLS,CollnetChain,CollnetDirect
export NCCL_PROTO=Simple,LL,LL128

# Intra-node: use P2P only between NVLink-connected GPUs; other pairs fall back silently, so check the log
export NCCL_P2P_LEVEL=NVL

# InfiniBand: pin all 4 HCAs of an HGX baseboard, in NUMA order
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_IB_SL=3
export NCCL_IB_TIMEOUT=22

# Enable in-network reduction (Quantum-2 + SHARPv3)
export NCCL_COLLNET_ENABLE=1
export NCCL_NVLS_ENABLE=1

# Don't bind to docker0 / lo on fallback TCP paths
export NCCL_SOCKET_IFNAME=^lo,docker,virbr

# Crash on collective error rather than hang the job
export NCCL_ASYNC_ERROR_HANDLING=1
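Because launcher and worker environments can drift apart, it can help to compare what a worker actually sees against the sourced profile. The following is a minimal sketch, assuming the profile contains plain `export NCCL_X=value` lines without inline comments; `parse_profile` and `diff_env` are hypothetical helper names, not part of NCCL or PyTorch.

```python
import os
import re

# Matches lines such as: export NCCL_DEBUG=INFO
EXPORT_RE = re.compile(r"^\s*export\s+(NCCL_[A-Z0-9_]+)=(.*?)\s*$")


def parse_profile(text):
    """Collect NCCL_* exports from shell-profile text into a dict."""
    expected = {}
    for line in text.splitlines():
        match = EXPORT_RE.match(line)
        if match:
            expected[match.group(1)] = match.group(2).strip("'\"")
    return expected


def diff_env(expected, env=None):
    """Return {name: (expected, actual)} for every variable that differs."""
    env = os.environ if env is None else env
    return {
        name: (value, env.get(name))
        for name, value in expected.items()
        if env.get(name) != value
    }


if __name__ == "__main__":
    profile = "export NCCL_DEBUG=INFO\nexport NCCL_P2P_LEVEL=NVL\n"
    for name, (want, got) in diff_env(parse_profile(profile)).items():
        print(f"{name}: expected {want!r}, found {got!r}")
```

Running a check like this on every rank at job start, and failing fast on a non-empty diff, turns a silent configuration drift into an explicit error.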

Workload patterns

The three collective patterns that dominate AI workloads. Each has a different scaling behaviour and a different failure mode — knowing which one your training step depends on is the prerequisite for sizing and tuning.

Data-parallel training (DDP, FSDP, ZeRO): gradient synchronisation every step, as AllReduce in DDP and as ReduceScatter plus AllGather in FSDP and ZeRO-3. Dominated by ring algorithms at large messages; SHARP helps at scale (> 256 GPUs). Optimisation target: AllReduce bandwidth per GPU close to (N-1)/N x link rate. Failure mode: NIC affinity wrong, throughput halves silently.
Tensor-parallel attention (Megatron-LM, vLLM TP, TensorRT-LLM TP): AllReduce inside each transformer layer's attention and FFN. Smaller messages (16-64 MB), latency-sensitive, intra-node NVLink-bound. Optimisation target: minimise AllReduce latency. Failure mode: forced to PCIe path because two GPUs are on different NUMA nodes.
Expert-parallel MoE all-to-all (Mixtral, DeepSeek-V3, GPT-OSS): AllToAll routing tokens to experts and back. Worst-scaling collective; dominated by bisection bandwidth of the fabric. Optimisation target: maximise per-GPU bisection bandwidth (NVLink within domain, NDR/XDR across). Failure mode: expert routing imbalance + AllToAll = stragglers, tail-bound step time.
On Yobitel NeoCloud, the cluster image's default `/etc/profile.d/nccl.sh` pre-sets the algorithm and HCA-pinning combinations to match the pod's underlying fabric (NDR + SHARPv3 on H100/H200 pods, XDR + SHARPv4 on GB200 pods, Spectrum-X AI-tuned RoCE on the Ethernet-preferring sovereign pods) so customer jobs hit the expected pattern bandwidth on first run.
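To decide which pattern dominates a given job, it helps to estimate the bytes each collective moves per step. The sketch below uses back-of-envelope formulas that are assumptions for illustration: gradients the same size as the parameters, one activation tensor of shape (micro-batch, sequence, hidden) per tensor-parallel AllReduce, and uniform top-k routing for MoE dispatch. The function names are hypothetical.

```python
DTYPE_BYTES = {"fp32": 4, "bf16": 2, "fp16": 2, "fp8": 1}


def ddp_gradient_bytes(num_params, dtype="bf16"):
    """Total gradient volume reduced per step (frameworks split it into buckets)."""
    return num_params * DTYPE_BYTES[dtype]


def tp_allreduce_bytes(micro_batch, seq_len, hidden, dtype="bf16"):
    """Size of one tensor-parallel AllReduce over an activation tensor."""
    return micro_batch * seq_len * hidden * DTYPE_BYTES[dtype]


def moe_dispatch_bytes_per_rank(tokens_per_rank, hidden, top_k, dtype="bf16"):
    """Upper bound on bytes one rank hands to AllToAll when routing tokens."""
    return tokens_per_rank * top_k * hidden * DTYPE_BYTES[dtype]


if __name__ == "__main__":
    mib = 1 << 20
    print(ddp_gradient_bytes(7_000_000_000) / 1e9, "GB of gradients per step")
    print(tp_allreduce_bytes(1, 4096, 8192) / mib, "MiB per TP AllReduce")
    print(moe_dispatch_bytes_per_rank(4096, 8192, 2) / mib, "MiB per dispatch")
```

Under these assumptions a micro-batch of 1 with 4,096 tokens and hidden size 8,192 in bf16 gives a 64 MiB tensor-parallel AllReduce, which sits at the upper end of the range quoted for that pattern.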

bash

# Pattern-specific nccl-tests benchmarks
# 1) DDP-style AllReduce sweep
mpirun -np 64 ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1

# 2) TP-style AllReduce at TP-typical sizes
mpirun -np 8 ./build/all_reduce_perf -b 64K -e 64M -f 2 -g 1

# 3) MoE-style AllToAll
mpirun -np 64 ./build/alltoall_perf -b 32M -e 1G -f 2 -g 1

# 4) FSDP-style ReduceScatter + AllGather
mpirun -np 64 ./build/reduce_scatter_perf -b 8M -e 8G -f 2 -g 1
mpirun -np 64 ./build/all_gather_perf -b 8M -e 8G -f 2 -g 1

Sizing and capacity planning

Practical bandwidth targets. The table below is what to expect from a well-built H100 cluster running NCCL 2.23+; large deviations indicate fabric or affinity issues. busBW (bus bandwidth) is the figure `nccl-tests` reports after normalising algBW (message size divided by time) by a per-collective factor, so that it can be compared with link speed regardless of GPU count; for AllReduce, busBW = algBW x 2(N-1)/N.

Rule of thumb: AllReduce algBW = (N-1)/N x per-GPU link bandwidth. On 8 H100 NVLink (900 GB/s aggregate, ~450 GB/s per direction) the achievable AllReduce is ~390-430 GB/s.
SHARP/NVLS uplift is largest at large messages (> 256 MB) and large GPU counts (> 256). At small scale it can be slightly slower than ring due to tree-setup overhead — verify, don't assume.
AllToAll is bisection-bound: per-GPU AllToAll bandwidth = total fabric bisection / N^2. Halving fabric oversubscription doubles AllToAll throughput in MoE workloads.
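When comparing a measurement with the table, the conversion between algorithm bandwidth and bus bandwidth matters. The sketch below applies the per-collective factors as I understand them from the `nccl-tests` performance notes; treat the factor table and the helper names as assumptions to verify against the version you run.

```python
# busBW = algBW * factor(n), where n is the number of ranks.
BUSBW_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "alltoall": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def alg_bw_gbps(size_bytes, seconds):
    """Algorithm bandwidth in GB/s: message size divided by elapsed time."""
    return size_bytes / seconds / 1e9


def bus_bw_gbps(collective, n_ranks, algbw_gbps):
    """Convert algorithm bandwidth to bus bandwidth for a given collective."""
    return algbw_gbps * BUSBW_FACTOR[collective](n_ranks)


if __name__ == "__main__":
    algbw = alg_bw_gbps(1 << 30, 0.004)  # hypothetical 1 GiB AllReduce in 4 ms
    print(round(algbw, 1), round(bus_bw_gbps("all_reduce", 8, algbw), 1))
```

The practical consequence is that an AllReduce busBW figure on 8 GPUs is 1.75 times the algBW figure, so the two columns of benchmark output should never be compared against the same target.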

Topology	Collective	Message size	Expected busBW	Notes
8x H100 SXM5, single HGX, NVLink only	AllReduce (Ring)	1 GB	~430-470 GB/s	Close to NVLink 4.0 ceiling
8x H100 SXM5, single HGX, NVLink only	AllReduce (NVLS)	1 GB	~470-490 GB/s	NVLink SHARP fastest at large sizes
8x H100, single HGX	AllToAll	256 MB per pair	~380-420 GB/s	Bisection-bound
64x H100 (8 hosts), NDR fat-tree	AllReduce (Ring)	8 GB	~48-52 GB/s	Per-GPU; NDR per-GPU port ~50 GB/s
64x H100, NDR + SHARPv3	AllReduce (CollnetChain)	8 GB	~70-85 GB/s	Effective uplift from in-network reduction
1,024x H100, NDR + SHARPv3	AllReduce (Tree+CollNet)	8 GB	~45-50 GB/s	Tail-bound; SHARP critical at this scale
256x H100, NDR	AllToAll	64 MB per pair	~38-44 GB/s	MoE workload typical
8x B200, single rack, NVLink 5	AllReduce (NVLS)	1 GB	~860-920 GB/s	Double Hopper, with NVSwitch SHARP
72x GB200 NVL72, single NVLink domain	AllReduce (NVLS)	8 GB	~750-820 GB/s	Cross-rack via NVLink switch

Observability

`NCCL_DEBUG=INFO` produces a per-rank log showing the discovered topology, chosen algorithm per collective, NIC bindings, and any fallback decisions. For production, parse this log on job start to assert expected configuration; the first 200 lines of NCCL output are the most diagnostic signal in your stack. Key indicators to look for are listed below.

Beyond the log, job-level performance counters — bytes transferred per collective, time per collective, queue-pair counters — are exposed via the NVIDIA Resiliency Extension (`nvidia-resiliency-ext`), via UFM Telemetry (port-level on InfiniBand), and via the framework's own profiler. PyTorch Profiler captures per-collective op timing; Nsight Systems shows the full GPU timeline including NCCL kernels.

`NCCL INFO Bootstrap: Using <interface>:<ip>` — confirms rendezvous interface; should match your control-plane NIC, not RDMA NICs.
`NCCL INFO Channel XX/YY` and `NCCL INFO Using network IB` — confirms IB/RoCE path is active, not TCP fallback.
`NCCL INFO Connected ... using CollNet` — confirms SHARP CollNet plugin loaded and active. Silent absence = SHARP fallback to ring; major perf hit.
`NCCL INFO NCCL_NTHREADS set by environment` and `NCCL INFO Algorithm`/`Protocol` lines — confirm tuning hints took effect.
`NCCL WARN ... falling back to ...` — never benign; investigate.

bash

# Parse a NCCL log for the headline indicators
grep -E "NCCL INFO (Bootstrap|Using network|Connected.*CollNet|Algorithm|Protocol)" \
  /var/log/nccl/rank-*.log | sort -u

# Per-rank CPU affinity (each rank should be pinned near its GPU's NUMA node)
grep "NCCL INFO Setting affinity" /var/log/nccl/rank-*.log

# Hunt for fallbacks (any line is a red flag)
grep -E "NCCL WARN|falling back|disabled" /var/log/nccl/rank-*.log

# Live perf counters via UFM (InfiniBand); illustrative invocation, adapt to your site's UFM tooling
ufm_rest_cli get ports/counters --filter "rx_pause_count > 0 OR symbol_errors > 0"
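The same checks can be turned into a job-start assertion. Below is a minimal sketch that scans NCCL log text for the indicator strings quoted in this section; exact log wording varies between NCCL releases, so the patterns are assumptions to adjust, and `audit_nccl_log` is a hypothetical helper.

```python
import re

# Patterns taken from the indicators described in this section.
IB_ACTIVE = "NCCL INFO Using network IB"
COLLNET_ACTIVE = re.compile(r"Connected.*CollNet")
RED_FLAGS = re.compile(r"NCCL WARN|falling back|Using network Socket")


def audit_nccl_log(text, require_collnet=False):
    """Return a list of human-readable problems found in one rank's log."""
    lines = text.splitlines()
    problems = []
    if not any(IB_ACTIVE in line for line in lines):
        problems.append("IB/RoCE transport not confirmed")
    if require_collnet and not any(COLLNET_ACTIVE.search(l) for l in lines):
        problems.append("CollNet not confirmed")
    for line in lines:
        if RED_FLAGS.search(line):
            problems.append(f"red flag: {line.strip()}")
    return problems


if __name__ == "__main__":
    sample = "node1:42:42 [0] NCCL INFO Using network Socket\n"
    for problem in audit_nccl_log(sample, require_collnet=True):
        print(problem)
```

Failing the job when the returned list is non-empty is cheaper than discovering a transport fallback from step times several hours in.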

Cost and FinOps

NCCL itself is BSD-licensed and free; the cost it drives is GPU-hour waste from suboptimal collective performance. On a communication-bound 1,024-GPU H100 training run, a 20 % AllReduce regression stretches the run by up to ~20 %, which is ~$50-200k in additional GPU-hours per training week, depending on the negotiated rate.
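The arithmetic behind that range is simple enough to keep in a script. This is a sketch only: the hourly rates below are placeholder assumptions, not quoted prices, and `regression_cost_usd` is a hypothetical helper that assumes the slowdown applies to the whole step.

```python
def regression_cost_usd(gpus, baseline_hours, slowdown, usd_per_gpu_hour):
    """Extra spend caused by a run that takes `slowdown` (e.g. 0.20) longer."""
    if slowdown < 0:
        raise ValueError("slowdown must be non-negative")
    extra_hours = baseline_hours * slowdown
    return gpus * extra_hours * usd_per_gpu_hour


if __name__ == "__main__":
    week_hours = 7 * 24
    for rate in (2.0, 4.0):  # assumed USD per GPU-hour
        cost = regression_cost_usd(1024, week_hours, 0.20, rate)
        print(f"rate ${rate:.2f}/GPU-h -> ${cost:,.0f} extra per training week")
```

At the assumed $2 and $4 per GPU-hour the extra spend lands at roughly $69k and $138k per week, inside the range quoted above.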

The FinOps levers are:

Validate NCCL_IB_HCA pinning at job start. Saves 30-50% AllReduce throughput on misconfigured nodes; pays for itself in hours.
Enable SHARP/NVLS on supported fabrics. 10-30% AllReduce uplift at scale; free if the hardware is already there.
Pin a single NCCL release per cluster. Mixed-version deadlocks cost full multi-day re-runs (~$200k+ on a 1,024-H100 job).
Use `nccl-tests` as the GPU-cluster smoke test before every multi-week training run. A 15-minute validation up front catches the issues that otherwise surface 4 days into a 14-day run.
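Avoid TCP fallback in production. `NCCL_SOCKET_IFNAME=^lo,docker` keeps bootstrap and socket traffic off loopback and container bridges, and `NCCL_IB_DISABLE=0` (the default) leaves the IB/RoCE transport enabled, but neither guarantees it is used: confirm `Using network IB` in the log. TCP fallback at 25/100/400 GbE is 5-20x slower.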

Security and compliance

NCCL itself does not implement authentication or encryption between ranks; it assumes a trusted underlying fabric. For multi-tenant clusters, the relevant isolation primitives are InfiniBand PKeys (partition keys, fabric-level VLAN-equivalent) and the GPU's own MIG / Confidential Compute mode (NVIDIA Hopper+ CC-on, with attestation and AES-256-GCM encryption of CPU-GPU transfers).

Operational hardening: run jobs in containers with NCCL traffic confined to dedicated CNI networks, scrub `NCCL_DEBUG_FILE` paths to per-job directories, and never share `/dev/infiniband/*` devices across non-cooperating tenants.

Compliance: NCCL has no compliance posture of its own. Inherit it from the surrounding stack — DGX-bundled NCCL ships under NVIDIA's enterprise support and qualifies for HIPAA/SOC 2 reference designs; community NCCL is FOSS BSD-3 and treated as a library dependency for software bill of materials (SBOM, e.g. SPDX).

Migration and alternatives

On NVIDIA GPUs there is essentially no NCCL alternative in production — every framework defaults to it. On AMD GPUs the analogue is RCCL (ROCm Collective Communications Library, an NCCL-compatible ABI fork); on AWS Trainium it is the Neuron Collective Communications library; on Intel Gaudi it is HCCL.

Moving from NCCL to RCCL on AMD usually needs no backend change in PyTorch: ROCm builds keep `backend='nccl'` and route it to RCCL underneath. Test thoroughly, because the feature-parity gap (NVLS, CollNet) means topology-specific tuning differs.
MSCCLang is worth investigating only when your collective is bespoke (custom topology, non-power-of-2 GPU counts): write the algorithm in MSCCLang, compile to MSCCL, register as an NCCL plugin.
Falling back to Gloo for GPU collectives is an anti-pattern — Gloo's GPU path is host-staged and 10-50x slower than NCCL. Only use Gloo for CPU-side coordination (rendezvous, barriers).

Alternative	Platform	API compatibility with NCCL	Maturity
RCCL	AMD MI200/MI300	ABI-compatible drop-in	Production; gap closing vs NCCL
oneCCL	Intel CPU/GPU	Similar API, different headers	Production for Intel-only stacks
HCCL	Intel Gaudi 2/3	Habana-specific	Production within Gaudi
Neuron CC	AWS Trainium / Inferentia	Neuron SDK-specific	Production within AWS
MSCCL / MSCCLang	NVIDIA GPUs (Microsoft Research project)	NCCL-compatible, custom algorithms	Research-grade
Gloo	Any (PyTorch backend)	Lower-level, no GPU optimisation	Legacy; CPU-only AllReduce in production

Troubleshooting

The vast majority of NCCL incidents fall into a small number of categories. The table below maps the symptom you see to the first thing to check.

Symptom	Most likely cause	First action
Job hangs at startup, no NCCL log lines	Rendezvous failure (firewall, wrong interface)	Set NCCL_DEBUG=INFO; verify NCCL_SOCKET_IFNAME; check bootstrap port reachability
NCCL hangs mid-step, all ranks stuck	GPU OOM on one rank, or NCCL_IB_TIMEOUT too tight on lossy fabric	Check `dmesg` on all hosts; raise NCCL_IB_TIMEOUT to 24; verify ECN/PFC on RoCE
AllReduce throughput half of expected	Wrong NIC affinity (cross-NUMA) or PCIe path instead of NVLink	Dump topology with NCCL_TOPO_DUMP_FILE; pin NCCL_IB_HCA in NUMA order; set NCCL_P2P_LEVEL=NVL
AllReduce throughput much less than half expected	TCP fallback active (IB/RoCE init failed)	grep for `Using network Socket` in NCCL log; fix IB driver / GID index
`unhandled cuda error` during collective	Mismatched CUDA versions across ranks, or buffer freed before kernel complete	Pin CUDA + driver version per node; ensure synchronisation before tensor reuse
SHARP/CollNet absent silently	libnccl-net.so / libsharp.so not on loader path, or Aggregation Manager not running	Check `ldconfig -p | grep nccl`; verify `sharp_am` process on the head node
NCCL OOM (`ncclSystemError: out of memory`)	NCCL_BUFFSIZE too large for available GPU memory	Return NCCL_BUFFSIZE to the 4 MiB default if it was raised; reduce NCCL_MAX_NCHANNELS
RoCEv2: AllReduce collapses under load	PFC/ECN not configured or wrong DSCP/PCP mapping	Verify NCCL_IB_TC matches DSCP on switches; check PFC pause counters on TOR
Mixed-version deadlock at init	Different NCCL versions across nodes	Standardise NCCL version per cluster; pin in container images
Collective slows over time	Memory fragmentation in cudaMallocFromPoolAsync, or SHARP AN exhaustion	Restart job; check Aggregation Manager AN allocation
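The two throughput rows of the table can be encoded as a first-pass triage on a benchmark result. The sketch below is illustrative only: the 0.9 and 0.4 ratio thresholds are arbitrary assumptions standing in for "about half" and "much less than half", and `triage_allreduce` is a hypothetical helper.

```python
def triage_allreduce(measured_gbps, expected_gbps,
                     ok_ratio=0.9, fallback_ratio=0.4):
    """Map measured vs expected busBW to the first thing to check."""
    if expected_gbps <= 0:
        raise ValueError("expected bandwidth must be positive")
    ratio = measured_gbps / expected_gbps
    if ratio >= ok_ratio:
        return "ok"
    if ratio >= fallback_ratio:
        # Roughly half of expected: affinity or intra-node path problem.
        return "check NIC affinity and NVLink/PCIe path"
    # Far below half: the fast transport is probably not in use at all.
    return "check for TCP fallback"


if __name__ == "__main__":
    for measured in (460.0, 225.0, 20.0):
        print(measured, "->", triage_allreduce(measured, 450.0))
```

The returned string only points at the first row to consult; the NCCL log remains the source of truth for what actually happened.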

The most expensive NCCL incident class is silent: SHARP enabled, the env var set, but the CollNet plugin failing to load — the job runs to completion at ring-AllReduce performance and the operator never notices the missing 20-30 % uplift. Always grep for `Connected ... using CollNet` in the first job's NCCL log on every new cluster.

Where this fits in the Yobitel stack

Every multi-GPU training and large-context inference workload running on Yobitel's GPU cloud, NeoCloud Tier-III pods, or sovereign UK clusters goes through NCCL. Yobitel's fabric defaults — Quantum-2 NDR with SHARPv3 on the H100/H200 pods, Quantum-3 XDR with SHARPv4 on the Blackwell pods, Spectrum-X RoCEv2 on the Ethernet-preferring sovereign pods — are tuned so that the default NCCL env-var profile shipped with the cluster image hits the bandwidth targets in the sizing table above without per-job tweaking.

Customers running on Yobibyte do not need to touch NCCL directly — the managed inference and fine-tune surface handles topology pinning, NIC affinity and SHARP enablement automatically. Customers running on raw Yobitel GPU Cloud get a `/etc/profile.d/nccl.sh` baked into the image with the cluster-appropriate variables already set, plus the `nccl-tests` binaries pre-installed at `/opt/nccl-tests/build/`.

TL;DR

NCCL (pronounced 'nickel') is NVIDIA's open-source library of GPU-aware collective operations — AllReduce, AllGather, ReduceScatter, Broadcast, AllToAll, plus point-to-point Send/Recv — released under BSD 3-Clause and the default multi-GPU comms backend for every major training and inference framework.
Auto-discovers system topology (NVLink/NVSwitch, PCIe trees, NIC affinity, InfiniBand, RoCEv2) and selects ring / tree / NVLS / CollNet (SHARP) / PAT algorithms per collective, per message size, per cluster shape.
Used by PyTorch DDP and FSDP, DeepSpeed ZeRO, Megatron-LM, JAX/XLA, vLLM, TensorRT-LLM, SGLang, and almost every other multi-GPU runtime — the lingua franca of NVIDIA-GPU collectives.
Tuned via ~60 environment variables; the half-dozen that matter most are `NCCL_ALGO`, `NCCL_PROTO`, `NCCL_IB_HCA`, `NCCL_IB_GID_INDEX`, `NCCL_COLLNET_ENABLE`, `NCCL_P2P_LEVEL`, plus `NCCL_DEBUG=INFO` for diagnosis.
Performance rule of thumb: ring AllReduce on N GPUs reaches (N-1)/N x link bandwidth at large message sizes; tree/SHARP outperforms ring below ~16 MB and above ~512 GPUs; AllToAll for MoE all-to-all stages scales worst and is usually the first collective to bottleneck.

Overview#

Quick start: minimal multi-GPU AllReduce in PyTorch#

Multi-node: replace `torchrun --nproc_per_node` with `torchrun --nnodes N --node_rank R --rdzv_endpoint <head>:29500`. SLURM users wrap in `srun` with the launcher of their choice.
First-time validation should always be `nccl-tests` AllReduce, not a real training job. See the `Sizing and capacity planning` section for expected throughput.
Set `NCCL_DEBUG=INFO` for the first run on any new cluster. The log shows the discovered topology, chosen algorithm and NIC binding — if anything looks wrong, fix it before launching production training.

python

# allreduce.py — minimal NCCL AllReduce smoke test
# Launch: NCCL_DEBUG=INFO torchrun --nproc_per_node=auto allreduce.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    x = torch.full((1 << 20,), float(rank), device="cuda")  # 4 MiB float32
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    expected = sum(range(world))
    print(f"[rank {rank}/{world}] AllReduce sum head={x[0].item():.0f} expected={expected}")
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

How it works: topology discovery and algorithm selection#

Ring AllReduce — every GPU forms a logical ring; reduce-scatter then all-gather around it. Optimal bandwidth at large messages on uniform topologies (achieves (N-1)/N x link bandwidth). The default when in doubt.
Tree AllReduce — hierarchical reduction up a binary tree, multicast back down. Lower latency for small messages (< ~16 MB) and better at very large GPU counts where ring latency grows linearly.
NVLS (NVLink SHARP) — uses the in-NVSwitch reduction engine on Hopper+/Blackwell to perform sums inside the NVSwitch ASIC rather than at endpoints. Only available within an NVLink domain.
CollNet / SHARP — in-network reduction on InfiniBand switches (Quantum/Quantum-2/Quantum-3 with SHARPv2/v3/v4). Enabled via `NCCL_COLLNET_ENABLE=1`. Halves bytes on the wire for AllReduce.
PAT (Parallel Aggregated Trees) — introduced in NCCL 2.23 for AllGather and ReduceScatter at scale. Trees of trees that hide latency by overlapping multiple stages.

Reference: environment variables#

Variable	Purpose	Typical value
NCCL_DEBUG	Log verbosity	INFO for first run, WARN in steady state
NCCL_DEBUG_SUBSYS	Filter debug to subsystems	INIT,GRAPH,TUNING,COLL,P2P,NET
NCCL_DEBUG_FILE	Per-rank log file path template	/var/log/nccl/rank-%h-%p.log
NCCL_ALGO	Allowed algorithm set	Tree,Ring,NVLS,CollnetChain
NCCL_PROTO	Allowed protocol set	Simple,LL,LL128
NCCL_P2P_LEVEL	Strictest path for intra-node P2P	NVL (require NVLink) on H100/B200
NCCL_P2P_DISABLE	Disable peer-to-peer entirely	0 (do not set in production)
NCCL_IB_HCA	Pin specific HCAs by name	mlx5_0,mlx5_1,mlx5_2,mlx5_3
NCCL_IB_GID_INDEX	Which RoCEv2 GID to use	3 (typical RoCE v2 over IPv4)
NCCL_IB_TC	DSCP class for RoCEv2 traffic	106 (DSCP 26 x 4)
NCCL_IB_SL	InfiniBand Service Level	3
NCCL_IB_TIMEOUT	QP timeout exponent	22 (default; raise to 23-24 on lossy paths)
NCCL_IB_RETRY_CNT	RDMA retry count	7
NCCL_COLLNET_ENABLE	Enable SHARP CollNet plugin	1 on Quantum-2/Quantum-3 fabrics
NCCL_SOCKET_IFNAME	TCP interface to use (fallback path)	^lo,docker
NCCL_SOCKET_FAMILY	IPv4 or IPv6 socket family	AF_INET
NCCL_TOPO_FILE	Override auto-discovered topology	Path to validated XML topology
NCCL_TOPO_DUMP_FILE	Dump discovered topology	/tmp/nccl-topo.xml (run once, inspect)
NCCL_NTHREADS	Worker threads per device	Auto (rarely tune)
NCCL_BUFFSIZE	Channel buffer size in bytes	8388608 (8 MiB default)
NCCL_MIN_NCHANNELS	Floor on channels per ring	Auto
NCCL_MAX_NCHANNELS	Cap on channels per ring	32
NCCL_LAUNCH_MODE	Kernel launch strategy	GROUP
NCCL_GRAPH_REGISTER	Register buffers with CUDA graphs	1 (Hopper+)
NCCL_NVLS_ENABLE	Enable NVLink SHARP	1 on H100/H200/B200/GB200
NCCL_ASYNC_ERROR_HANDLING	Crash on collective error vs hang	1 (always set; default for PyTorch >= 2.4)

bash

# /etc/profile.d/nccl.sh — canonical NCCL env for an H100 + NDR fabric
# Source this from every job launcher; do NOT scatter ad-hoc exports.

# Logging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,COLL,NET
export NCCL_DEBUG_FILE=/var/log/nccl/rank-%h-%p.log

# Algorithm and protocol — let NCCL pick within this set
export NCCL_ALGO=Tree,Ring,NVLS,CollnetChain,CollnetDirect
export NCCL_PROTO=Simple,LL,LL128

# Intra-node: require NVLink, fail loudly if topology is wrong
export NCCL_P2P_LEVEL=NVL

# InfiniBand: pin all 4 HCAs of an HGX baseboard, in NUMA order
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_IB_SL=3
export NCCL_IB_TIMEOUT=22

# Enable in-network reduction (Quantum-2 + SHARPv3)
export NCCL_COLLNET_ENABLE=1
export NCCL_NVLS_ENABLE=1

# Don't bind to docker0 / lo on fallback TCP paths
export NCCL_SOCKET_IFNAME=^lo,docker,virbr

# Crash on collective error rather than hang the job
export NCCL_ASYNC_ERROR_HANDLING=1

Workload patterns#

Data-parallel training (DDP, FSDP, ZeRO): AllReduce of gradients every step. Dominated by ring AllReduce at large messages; SHARP helps at scale (> 256 GPUs). Optimisation target: AllReduce bandwidth per GPU close to (N-1)/N x link rate. Failure mode: NIC affinity wrong, throughput halves silently.
Tensor-parallel attention (Megatron-LM, vLLM TP, TensorRT-LLM TP): AllReduce inside each transformer layer's attention and FFN. Smaller messages (16-64 MB), latency-sensitive, intra-node NVLink-bound. Optimisation target: minimise AllReduce latency. Failure mode: forced to PCIe path because two GPUs are on different NUMA nodes.
Expert-parallel MoE all-to-all (Mixtral, DeepSeek-V3, GPT-OSS): AllToAll routing tokens to experts and back. Worst-scaling collective; dominated by bisection bandwidth of the fabric. Optimisation target: maximise per-GPU bisection bandwidth (NVLink within domain, NDR/XDR across). Failure mode: expert routing imbalance + AllToAll = stragglers, tail-bound step time.
On Yobitel NeoCloud, the cluster image's default `/etc/profile.d/nccl.sh` pre-sets the algorithm and HCA-pinning combinations to match the pod's underlying fabric (NDR + SHARPv3 on H100/H200 pods, XDR + SHARPv4 on GB200 pods, Spectrum-X AI-tuned RoCE on the Ethernet-preferring sovereign pods) so customer jobs hit the expected pattern bandwidth on first run.

bash

# Pattern-specific nccl-tests benchmarks
# 1) DDP-style AllReduce sweep
mpirun -np 64 ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1

# 2) TP-style AllReduce at TP-typical sizes
mpirun -np 8 ./build/all_reduce_perf -b 64K -e 64M -f 2 -g 1

# 3) MoE-style AllToAll
mpirun -np 64 ./build/alltoall_perf -b 32M -e 1G -f 2 -g 1

# 4) FSDP-style ReduceScatter + AllGather
mpirun -np 64 ./build/reduce_scatter_perf -b 8M -e 8G -f 2 -g 1
mpirun -np 64 ./build/all_gather_perf -b 8M -e 8G -f 2 -g 1

Sizing and capacity planning#

Rule of thumb: AllReduce algBW = (N-1)/N x per-GPU link bandwidth. On 8 H100 NVLink (900 GB/s aggregate, ~450 GB/s per direction) the achievable AllReduce is ~390-430 GB/s.
SHARP/NVLS uplift is largest at large messages (> 256 MB) and large GPU counts (> 256). At small scale it can be slightly slower than ring due to tree-setup overhead — verify, don't assume.
AllToAll is bisection-bound: per-GPU AllToAll bandwidth scales roughly as total fabric bisection bandwidth / N (per-pair bandwidth as bisection / N^2). Halving fabric oversubscription roughly doubles AllToAll throughput in MoE workloads.
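
The AllReduce rule of thumb above is easy to apply as arithmetic. The sketch below simply evaluates (N-1)/N x per-GPU link bandwidth; the function name is hypothetical and the result is a planning estimate, not a guarantee of measured throughput.

```python
# Sketch: evaluate the (N-1)/N rule of thumb for ring AllReduce.
# The function name is illustrative; inputs are planning assumptions.

def ring_allreduce_estimate_gbs(n_gpus: int, link_gbs: float) -> float:
    """Estimated AllReduce bandwidth in GB/s from the (N-1)/N rule of thumb."""
    if n_gpus < 2:
        raise ValueError("AllReduce needs at least 2 GPUs")
    return (n_gpus - 1) / n_gpus * link_gbs


if __name__ == "__main__":
    # 8 GPUs on NVLink at ~450 GB/s per direction.
    print(ring_allreduce_estimate_gbs(8, 450.0))   # 393.75
    # 64 GPUs limited by an NDR port at ~50 GB/s per GPU.
    print(ring_allreduce_estimate_gbs(64, 50.0))   # 49.21875
```

For N = 8 and 450 GB/s per direction this gives 393.75 GB/s, the low end of the ~390-430 GB/s range quoted above.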

Topology	Collective	Message size	Expected busBW	Notes
8x H100 SXM5, single HGX, NVLink only	AllReduce (Ring)	1 GB	~430-470 GB/s	Close to NVLink 4.0 ceiling
8x H100 SXM5, single HGX, NVLink only	AllReduce (NVLS)	1 GB	~470-490 GB/s	NVLink SHARP fastest at large sizes
8x H100, single HGX	AllToAll	256 MB per pair	~380-420 GB/s	Bisection-bound
64x H100 (8 hosts), NDR fat-tree	AllReduce (Ring)	8 GB	~48-52 GB/s	Per-GPU; NDR per-GPU port ~50 GB/s
64x H100, NDR + SHARPv3	AllReduce (CollnetChain)	8 GB	~70-85 GB/s	Effective uplift from in-network reduction
1,024x H100, NDR + SHARPv3	AllReduce (Tree+CollNet)	8 GB	~45-50 GB/s	Tail-bound; SHARP critical at this scale
256x H100, NDR	AllToAll	64 MB per pair	~38-44 GB/s	MoE workload typical
8x B200, single rack, NVLink 5	AllReduce (NVLS)	1 GB	~860-920 GB/s	Double Hopper, with NVSwitch SHARP
72x GB200 NVL72, single NVLink domain	AllReduce (NVLS)	8 GB	~750-820 GB/s	Cross-rack via NVLink switch
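
One way to put the table to work is as a lookup of expected ranges against which a measured busBW is classified. The sketch below copies a few rows from the table; the dictionary keys, the function name, and the 10% tolerance band are assumptions made for illustration.

```python
# Sketch: classify a measured busBW against expected ranges from the table above.
# Keys, function name, and the 10% tolerance are illustrative assumptions.

EXPECTED_BUSBW_GBS = {
    ("8xH100-NVLink", "allreduce-ring"): (430.0, 470.0),
    ("8xH100-NVLink", "allreduce-nvls"): (470.0, 490.0),
    ("64xH100-NDR", "allreduce-ring"): (48.0, 52.0),
    ("256xH100-NDR", "alltoall"): (38.0, 44.0),
}


def classify_busbw(topology: str, collective: str, measured_gbs: float,
                   tolerance: float = 0.10) -> str:
    """Return 'ok', 'marginal', or 'investigate' for a measured busBW."""
    low, _high = EXPECTED_BUSBW_GBS[(topology, collective)]
    if measured_gbs >= low:
        return "ok"
    if measured_gbs >= low * (1.0 - tolerance):
        return "marginal"
    return "investigate"


if __name__ == "__main__":
    # Roughly half the expected rate is the classic wrong-NIC-affinity signature.
    print(classify_busbw("64xH100-NDR", "allreduce-ring", 25.0))
```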

Observability#

`NCCL INFO Bootstrap: Using <interface>:<ip>` — confirms rendezvous interface; should match your control-plane NIC, not RDMA NICs.
`NCCL INFO Channel XX/YY` and `NCCL INFO Using network IB` — confirms IB/RoCE path is active, not TCP fallback.
`NCCL INFO Connected ... using CollNet` — confirms SHARP CollNet plugin loaded and active. Silent absence = SHARP fallback to ring; major perf hit.
`NCCL INFO NCCL_NTHREADS set by environment` and `NCCL INFO Algorithm`/`Protocol` lines — confirm tuning hints took effect.
`NCCL WARN ... falling back to ...` — never benign; investigate.

bash

# Parse a NCCL log for the headline indicators
grep -E "NCCL INFO (Bootstrap|Using network|Connected.*CollNet|Algorithm|Protocol)" \
  /var/log/nccl/rank-*.log | sort -u

# Per-rank CPU affinity (each rank should be bound to cores on its GPU's local NUMA node)
grep "NCCL INFO Setting affinity" /var/log/nccl/rank-*.log

# Hunt for fallbacks (any line is a red flag)
grep -E "NCCL WARN|falling back|disabled" /var/log/nccl/rank-*.log

# Live perf counters via UFM (InfiniBand)
ufm_rest_cli get ports/counters --filter "rx_pause_count > 0 OR symbol_errors > 0"
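
The grep checks above can also be expressed as a small script that verifies every rank selected the same transport and counts warnings per rank. This is a minimal sketch, assuming per-rank log text is already loaded into a dictionary; the function name and the sample log lines are hypothetical, and only the `Using network ...` and `NCCL WARN` markers are taken from this page.

```python
# Sketch: summarise per-rank NCCL logs for transport uniformity and warnings.
# Function name and sample lines are illustrative.
import re

NETWORK_RE = re.compile(r"NCCL INFO Using network (\S+)")


def summarize_nccl_logs(logs: dict) -> dict:
    """logs maps a rank label to that rank's full log text."""
    transports = {}
    warn_counts = {}
    for rank, text in logs.items():
        match = NETWORK_RE.search(text)
        transports[rank] = match.group(1) if match else None
        warn_counts[rank] = sum(
            1 for line in text.splitlines() if "NCCL WARN" in line
        )
    return {
        "transports": transports,
        # All ranks should report the same transport.
        "uniform_transport": len(set(transports.values())) == 1,
        # "Socket" means the rank fell back to TCP.
        "tcp_fallback_ranks": sorted(
            r for r, t in transports.items() if t == "Socket"
        ),
        "warn_counts": warn_counts,
    }


if __name__ == "__main__":
    sample = {
        "rank-0": "host0:101:101 [0] NCCL INFO Using network IB\n",
        "rank-1": "host1:102:102 [0] NCCL INFO Using network Socket\n"
                  "host1:102:102 [0] NCCL WARN example warning\n",
    }
    print(summarize_nccl_logs(sample))
```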

Cost and FinOps#

The FinOps levers are:

Validate NCCL_IB_HCA pinning at job start. Saves 30-50% AllReduce throughput on misconfigured nodes; pays for itself in hours.
Enable SHARP/NVLS on supported fabrics. 10-30% AllReduce uplift at scale; free if the hardware is already there.
Pin a single NCCL release per cluster. Mixed-version deadlocks cost full multi-day re-runs (~$200k+ on a 1,024-H100 job).
Use `nccl-tests` as the GPU-cluster smoke test before every multi-week training run. A 15-minute validation up front catches the issues that otherwise surface 4 days into a 14-day run.
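Avoid TCP fallback in production. `NCCL_SOCKET_IFNAME=^lo,docker` keeps loopback and Docker bridge interfaces out of socket selection, and `NCCL_IB_DISABLE=0` leaves the IB/RoCE transport enabled; neither guarantees that IB initialises, so confirm `Using network IB` in the log. TCP fallback at 25/100/400 GbE is 5-20x slower.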
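
The smoke-test lever can be automated by gating the job on the summary line that nccl-tests prints at the end of a run. The sketch below assumes that line has the form `# Avg bus bandwidth : <value>`; the function names and the threshold are hypothetical, and because the figure averages over every message size in the sweep, the threshold should be calibrated on a known-good run rather than taken from peak numbers.

```python
# Sketch: gate a training launch on the nccl-tests average bus bandwidth.
# Function names and thresholds are illustrative assumptions.
import re

AVG_RE = re.compile(
    r"^#\s*Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)", re.MULTILINE
)


def avg_bus_bandwidth_gbs(nccl_tests_output: str) -> float:
    """Extract the average bus bandwidth (GB/s) from nccl-tests output."""
    match = AVG_RE.search(nccl_tests_output)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' line found")
    return float(match.group(1))


def passes_smoke_test(nccl_tests_output: str, min_gbs: float) -> bool:
    """True when the measured average meets the calibrated minimum."""
    return avg_bus_bandwidth_gbs(nccl_tests_output) >= min_gbs


if __name__ == "__main__":
    sample = "# Out of bounds values : 0 OK\n# Avg bus bandwidth    : 412.35 \n#\n"
    print(passes_smoke_test(sample, min_gbs=380.0))
```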

Security and compliance#

Migration and alternatives#

Switching from NCCL to RCCL on AMD usually needs no application-level change in PyTorch: ROCm builds keep the `backend='nccl'` name and route it to RCCL. Test thoroughly: the feature-parity gap with NCCL (NVLS, CollNet) means topology-specific tuning differs.
MSCCLang is worth investigating only when your collective is bespoke (custom topology, non-power-of-2 GPU counts): write the algorithm in MSCCLang, compile it to an MSCCL algorithm file, and load it through the MSCCL runtime, which is built on NCCL.
Falling back to Gloo for GPU collectives is an anti-pattern — Gloo's GPU path is host-staged and 10-50x slower than NCCL. Only use Gloo for CPU-side coordination (rendezvous, barriers).
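
The Gloo guidance above can be enforced at start-up by refusing to fall back silently. The following sketch is a pure function so it can be tested without a GPU; the function name is hypothetical, and in a real job its arguments would come from `torch.cuda.is_available()` and `torch.distributed.is_nccl_available()`.

```python
# Sketch: choose a torch.distributed backend without silently using Gloo on GPUs.
# The function name is illustrative.

def pick_backend(cuda_available: bool, nccl_available: bool) -> str:
    """Return 'nccl' for GPU jobs and 'gloo' for CPU-only jobs."""
    if not cuda_available:
        # CPU-only process: Gloo is the appropriate backend.
        return "gloo"
    if not nccl_available:
        # GPUs are present but NCCL is missing: fail loudly instead of
        # accepting a host-staged Gloo path for GPU collectives.
        raise RuntimeError("CUDA is available but the NCCL backend is not")
    return "nccl"


# Typical use inside a training script (requires PyTorch):
#   import torch
#   import torch.distributed as dist
#   backend = pick_backend(torch.cuda.is_available(), dist.is_nccl_available())
#   dist.init_process_group(backend=backend)
```

Raising in the mixed case is a design choice: a job that starts on Gloo with GPU tensors will run, only far more slowly, which is harder to notice than a failed launch.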

Alternative	Platform	API compatibility with NCCL	Maturity
RCCL	AMD MI200/MI300	API-compatible (same `nccl*` function names)	Production; gap closing vs NCCL
oneCCL	Intel CPU/GPU	Similar API, different headers	Production for Intel-only stacks
HCCL	Intel Gaudi 2/3	Habana-specific	Production within Gaudi
Neuron CC	AWS Trainium / Inferentia	Neuron SDK-specific	Production within AWS
MSCCL / MSCCLang	NVIDIA GPUs (Microsoft Research project)	NCCL-compatible, custom algorithms	Research-grade
Gloo	Any (PyTorch backend)	Lower-level, no GPU optimisation	Legacy; CPU-only AllReduce in production

Troubleshooting#

The vast majority of NCCL incidents fall into a small number of categories. The table below maps the symptom you see to the first thing to check.

Symptom	Most likely cause	First action
Job hangs at startup, no NCCL log lines	Rendezvous failure (firewall, wrong interface)	Set NCCL_DEBUG=INFO; verify NCCL_SOCKET_IFNAME; check bootstrap port reachability
NCCL hangs mid-step, all ranks stuck	GPU OOM on one rank, or NCCL_IB_TIMEOUT too tight on lossy fabric	Check `dmesg` on all hosts; raise NCCL_IB_TIMEOUT to 24; verify ECN/PFC on RoCE
AllReduce throughput half of expected	Wrong NIC affinity (cross-NUMA) or PCIe path instead of NVLink	Dump topology with NCCL_TOPO_DUMP_FILE; pin NCCL_IB_HCA in NUMA order; set NCCL_P2P_LEVEL=NVL
AllReduce throughput much less than half expected	TCP fallback active (IB/RoCE init failed)	grep for `Using network Socket` in NCCL log; fix IB driver / GID index
`unhandled cuda error` during collective	Mismatched CUDA versions across ranks, or buffer freed before the kernel completes	Pin CUDA + driver version per node; ensure synchronisation before tensor reuse
SHARP/CollNet absent silently	libnccl-net.so / libsharp.so not on loader path, or Aggregation Manager not running	Check `ldconfig -p | grep nccl`; verify `sharp_am` process on the head node
NCCL OOM (`ncclSystemError: out of memory`)	NCCL_BUFFSIZE too large for available GPU memory	Lower NCCL_BUFFSIZE below its 4 MiB default (for example to 2 MiB); reduce NCCL_MAX_NCHANNELS
RoCEv2: AllReduce collapses under load	PFC/ECN not configured or wrong DSCP/PCP mapping	Verify NCCL_IB_TC matches DSCP on switches; check PFC pause counters on TOR
Mixed-version deadlock at init	Different NCCL versions across nodes	Standardise NCCL version per cluster; pin in container images
Collective slows over time	Memory fragmentation in cudaMallocFromPoolAsync, or SHARP AN exhaustion	Restart job; check Aggregation Manager AN allocation
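
The mixed-version row in the table above lends itself to a pre-flight check. The sketch below groups hosts by reported NCCL version and returns the groups only when more than one version is present; the function name and host labels are hypothetical, and the version tuples are assumed to have been collected beforehand, for example from `torch.cuda.nccl.version()` on each node.

```python
# Sketch: detect mixed NCCL versions across nodes before launching a job.
# Function name and host labels are illustrative.
from collections import defaultdict


def find_version_mismatch(versions: dict) -> dict:
    """versions maps host name to an NCCL version tuple such as (2, 21, 5).

    Returns {} when all hosts agree, otherwise a mapping of version -> hosts.
    """
    hosts_by_version = defaultdict(list)
    for host, version in versions.items():
        hosts_by_version[tuple(version)].append(host)
    if len(hosts_by_version) <= 1:
        return {}
    return {v: sorted(h) for v, h in hosts_by_version.items()}


if __name__ == "__main__":
    reported = {
        "node-a": (2, 21, 5),
        "node-b": (2, 21, 5),
        "node-c": (2, 19, 3),
    }
    mismatch = find_version_mismatch(reported)
    if mismatch:
        print("Refusing to launch, mixed NCCL versions:", mismatch)
```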

Where this fits in the Yobitel stack#

References

NCCL Documentation · NVIDIA
NCCL Environment Variables Reference · NVIDIA
NCCL GitHub Repository · NVIDIA
nccl-tests Benchmark Suite · NVIDIA
Massively Scale Your Deep Learning Training with NCCL 2.x · NVIDIA Developer Blog
PyTorch Distributed: NCCL Backend · PyTorch
