TL;DR
- NCCL (pronounced 'nickel') is NVIDIA's open-source library of GPU-aware collective operations — AllReduce, AllGather, ReduceScatter, Broadcast, AllToAll, plus point-to-point Send/Recv — released under BSD 3-Clause and the default multi-GPU comms backend for every major training and inference framework.
- Auto-discovers system topology (NVLink/NVSwitch, PCIe trees, NIC affinity, InfiniBand, RoCEv2) and selects ring / tree / NVLS / CollNet (SHARP) / PAT algorithms per collective, per message size, per cluster shape.
- Used by PyTorch DDP and FSDP, DeepSpeed ZeRO, Megatron-LM, JAX/XLA, vLLM, TensorRT-LLM, SGLang, and almost every other multi-GPU runtime — the lingua franca of NVIDIA-GPU collectives.
- Tuned via ~60 environment variables; the half-dozen that matter most are `NCCL_ALGO`, `NCCL_PROTO`, `NCCL_IB_HCA`, `NCCL_IB_GID_INDEX`, `NCCL_COLLNET_ENABLE`, `NCCL_P2P_LEVEL`, plus `NCCL_DEBUG=INFO` for diagnosis.
- Performance rule of thumb: ring AllReduce on N GPUs moves 2(N-1)/N x the message size per GPU, so its bus bandwidth approaches link bandwidth at large message sizes; tree/SHARP outperforms ring below ~16 MB and above ~512 GPUs; AllToAll for MoE expert-routing stages scales worst and is usually the first collective to bottleneck.
Overview
The NVIDIA Collective Communications Library is the layer that turns a collection of GPUs into a distributed training fabric. It implements the standard MPI-style collective operations — AllReduce, AllGather, ReduceScatter, Broadcast, Reduce, AllToAll, plus point-to-point Send/Recv — but with GPU-direct memory access, hardware-aware topology discovery, and per-fabric algorithm selection. Where MPI was the lingua franca of CPU-centric HPC, NCCL is the lingua franca of GPU-centric AI.
Almost every distributed deep-learning framework calls into NCCL for multi-GPU communication: PyTorch's DistributedDataParallel and FSDP, DeepSpeed ZeRO-1/2/3, Megatron-LM, JAX/XLA via the XLA NCCL plugin, vLLM for tensor-parallel inference, TensorRT-LLM for the same, SGLang, NVIDIA NIM. Understanding NCCL is non-negotiable for anyone operating GPU clusters at scale: many silent performance regressions and most hangs trace back to it.
Released in 2016 by NVIDIA, NCCL has had a steady minor-release cadence since: NCCL 2.18 introduced NVLS (NVLink SHARP) for Hopper, NCCL 2.20 added FP8 reductions for Transformer Engine, NCCL 2.23 added PAT (Parallel Aggregated Trees) for AllGather/ReduceScatter at scale. As of 2026 the actively maintained line is NCCL 2.23+; older 2.x releases still ship in older CUDA images but should not be used for new builds.
NCCL is the collective layer Yobibyte uses on every multi-GPU workload — fine-tunes, batched inference, tensor-parallel serving — and the default comms backend on every Yobitel NeoCloud cluster image. This entry helps you operate NCCL in production and reach the line-rate behaviour you paid for, especially when a silent algorithm or NIC-affinity regression is the difference between a 10-day and a 14-day training run.
Quick start: minimal multi-GPU AllReduce in PyTorch
The shortest path from zero to a working NCCL collective. The script below initialises a PyTorch process group with the NCCL backend across the visible GPUs on a single node, runs an AllReduce of a small tensor, and prints the result on every rank. Launch with `torchrun --nproc_per_node=auto allreduce.py`. Verify the NCCL log shows the expected NIC and algorithm — then move to the `nccl-tests` benchmarks for a real measurement.
- Multi-node: add `--nnodes N --node_rank R --rdzv_endpoint <head>:29500` to the `torchrun` command on each node. SLURM users wrap it in `srun` with the launcher of their choice.
- First-time validation should always be `nccl-tests` AllReduce, not a real training job. See the `Sizing and capacity planning` section for expected throughput.
- Set `NCCL_DEBUG=INFO` for the first run on any new cluster. The log shows the discovered topology, chosen algorithm and NIC binding — if anything looks wrong, fix it before launching production training.
```python
# allreduce.py: minimal NCCL AllReduce smoke test
# Launch: NCCL_DEBUG=INFO torchrun --nproc_per_node=auto allreduce.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun exports LOCAL_RANK; bind this process to its GPU before NCCL init.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()

    x = torch.full((1 << 20,), float(rank), device="cuda")  # 4 MiB float32
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # Every rank contributed its own rank number, so the sum is 0 + 1 + ... + (world - 1).
    expected = sum(range(world))
    print(f"[rank {rank}/{world}] AllReduce sum head={x[0].item():.0f} expected={expected}")

    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Pin `NCCL_IB_HCA` and `NCCL_P2P_LEVEL=NVL` from day one on multi-NIC nodes. Letting NCCL guess works most of the time, but in the roughly 10% of cases where it picks the wrong NIC affinity, AllReduce throughput silently drops by 30-50%.
How it works: topology discovery and algorithm selection
On initialisation NCCL probes the system topology — NVLink links between GPUs, PCIe roots, NIC-to-NUMA affinity, HCA-to-GPU PCIe path, switch hierarchy — and builds an internal graph of available paths and their bandwidths. It then chooses, per collective and per message size, which algorithm to use and which physical paths the data should traverse.
Within a node, NCCL strongly prefers NVLink/NVSwitch where available and falls back to PCIe peer-to-peer, then to host-staging (NUMA-aware bounce buffers in pinned host memory) as the last resort. Across nodes, it uses one or more NICs per GPU; on InfiniBand fabrics it discovers Mellanox HCAs via the `mlx5` driver and uses GPUDirect RDMA to source data directly from GPU HBM; on RoCEv2 fabrics the same path applies but the GID selection is operator-controlled via `NCCL_IB_GID_INDEX`.
Algorithm selection is a runtime decision driven by NCCL's internal cost model. The model knows the topology graph, the message size, the GPU count and the user-provided algorithm/protocol hints, and picks the algorithm that minimises predicted time-to-completion. The five families are listed below.
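Protocol selection (`NCCL_PROTO`) is orthogonal to algorithm choice. Simple moves large chunks and synchronises with memory fences, which gives the highest bandwidth but the highest per-step latency. LL (low latency) writes small packets that carry a flag next to the data, so the receiver detects arrival by polling instead of waiting for a fence, at the cost of wire efficiency. LL128 applies the same idea to 128-byte packets and recovers most of the bandwidth, but depends on NVLink-class interconnects. Defaults are usually correct; force a protocol only when diagnosing.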
- Ring AllReduce: every GPU forms a logical ring; reduce-scatter then all-gather around it. Bandwidth-optimal at large messages on uniform topologies (each GPU sends 2(N-1)/N x the message size, so busBW approaches link bandwidth). The default when in doubt.
- Tree AllReduce: hierarchical reduction up a binary tree, then broadcast back down. Lower latency for small messages (< ~16 MB) and better at very large GPU counts, where ring latency grows linearly with N.
- NVLS (NVLink SHARP): uses the in-NVSwitch reduction engine on Hopper+/Blackwell to perform sums inside the NVSwitch ASIC rather than at endpoints. Only available within an NVLink domain.
- CollNet / SHARP: in-network reduction on InfiniBand switches (Quantum/Quantum-2/Quantum-3 with SHARPv2/v3/v4). Enabled via `NCCL_COLLNET_ENABLE=1`. Halves bytes on the wire for AllReduce.
- PAT (Parallel Aggregated Trees): introduced in NCCL 2.23 for AllGather and ReduceScatter at scale. Trees of trees that hide latency by overlapping multiple stages.
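The ring AllReduce described in the list above can be sketched in plain Python to make the two phases concrete. This is a host-side toy simulation over Python lists, not how NCCL implements the collective on GPUs; the function name `ring_allreduce` and the even-chunking requirement are assumptions made for illustration.

```python
def ring_allreduce(buffers):
    """Simulate a sum AllReduce over a logical ring (toy model, not NCCL code).

    buffers: one equal-length list of numbers per rank.
    Returns (per-rank results, elements sent by each rank).
    """
    n = len(buffers)
    length = len(buffers[0])
    if length % n:
        raise ValueError("this sketch assumes the buffer splits evenly into n chunks")
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies
    sent = [0] * n

    def exchange(chunk_to_send, combine):
        # Snapshot every payload first so that all ranks "send" simultaneously.
        payloads = []
        for r in range(n):
            lo = chunk_to_send(r) * chunk
            payloads.append((lo, data[r][lo:lo + chunk]))
        # Each rank r delivers its payload to its ring neighbour r + 1.
        for r, (lo, payload) in enumerate(payloads):
            dst = (r + 1) % n
            sent[r] += len(payload)
            for i, value in enumerate(payload):
                data[dst][lo + i] = combine(data[dst][lo + i], value)

    # Phase 1, reduce-scatter: after n - 1 steps rank r holds the fully
    # reduced chunk (r + 1) mod n.
    for s in range(n - 1):
        exchange(lambda r: (r - s) % n, lambda mine, theirs: mine + theirs)

    # Phase 2, all-gather: circulate the reduced chunks until every rank has all of them.
    for s in range(n - 1):
        exchange(lambda r: (r + 1 - s) % n, lambda mine, theirs: theirs)

    return data, sent
```

Calling `ring_allreduce([[float(r + 1)] * 8 for r in range(4)])` leaves every rank with the same summed buffer, and each rank has sent 2(N-1)/N times the buffer length, which is where the bandwidth rule of thumb comes from.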
Reference: environment variables
The complete NCCL env-var surface is large (~60 variables); the table below is the operationally relevant subset. Pin the ones you care about per cluster via a sourced `nccl.env` file rather than passing them ad hoc: undocumented env-var differences between launcher and worker contexts are a regular cause of mystery regressions.
| Variable | Purpose | Typical value |
|---|---|---|
| NCCL_DEBUG | Log verbosity | INFO for first run, WARN in steady state |
| NCCL_DEBUG_SUBSYS | Filter debug to subsystems | INIT,GRAPH,TUNING,COLL,P2P,NET |
| NCCL_DEBUG_FILE | Per-rank log file path template | /var/log/nccl/rank-%h-%p.log |
| NCCL_ALGO | Allowed algorithm set | Tree,Ring,NVLS,CollnetChain |
| NCCL_PROTO | Allowed protocol set | Simple,LL,LL128 |
| NCCL_P2P_LEVEL | Maximum GPU-to-GPU distance at which intra-node P2P is used | NVL (P2P only over NVLink) on H100/B200 |
| NCCL_P2P_DISABLE | Disable peer-to-peer entirely | 0 (do not set in production) |
| NCCL_IB_HCA | Pin specific HCAs by name | mlx5_0,mlx5_1,mlx5_2,mlx5_3 |
| NCCL_IB_GID_INDEX | Which RoCEv2 GID to use | 3 (typical RoCE v2 over IPv4) |
| NCCL_IB_TC | Traffic class byte for RoCEv2 traffic | 106 (DSCP 26 << 2, plus ECN bits 0b10) |
| NCCL_IB_SL | InfiniBand Service Level | 3 |
| NCCL_IB_TIMEOUT | QP timeout exponent | 22 (raise to 23-24 on lossy paths) |
| NCCL_IB_RETRY_CNT | RDMA retry count | 7 |
| NCCL_COLLNET_ENABLE | Enable SHARP CollNet plugin | 1 on Quantum-2/Quantum-3 fabrics |
| NCCL_SOCKET_IFNAME | TCP interface to use (fallback path) | ^lo,docker |
| NCCL_SOCKET_FAMILY | IPv4 or IPv6 socket family | AF_INET |
| NCCL_TOPO_FILE | Override auto-discovered topology | Path to validated XML topology |
| NCCL_TOPO_DUMP_FILE | Dump discovered topology | /tmp/nccl-topo.xml (run once, inspect) |
| NCCL_NTHREADS | Worker threads per device | Auto (rarely tune) |
| NCCL_BUFFSIZE | Channel buffer size in bytes | 4194304 (4 MiB default) |
| NCCL_MIN_NCHANNELS | Floor on channels per ring | Auto |
| NCCL_MAX_NCHANNELS | Cap on channels per ring | 32 |
| NCCL_LAUNCH_MODE | Kernel launch strategy | GROUP |
| NCCL_GRAPH_REGISTER | Register buffers with CUDA graphs | 1 (Hopper+) |
| NCCL_NVLS_ENABLE | Enable NVLink SHARP | 1 on H100/H200/B200/GB200 |
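| NCCL_ASYNC_ERROR_HANDLING | Crash on collective error vs hang | 1 (always set; default for PyTorch >= 2.4). Read by PyTorch rather than by NCCL; recent PyTorch releases spell it TORCH_NCCL_ASYNC_ERROR_HANDLING |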
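The `NCCL_IB_TC` value in the table is a traffic-class byte, which carries the 6-bit DSCP in its upper bits and the 2-bit ECN field in its lower bits. The small helper below shows the arithmetic; the function name `ib_traffic_class` is hypothetical, and whether your switches expect the ECN bits to be set is site-specific, so treat the result as something to check against the fabric configuration rather than a recommendation.

```python
def ib_traffic_class(dscp: int, ecn: int = 0) -> int:
    """Pack a DSCP code point and ECN bits into a traffic-class byte."""
    if not 0 <= dscp < 64:
        raise ValueError("DSCP is a 6-bit field (0-63)")
    if not 0 <= ecn < 4:
        raise ValueError("ECN is a 2-bit field (0-3)")
    # DSCP occupies bits 7..2, ECN occupies bits 1..0.
    return (dscp << 2) | ecn
```

With DSCP 26 and ECN bits `0b10` this yields 106, the value used in the table; DSCP 26 with the ECN bits cleared gives 104.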
```bash
# /etc/profile.d/nccl.sh: canonical NCCL env for an H100 + NDR fabric
# Source this from every job launcher; do NOT scatter ad-hoc exports.

# Logging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,COLL,NET
export NCCL_DEBUG_FILE=/var/log/nccl/rank-%h-%p.log

# Algorithm and protocol: let NCCL pick within this set
export NCCL_ALGO=Tree,Ring,NVLS,CollnetChain,CollnetDirect
export NCCL_PROTO=Simple,LL,LL128

# Intra-node: use P2P only between NVLink-connected GPUs
# (other GPU pairs fall back to shared memory rather than failing)
export NCCL_P2P_LEVEL=NVL

# InfiniBand: pin all 4 HCAs of an HGX baseboard, in NUMA order
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=106
export NCCL_IB_SL=3
export NCCL_IB_TIMEOUT=22

# Enable in-network reduction (Quantum-2 + SHARPv3)
export NCCL_COLLNET_ENABLE=1
export NCCL_NVLS_ENABLE=1

# Don't bind to docker0 / lo on fallback TCP paths
export NCCL_SOCKET_IFNAME=^lo,docker,virbr

# Crash on collective error rather than hang the job (read by PyTorch, not NCCL)
export NCCL_ASYNC_ERROR_HANDLING=1
```

Workload patterns
The three collective patterns that dominate AI workloads. Each has a different scaling behaviour and a different failure mode — knowing which one your training step depends on is the prerequisite for sizing and tuning.
- Data-parallel training (DDP, FSDP, ZeRO): AllReduce of gradients every step. Dominated by ring AllReduce at large messages; SHARP helps at scale (> 256 GPUs). Optimisation target: AllReduce busBW per GPU close to the link rate. Failure mode: NIC affinity wrong, throughput halves silently.
- Tensor-parallel attention (Megatron-LM, vLLM TP, TensorRT-LLM TP): AllReduce inside each transformer layer's attention and FFN. Smaller messages (16-64 MB), latency-sensitive, intra-node NVLink-bound. Optimisation target: minimise AllReduce latency. Failure mode: forced to PCIe path because two GPUs are on different NUMA nodes.
- Expert-parallel MoE all-to-all (Mixtral, DeepSeek-V3, GPT-OSS): AllToAll routing tokens to experts and back. Worst-scaling collective; dominated by bisection bandwidth of the fabric. Optimisation target: maximise per-GPU bisection bandwidth (NVLink within domain, NDR/XDR across). Failure mode: expert routing imbalance + AllToAll = stragglers, tail-bound step time.
- On Yobitel NeoCloud, the cluster image's default `/etc/profile.d/nccl.sh` pre-sets the algorithm and HCA-pinning combinations to match the pod's underlying fabric (NDR + SHARPv3 on H100/H200 pods, XDR + SHARPv4 on GB200 pods, Spectrum-X AI-tuned RoCE on the Ethernet-preferring sovereign pods) so customer jobs hit the expected pattern bandwidth on first run.
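The MoE failure mode above (routing imbalance turning into stragglers) can be quantified with a simple ratio. The sketch below is a rough proxy, assuming one expert per rank and that the AllToAll step finishes only when the busiest expert has received all its tokens; the function name `alltoall_imbalance` and the input format are hypothetical.

```python
from collections import Counter


def alltoall_imbalance(expert_of_token, num_experts):
    """Return max/mean tokens per expert; 1.0 means perfectly balanced routing.

    expert_of_token: iterable with the expert index chosen for each token.
    """
    counts = Counter(expert_of_token)
    per_expert = [counts.get(e, 0) for e in range(num_experts)]
    mean = sum(per_expert) / num_experts
    if mean == 0:
        return 1.0  # no tokens routed, nothing to wait for
    # The busiest expert receives max(per_expert) tokens, so the step is
    # stretched by roughly this factor relative to a balanced routing.
    return max(per_expert) / mean
```

A value of 2.0 means the hottest expert receives twice the average load, so the tail of the AllToAll is roughly twice as long as in the balanced case. Tracking this ratio per step alongside AllToAll timing helps separate routing problems from fabric problems.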
```bash
# Pattern-specific nccl-tests benchmarks

# 1) DDP-style AllReduce sweep
mpirun -np 64 ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1

# 2) TP-style AllReduce at TP-typical sizes
mpirun -np 8 ./build/all_reduce_perf -b 64K -e 64M -f 2 -g 1

# 3) MoE-style AllToAll
mpirun -np 64 ./build/alltoall_perf -b 32M -e 1G -f 2 -g 1

# 4) FSDP-style ReduceScatter + AllGather
mpirun -np 64 ./build/reduce_scatter_perf -b 8M -e 8G -f 2 -g 1
mpirun -np 64 ./build/all_gather_perf -b 8M -e 8G -f 2 -g 1
```

Sizing and capacity planning
Practical bandwidth targets. The table below is what to expect from a well-built H100 cluster running NCCL 2.23+; large deviations indicate fabric or affinity issues. busBW (bus bandwidth), as reported by `nccl-tests`, is algBW (message size / time) scaled by a collective-specific factor so that it can be compared directly with hardware link bandwidth regardless of rank count. For AllReduce, busBW = algBW x 2(N-1)/N, whichever algorithm NCCL selected.
- Rule of thumb: ring AllReduce busBW approaches the per-GPU link bandwidth, so algBW tops out near N/(2(N-1)) x link bandwidth. On 8 H100 NVLink (900 GB/s aggregate, ~450 GB/s per direction) the achievable AllReduce is ~390-430 GB/s.
- SHARP/NVLS uplift is largest at large messages (> 256 MB) and large GPU counts (> 256). At small scale it can be slightly slower than ring due to tree-setup overhead — verify, don't assume.
- AllToAll is bisection-bound: per-GPU AllToAll bandwidth scales as total fabric bisection / N (per-pair bandwidth as bisection / N^2). Halving fabric oversubscription doubles AllToAll throughput in MoE workloads.
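To compare your own timings against busBW targets, the conversion from a measured time to bus bandwidth can be sketched as follows. The correction factors follow the conventions documented in the `nccl-tests` performance notes; the names `BUS_FACTORS` and `bus_bandwidth` are hypothetical helpers, not part of any NCCL API.

```python
# Collective-specific factors that turn algBW into busBW (n = number of ranks).
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "alltoall": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def bus_bandwidth(collective, size_bytes, seconds, n_ranks):
    """Return busBW in bytes per second for one timed collective."""
    if collective not in BUS_FACTORS:
        raise ValueError(f"unknown collective: {collective}")
    alg_bw = size_bytes / seconds  # algorithm bandwidth: payload over wall time
    return alg_bw * BUS_FACTORS[collective](n_ranks)
```

For example, a 1 GB AllReduce across 8 ranks that completes in 4 ms has an algBW of 250 GB/s and a busBW of 437.5 GB/s, which is the number to hold against a per-direction link rate.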
| Topology | Collective | Message size | Expected busBW | Notes |
|---|---|---|---|---|
| 8x H100 SXM5, single HGX, NVLink only | AllReduce (Ring) | 1 GB | ~430-470 GB/s | Close to NVLink 4.0 ceiling |
| 8x H100 SXM5, single HGX, NVLink only | AllReduce (NVLS) | 1 GB | ~470-490 GB/s | NVLink SHARP fastest at large sizes |
| 8x H100, single HGX | AllToAll | 256 MB per pair | ~380-420 GB/s | Bisection-bound |
| 64x H100 (8 hosts), NDR fat-tree | AllReduce (Ring) | 8 GB | ~48-52 GB/s | Per-GPU; NDR per-GPU port ~50 GB/s |
| 64x H100, NDR + SHARPv3 | AllReduce (CollnetChain) | 8 GB | ~70-85 GB/s | Effective uplift from in-network reduction |
| 1,024x H100, NDR + SHARPv3 | AllReduce (Tree+CollNet) | 8 GB | ~45-50 GB/s | Tail-bound; SHARP critical at this scale |
| 256x H100, NDR | AllToAll | 64 MB per pair | ~38-44 GB/s | MoE workload typical |
| 8x B200, single rack, NVLink 5 | AllReduce (NVLS) | 1 GB | ~860-920 GB/s | Double Hopper, with NVSwitch SHARP |
| 72x GB200 NVL72, single NVLink domain | AllReduce (NVLS) | 8 GB | ~750-820 GB/s | Cross-rack via NVLink switch |
Observability
`NCCL_DEBUG=INFO` produces a per-rank log showing the discovered topology, chosen algorithm per collective, NIC bindings, and any fallback decisions. For production, parse this log on job start to assert expected configuration; the first 200 lines of NCCL output carry more diagnostic signal than anything else in your stack. Key indicators to look for are listed below.
Beyond the log, job-level performance counters — bytes transferred per collective, time per collective, queue-pair counters — are exposed via the NVIDIA Resiliency Extension (`nvidia-resiliency-ext`), via UFM Telemetry (port-level on InfiniBand), and via the framework's own profiler. PyTorch Profiler captures per-collective op timing; Nsight Systems shows the full GPU timeline including NCCL kernels.
- `NCCL INFO Bootstrap: Using <interface>:<ip>` — confirms rendezvous interface; should match your control-plane NIC, not RDMA NICs.
- `NCCL INFO Channel XX/YY` and `NCCL INFO Using network IB` — confirms IB/RoCE path is active, not TCP fallback.
- `NCCL INFO Connected ... using CollNet` — confirms SHARP CollNet plugin loaded and active. Silent absence = SHARP fallback to ring; major perf hit.
- `NCCL INFO NCCL_NTHREADS set by environment` and `NCCL INFO Algorithm`/`Protocol` lines — confirm tuning hints took effect.
- `NCCL WARN ... falling back to ...` — never benign; investigate.
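The indicators above lend themselves to an automated gate at job start. The sketch below scans the text of one rank's log for the strings named in this section; the function name `audit_nccl_log`, the `require_collnet` flag, and the exact regular expressions are assumptions, and the precise log wording can vary between NCCL releases, so adjust the patterns to what your version prints.

```python
import re

# Patterns taken from the indicators listed in this section.
IB_TRANSPORT = re.compile(r"NCCL INFO Using network IB")
COLLNET = re.compile(r"NCCL INFO Connected.*CollNet")
RED_FLAGS = re.compile(r"NCCL WARN|falling back|Using network Socket")


def audit_nccl_log(text, require_collnet=False):
    """Return a list of human-readable problems found in one NCCL log."""
    problems = []
    if not IB_TRANSPORT.search(text):
        problems.append("IB/RoCE transport not confirmed")
    if require_collnet and not COLLNET.search(text):
        problems.append("CollNet not confirmed (possible silent SHARP fallback)")
    for line in text.splitlines():
        if RED_FLAGS.search(line):
            problems.append(f"red flag: {line.strip()}")
    return problems
```

An empty list means the log passed every check; a launcher wrapper could refuse to start the real training job otherwise. Setting `require_collnet=True` only makes sense on fabrics where SHARP is expected to be active.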
```bash
# Parse an NCCL log for the headline indicators
grep -E "NCCL INFO (Bootstrap|Using network|Connected.*CollNet|Algorithm|Protocol)" \
  /var/log/nccl/rank-*.log | sort -u

# Per-rank CPU affinity (each rank should be pinned to cores close to its GPU)
grep "NCCL INFO Setting affinity" /var/log/nccl/rank-*.log

# Hunt for fallbacks (any line is a red flag)
grep -E "NCCL WARN|falling back|disabled" /var/log/nccl/rank-*.log

# Live perf counters via UFM (InfiniBand)
ufm_rest_cli get ports/counters --filter "rx_pause_count > 0 OR symbol_errors > 0"
```

Cost and FinOps
NCCL itself is BSD-licensed and free; the cost it drives is GPU-hour waste from suboptimal collective performance. On a communication-bound 1,024-GPU H100 training run, a 20 % AllReduce regression can translate into a run up to ~20 % longer and ~$50-200k in additional GPU-hours per training week, depending on the negotiated rate.
The FinOps levers are:
- Validate NCCL_IB_HCA pinning at job start. Saves 30-50% AllReduce throughput on misconfigured nodes; pays for itself in hours.
- Enable SHARP/NVLS on supported fabrics. 10-30% AllReduce uplift at scale; free if the hardware is already there.
- Pin a single NCCL release per cluster. Mixed-version deadlocks cost full multi-day re-runs (~$200k+ on a 1,024-H100 job).
- Use `nccl-tests` as the GPU-cluster smoke test before every multi-week training run. A 15-minute validation up front catches the issues that otherwise surface 4 days into a 14-day run.
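- Avoid TCP fallback in production. `NCCL_SOCKET_IFNAME=^lo,docker` keeps fallback traffic off loopback and bridge interfaces, and `NCCL_IB_DISABLE=0` leaves the IB/RoCE transport enabled, but neither guarantees it is used: confirm `NCCL INFO Using network IB` in the log. TCP fallback at 25/100/400 GbE is 5-20x slower.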
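The cost of a regression is simple arithmetic, sketched below. The per-GPU-hour rates in the usage note are hypothetical placeholders, not quoted prices, and the model assumes the slowdown applies to the whole run (that is, the job is communication-bound).

```python
def regression_cost(gpus, hours, usd_per_gpu_hour, slowdown):
    """Extra spend caused by a run taking `slowdown` (e.g. 0.20) longer.

    gpus: number of GPUs reserved for the job
    hours: planned wall-clock duration without the regression
    usd_per_gpu_hour: negotiated rate (an input you must supply)
    slowdown: fractional increase in wall-clock time
    """
    if slowdown < 0:
        raise ValueError("slowdown must be non-negative")
    return gpus * hours * usd_per_gpu_hour * slowdown
```

For a 1,024-GPU job over one week (168 hours) with a 20 % slowdown, assumed rates of $2 and $4 per GPU-hour give roughly $69k and $138k of extra spend, which sits inside the range quoted above.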
Security and compliance
NCCL itself does not implement authentication or encryption between ranks — it assumes a trusted underlying fabric. For multi-tenant clusters, the relevant isolation primitives are InfiniBand PKeys (partition keys, fabric-level VLAN-equivalent) and the GPU's own MIG / Confidential Compute mode (NVIDIA Hopper+ CC-on with attested AES-256-GCM PCIe and HBM encryption).
Operational hardening: run jobs in containers with NCCL traffic confined to dedicated CNI networks, scrub `NCCL_DEBUG_FILE` paths to per-job directories, and never share `/dev/infiniband/*` devices across non-cooperating tenants.
Compliance: NCCL has no compliance posture of its own. Inherit it from the surrounding stack — DGX-bundled NCCL ships under NVIDIA's enterprise support and qualifies for HIPAA/SOC 2 reference designs; community NCCL is FOSS BSD-3 and treated as a library dependency for software bill of materials (SBOM, e.g. SPDX).
Migration and alternatives
On NVIDIA GPUs there is essentially no NCCL alternative in production — every framework defaults to it. On AMD GPUs the analogue is RCCL (ROCm Collective Communications Library, an NCCL-compatible ABI fork); on AWS Trainium it is the Neuron Collective Communications library; on Intel Gaudi it is HCCL.
- Switching from NCCL to RCCL on AMD usually needs no code change in PyTorch: ROCm builds keep `backend='nccl'` and route it to RCCL. Test thoroughly: the feature-parity gap (NVLS, CollNet) means topology-specific tuning differs.
- MSCCLang is worth investigating only when your collective is bespoke (custom topology, non-power-of-2 GPU counts) — write the algorithm in MSCCLang, compile to MSCCL, register as a NCCL plugin.
- Falling back to Gloo for GPU collectives is an anti-pattern — Gloo's GPU path is host-staged and 10-50x slower than NCCL. Only use Gloo for CPU-side coordination (rendezvous, barriers).
| Alternative | Platform | API compatibility with NCCL | Maturity |
|---|---|---|---|
| RCCL | AMD MI200/MI300 | ABI-compatible drop-in | Production; gap closing vs NCCL |
| oneCCL | Intel CPU/GPU | Similar API, different headers | Production for Intel-only stacks |
| HCCL | Intel Gaudi 2/3 | Habana-specific | Production within Gaudi |
| Neuron CC | AWS Trainium / Inferentia | Neuron SDK-specific | Production within AWS |
| MSCCL / MSCCLang | NVIDIA GPUs (Microsoft Research project) | NCCL-compatible, custom algorithms | Research-grade |
| Gloo | Any (PyTorch backend) | Lower-level, no GPU optimisation | Legacy; CPU-only AllReduce in production |
Troubleshooting
The vast majority of NCCL incidents fall into a small number of categories. The table below maps the symptom you see to the first thing to check.
| Symptom | Most likely cause | First action |
|---|---|---|
| Job hangs at startup, no NCCL log lines | Rendezvous failure (firewall, wrong interface) | Set NCCL_DEBUG=INFO; verify NCCL_SOCKET_IFNAME; check bootstrap port reachability |
| NCCL hangs mid-step, all ranks stuck | GPU OOM on one rank, or NCCL_IB_TIMEOUT too tight on lossy fabric | Check `dmesg` on all hosts; raise NCCL_IB_TIMEOUT to 24; verify ECN/PFC on RoCE |
| AllReduce throughput half of expected | Wrong NIC affinity (cross-NUMA) or PCIe path instead of NVLink | Dump topology with NCCL_TOPO_DUMP_FILE; pin NCCL_IB_HCA in NUMA order; set NCCL_P2P_LEVEL=NVL |
| AllReduce throughput much less than half expected | TCP fallback active (IB/RoCE init failed) | grep for `Using network Socket` in NCCL log; fix IB driver / GID index |
| `unhandled cuda error` during collective | Mismatched CUDA versions across ranks, or buffer freed before kernel complete | Pin CUDA + driver version per node; ensure synchronisation before tensor reuse |
| SHARP/CollNet absent silently | libnccl-net.so / libsharp.so not on loader path, or Aggregation Manager not running | Check `ldconfig -p \| grep nccl`; verify `sharp_am` process on the head node |
| NCCL OOM (`ncclSystemError: out of memory`) | NCCL_BUFFSIZE too large for available GPU memory | Return NCCL_BUFFSIZE to its 4 MiB default if it was raised; reduce NCCL_MAX_NCHANNELS |
| RoCEv2: AllReduce collapses under load | PFC/ECN not configured or wrong DSCP/PCP mapping | Verify NCCL_IB_TC matches DSCP on switches; check PFC pause counters on TOR |
| Mixed-version deadlock at init | Different NCCL versions across nodes | Standardise NCCL version per cluster; pin in container images |
| Collective slows over time | Memory fragmentation in cudaMallocFromPoolAsync, or SHARP AN exhaustion | Restart job; check Aggregation Manager AN allocation |
The most expensive NCCL incident class is silent: SHARP enabled, the env var set, but the CollNet plugin failing to load — the job runs to completion at ring-AllReduce performance and the operator never notices the missing 20-30 % uplift. Always grep for `Connected ... using CollNet` in the first job's NCCL log on every new cluster.
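The mixed-version row in the table above is cheap to guard against before a job starts. The sketch below groups hosts by the NCCL version they report; how the versions are collected is left open (in PyTorch, `torch.cuda.nccl.version()` returns a version tuple on each host), and the names `nccl_version_groups` and `assert_single_nccl_version` are hypothetical.

```python
from collections import defaultdict


def nccl_version_groups(version_by_host):
    """Group host names by the NCCL version tuple each one reported."""
    groups = defaultdict(list)
    for host, version in sorted(version_by_host.items()):
        groups[tuple(version)].append(host)
    return dict(groups)


def assert_single_nccl_version(version_by_host):
    """Raise RuntimeError listing the hosts per version if more than one is present."""
    groups = nccl_version_groups(version_by_host)
    if len(groups) > 1:
        detail = "; ".join(
            f"{'.'.join(map(str, v))}: {', '.join(hosts)}"
            for v, hosts in sorted(groups.items())
        )
        raise RuntimeError(f"mixed NCCL versions across hosts -> {detail}")
```

Running the assertion in the launcher turns a potential init-time deadlock into an immediate, readable error that names the outlier hosts.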
Where this fits in the Yobitel stack
Every multi-GPU training and large-context inference workload running on Yobitel's GPU cloud, NeoCloud Tier-III pods, or sovereign UK clusters goes through NCCL. Yobitel's fabric defaults — Quantum-2 NDR with SHARPv3 on the H100/H200 pods, Quantum-3 XDR with SHARPv4 on the Blackwell pods, Spectrum-X RoCEv2 on the Ethernet-preferring sovereign pods — are tuned so that the default NCCL env-var profile shipped with the cluster image hits the bandwidth targets in the sizing table above without per-job tweaking.
Customers running on Yobibyte do not need to touch NCCL directly — the managed inference and fine-tune surface handles topology pinning, NIC affinity and SHARP enablement automatically. Customers running on raw Yobitel GPU Cloud get a `/etc/profile.d/nccl.sh` baked into the image with the cluster-appropriate variables already set, plus the `nccl-tests` binaries pre-installed at `/opt/nccl-tests/build/`.