TL;DR
- InfiniBand XDR is the IBTA-defined 800 Gb/s per-port generation, doubling NDR's bandwidth by running 4 lanes of 200 Gb/s PAM4 signalling per port.
- Pairs with the NVIDIA Quantum-3 switch and ConnectX-8 host channel adapter, the reference fabric for B200, GB200, and GB300 NVL training clusters.
- Retains InfiniBand's credit-based lossless flow control, hardware RDMA, sub-microsecond cut-through switching, and SHARPv4 in-network reduction.
- Volume shipments began in 2024-2025; XDR is positioned against 800G Spectrum-X Ethernet for frontier-model training fabrics through 2027.
Overview#
InfiniBand XDR is the sixth generation of the InfiniBand standard, succeeding NDR. Where NDR uses four lanes of 100 Gb/s PAM4 to reach 400 Gb/s per port, XDR doubles the per-lane signalling rate to 200 Gb/s PAM4, keeping the same four-lane port format and arriving at 800 Gb/s per port. The aggregated XDR2 link uses eight lanes for 1.6 Tb/s on a single cable.
XDR was timed to match the Blackwell GPU generation. A single B200 GPU's HBM3e and NVLink 5 bandwidth would saturate an NDR port during cross-pod collectives, so the fabric had to scale alongside the accelerators. NVIDIA's Quantum-3 platform — the XDR switch family — and the ConnectX-8 HCA together form the reference fabric for the GB200 NVL72 and GB300 NVL72 reference designs.
Specifications#
| Property | Value |
|---|---|
| Per-lane signalling | 200 Gb/s PAM4 |
| Lanes per standard port | 4 |
| Per-port line rate | 800 Gb/s (XDR) |
| Aggregated link (XDR2) | 1.6 Tb/s (8 lanes) |
| Encoding | PAM4 with RS-FEC |
| Switch ASIC | NVIDIA Quantum-3 |
| Host adapter | ConnectX-8 |
| Connector | OSFP-XD twin-port |
| Switch port-to-port latency | Cut-through, sub-microsecond |
| First volume shipments | 2024-2025 |
| In-network reduction | SHARPv4 |
Why XDR Mattered for Blackwell#
The argument for XDR is bandwidth balance. A GB200 NVL72 rack ties 72 Blackwell GPUs into a single NVLink domain at ~130 TB/s of bisection bandwidth. When two such racks need to exchange gradients during AllReduce, the inter-rack fabric is the bottleneck — and an NDR port is not enough to keep tensor parallelism stages from stalling at scale.
XDR's 800 Gb/s per port aligns the inter-rack fabric capacity to the rate at which NVLink can drain a Blackwell's HBM. In practice this lets a 4,096-GPU pod hit higher sustained MFU on long-context training than the equivalent NDR fabric, particularly when expert-parallel mixture-of-experts models thrash the all-to-all collective.
Operational Notes#
- Cabling moves from OSFP to OSFP-XD; existing NDR DACs and AOCs do not slot into XDR cages without adapters.
- Passive copper reach shrinks at higher signalling rates — most XDR runs beyond ~2 m use linear pluggable optics (LPO) or active optical cables.
- Quantum-3 supports SHARPv4 with deeper reduction trees and lower precision (BF16/FP8) — enable explicitly in NCCL for AllReduce-bound jobs.
- Subnet Manager scaling: UFM 6.x is recommended for clusters above 1,024 nodes due to faster topology re-convergence.
- Mixed NDR/XDR subnets are supported but cap the slower side; segregate pods if budget allows.
XDR optics dissipate noticeably more heat than NDR equivalents. Rack-level airflow and switch fan curves should be re-validated when transitioning a hall from NDR to XDR.
Pitfalls#
- Driver and firmware drift between ConnectX-7 (NDR) and ConnectX-8 (XDR) is common during phased migrations; standardise DOCA release per cluster.
- Some early XDR cables exhibited tighter bend-radius tolerances than NDR — survey rack cabling plans before installation.
- Power draw per port roughly doubles versus NDR; PDU budgeting at the top-of-rack switch tier must be revisited.
References
- InfiniBand Architecture Specification · InfiniBand Trade Association
- NVIDIA Quantum-3 InfiniBand Platform · NVIDIA
- NVIDIA ConnectX-8 SuperNIC Product Brief · NVIDIA