TL;DR
- Spine-leaf is a two-tier folded-Clos topology: every leaf switch connects to every spine switch, no leaf connects to another leaf, and no spine connects to another spine — yielding predictable two-hop, uniform-latency paths.
- Non-blocking when each leaf's uplink bandwidth equals or exceeds its endpoint-facing downlink bandwidth (1:1 subscription); 2:1 or 4:1 oversubscription is acceptable for inference and storage tiers, almost never for tightly-coupled training.
- Variants matter: classic spine-leaf, collapsed-core (small pods), super-spine three-tier (for multi-pod scale), and rail-optimised spine-leaf (the AI-fabric form where each GPU's NICs map to independent rails).
- Versus alternatives — simpler and lower diameter than fat-tree at small to mid scale, cheaper to operate than dragonfly under ~16k endpoints, and far more deterministic than mesh or 3D-torus for AI collectives.
- Yobitel NeoCloud reference designs use rail-optimised spine-leaf with InfiniBand NDR or 800G Ethernet underneath; Yobibyte training pods schedule across rails to keep NCCL collective traffic rail-local.
Overview
The spine-leaf topology is a folded form of the Clos network introduced by Charles Clos in 1953 for telephone switching, then rediscovered for data centres in the late 2000s as east-west traffic grew. The shape is simple: two tiers of switches, fully bipartite. Every leaf (access switch) has a link to every spine (aggregation switch); no leaf is wired to another leaf and no spine is wired to another spine. Endpoints — servers, GPU baseboards, storage controllers, NICs — attach only to leaves.
The geometry produces three properties that make the design hard to beat at typical data centre scale. First, latency is uniform: any two endpoints on different leaves are exactly two inter-switch hops apart (leaf to spine, then spine to leaf). Second, paths are diverse: between any pair of leaves there are exactly N equal-cost paths, where N is the number of spines, so ECMP or InfiniBand adaptive routing can spread flows trivially. Third, expansion is incremental: add spines for more bandwidth, add leaves for more endpoints, and the fabric remains a regular bipartite graph.
For AI infrastructure, spine-leaf is the practical anchor for everything below the very top end of training. Storage, management, multi-tenant inference, and small-to-mid training pods all sit on spine-leaf fabrics. Training clusters that outgrow two tiers (beyond roughly 2k endpoints on 64-port switches, up to the 16k+ GPU frontier) add a third tier and become fat-trees, which generalise spine-leaf, but the design idea is the same. Yobitel NeoCloud's pod-level building block is a rail-optimised spine-leaf at 1,024 GPUs; the equivalent Yobibyte training-pod abstraction the customer sees is just a region with a guaranteed bisection number behind it.
This entry helps you decide whether a spine-leaf fabric is the right shape for your AI cluster, how to size it without surprises, and how the rail-optimised variant Yobitel uses on NeoCloud differs from a generic spine-leaf you might run for a storage tier.
How it works
Mechanically, every leaf has its ports split into two sets: a southbound set that faces endpoints, and a northbound set that connects to spines. If a leaf has 64 ports of 400 Gb/s, a common 1:1 design uses 32 southbound (for endpoints) and 32 northbound (one to each of 32 spines). Spines have only southbound ports: every spine port goes down to a leaf.
Any path between two endpoints on different leaves traverses leaf-southbound-port, leaf-ASIC, leaf-northbound-port, optic, spine-port, spine-ASIC, spine-port, optic, destination-leaf-northbound-port, destination-leaf-ASIC, destination-leaf-southbound-port. That is three switch ASICs joined by two inter-switch links. Modern cut-through switches add roughly 300-600 nanoseconds per switch traversed. Endpoints on the same leaf cross a single switch.
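The path walk above can be turned into a rough switching-latency budget. This is a minimal sketch that assumes the 300-600 ns figure applies to each switch ASIC on the path and ignores optics, cable propagation and NIC time; the function and constant names are illustrative only.

```python
# Switch ASICs on each path, counted from the element list above.
CROSS_LEAF_SWITCHES = 3  # source leaf, spine, destination leaf
SAME_LEAF_SWITCHES = 1   # a single leaf


def switch_latency_ns(switches: int, per_switch_ns: float) -> float:
    """Cut-through switching latency only (no optics, cable or NIC time)."""
    return switches * per_switch_ns


for label, count in (("same leaf", SAME_LEAF_SWITCHES),
                     ("cross leaf", CROSS_LEAF_SWITCHES)):
    low = switch_latency_ns(count, 300)
    high = switch_latency_ns(count, 600)
    print(f"{label}: {low:.0f}-{high:.0f} ns of switching latency")
```

Under these assumptions the cross-leaf budget is three times the same-leaf budget, and it is identical for every pair of leaves.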
Routing is normally Layer 3 to every leaf. The underlay is eBGP following the RFC 7938 design, usually configured as BGP unnumbered, or an OSPF underlay with a BGP EVPN overlay; tenant Layer 2 is delivered through VXLAN, typically with an EVPN control plane, to keep broadcast domains off the fabric. ECMP at every tier spreads flows across spines; on InfiniBand spine-leaf fabrics the same role is played by Quantum-2/3 adaptive routing.
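To make the ECMP step concrete, the sketch below maps flow 5-tuples onto spines with a stable hash. It is an illustration only: real switch ASICs use their own vendor-specific hash functions, and SHA-256, the `pick_spine` name and the sample addresses are stand-ins chosen for this example.

```python
import hashlib
from collections import Counter


def pick_spine(flow: tuple, num_spines: int) -> int:
    """Map a flow 5-tuple to a spine index; the same flow always gets the same spine."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_spines


# 64 hypothetical RoCEv2 flows (UDP, destination port 4791) between two hosts,
# differing only in source port.
flows = [("10.0.0.1", "10.0.1.1", 17, 49152 + i, 4791) for i in range(64)]
load = Counter(pick_spine(flow, 16) for flow in flows)

for spine in range(16):
    print(f"spine {spine:2d}: {load.get(spine, 0)} flows")
```

Because the mapping is per flow rather than per packet, the spread is only statistically even: a handful of large flows can still land on the same spine.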
The defining sizing question is the ratio of leaf downlink bandwidth to leaf uplink bandwidth. A 1:1 design is non-blocking: every endpoint can send at line rate to any other endpoint, including in the worst-case adversarial permutation. A 2:1 design halves the uplink bandwidth, which halves the cost of spine ports and optics but caps per-endpoint cross-leaf throughput at half line rate when every endpoint on a leaf sends off-leaf at once. Training fabrics demand 1:1 because AllReduce stresses every part of the bisection at once; inference and storage fabrics tolerate 2:1 because their traffic is mostly north-south.
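The ratio and its effect on throughput reduce to two small calculations. The sketch below assumes equal sharing of uplink capacity among the endpoints of one leaf; the function names and the 32/16 and 32/8 port splits are illustrative.

```python
def oversubscription_ratio(down_ports: int, up_ports: int,
                           down_gbps: float = 400, up_gbps: float = 400) -> float:
    """Leaf downlink bandwidth divided by leaf uplink bandwidth (1.0 means 1:1)."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)


def worst_case_cross_leaf_gbps(down_ports: int, up_ports: int,
                               down_gbps: float = 400, up_gbps: float = 400) -> float:
    """Per-endpoint rate when every endpoint on the leaf sends off-leaf at once."""
    fair_share = (up_ports * up_gbps) / down_ports
    return min(down_gbps, fair_share)


for down, up in ((32, 32), (32, 16), (32, 8)):
    ratio = oversubscription_ratio(down, up)
    rate = worst_case_cross_leaf_gbps(down, up)
    print(f"{down} down / {up} up: {ratio:.0f}:1, {rate:.0f} Gb/s per endpoint worst case")
```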
- Two tiers only: leaf and spine. Adding a third tier turns the topology into a super-spine fat-tree (see Variants below).
- Fully bipartite: every leaf has exactly one link to every spine (or the same number of parallel links to each spine when port counts allow more than one).
- Uniform two-hop diameter: every cross-leaf flow sees the same number of hops, removing one source of tail-latency variance in distributed collectives.
- Failure-graceful: losing one spine reduces capacity by 1/N but does not partition the fabric.
- Bandwidth scales linearly with the number of spines and endpoint capacity scales linearly with the number of leaves; the two are independent levers.
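The structural properties in the list above can be checked on a toy model of the fabric. This is a sketch under simple assumptions: the fabric is a dictionary of leaf-spine pairs, all links have equal speed, and the helper names are hypothetical.

```python
from itertools import product


def build_fabric(num_leaves: int, num_spines: int, links_per_pair: int = 1) -> dict:
    """Full bipartite fabric as {(leaf, spine): number of parallel links}."""
    return {(leaf, spine): links_per_pair
            for leaf, spine in product(range(num_leaves), range(num_spines))}


def equal_cost_paths(fabric: dict, src_leaf: int, dst_leaf: int) -> int:
    """Count leaf-spine-leaf paths, one per uplink/downlink combination per spine."""
    spines = {spine for (leaf, spine) in fabric if leaf == src_leaf}
    return sum(fabric[(src_leaf, spine)] * fabric.get((dst_leaf, spine), 0)
               for spine in spines)


def capacity_after_spine_failures(num_spines: int, failed: int) -> float:
    """Fraction of cross-leaf capacity left after losing `failed` spines."""
    return (num_spines - failed) / num_spines


fabric = build_fabric(num_leaves=32, num_spines=16)
print(equal_cost_paths(fabric, 0, 31))       # one path per spine
print(capacity_after_spine_failures(16, 1))  # capacity drops by 1/16
```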
Variants and architectural choices
The basic spine-leaf shape has several common refinements. Pick the one that matches your scale, your fabric technology, and (for AI clusters) the locality model of your collectives. Yobitel NeoCloud uses rail-optimised spine-leaf for training pods and classic spine-leaf for storage and management.
- Rail-optimised spine-leaf is the AI-fabric form. With 4 HCAs per HGX H100/H200 baseboard, you build 4 parallel spine-leaf planes; HCA1 on every host shares a plane, HCA2 on every host shares the next plane, and so on. NCCL pins each communication channel to a specific HCA, so collective traffic stays on its rail and never contends with other rails inside the fabric.
- Quantum-3 with the InfiniBand Director chassis can collapse a three-tier fat-tree back into a two-tier spine-leaf for clusters up to ~16k endpoints — fewer cables, fewer hops, same bisection.
- Collapsed-core saves rack units and cable count at small scale but stops being useful above ~512 endpoints; expand to classic spine-leaf at that point.
- On Ethernet, EVPN VXLAN over a BGP unnumbered underlay is the standard control plane. On InfiniBand, OpenSM or NVIDIA UFM is mandatory and there is no separate overlay — partitions (PKeys) do tenant isolation.
| Variant | What changes | When to use | Trade-off |
|---|---|---|---|
| Classic spine-leaf | Pure two-tier bipartite graph | General data centre, storage, inference fleets up to ~2k endpoints | Capped by leaf radix × number of leaves |
| Collapsed-core | Spine and core merged into a single tier of larger switches | Small pods of a few hundred endpoints | Less granular expansion; single-vendor lock-in to large chassis |
| Super-spine three-tier | Adds a third tier above pods of spine-leaf for multi-pod scale | Multi-pod sites; >10k endpoints | More cables, more hops, more operational complexity |
| Rail-optimised spine-leaf | Each GPU's NICs map to N independent rails, each its own spine-leaf plane | AI training fabrics where NCCL benefits from per-rail locality | Tight cabling discipline; rails are independent failure domains |
| Dual-stack (Ethernet + IB) | Separate spine-leaf fabrics for management/storage (Ethernet) and GPU collectives (InfiniBand) | Sites that want IB for AllReduce but Ethernet familiarity for ops | Two fabrics to operate; two telemetry pipelines |
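The rail-optimised wiring rule described above (HCA k on every host joins rail k) can be written as a small mapping. This is a sketch, not Yobitel's cabling tool: the `RailPort` type, the `rail_port` function and the 32-hosts-per-leaf default are assumptions made for illustration.

```python
from typing import NamedTuple


class RailPort(NamedTuple):
    rail: int  # which spine-leaf plane the link belongs to
    leaf: int  # leaf index within that plane
    port: int  # southbound port on that leaf


def rail_port(host: int, hca: int, hosts_per_leaf: int = 32) -> RailPort:
    """HCA k on every host lands on rail k; hosts fill each rail's leaves in order."""
    return RailPort(rail=hca, leaf=host // hosts_per_leaf, port=host % hosts_per_leaf)


# With 4 HCAs per host, host 33 uses leaf 1, port 1 on each of the four rails.
for hca in range(4):
    print(rail_port(host=33, hca=hca))
```

The point of the rule is that the rail depends only on the HCA index, so the same HCA on any two hosts always shares a plane.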
Sizing and capacity planning
The arithmetic for a spine-leaf fabric is straightforward once you fix the leaf radix, the uplink-to-downlink ratio, and the spine count. With K-port leaves split into L uplinks and D downlinks per leaf, you get: maximum spines ≤ L, maximum leaves ≤ spine port count, maximum endpoints per leaf = D, maximum endpoints = (number of leaves) × D, and aggregate leaf-uplink bandwidth = (number of leaves) × L × per-port-rate. Bisection bandwidth, the capacity between two equal halves of the endpoints, is half of that aggregate. Non-blocking requires L × per-port-rate ≥ D × per-port-rate, i.e. L ≥ D.
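The limits in the paragraph above can be collected into one helper. This is a minimal sketch: the `LeafSplit` type and `fabric_limits` function are hypothetical names, and the function simply restates the formulas for a given leaf count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeafSplit:
    uplinks: int       # L, ports facing spines
    downlinks: int     # D, ports facing endpoints
    port_gbps: float   # per-port rate


def fabric_limits(split: LeafSplit, num_leaves: int) -> dict:
    """Restate the sizing formulas for a fabric with `num_leaves` leaves."""
    return {
        "max_spines": split.uplinks,
        "endpoints_per_leaf": split.downlinks,
        "max_endpoints": num_leaves * split.downlinks,
        # (number of leaves) x L x per-port rate, expressed in Tb/s
        "aggregate_uplink_tbps": num_leaves * split.uplinks * split.port_gbps / 1000,
        "non_blocking": split.uplinks >= split.downlinks,
    }


print(fabric_limits(LeafSplit(uplinks=32, downlinks=32, port_gbps=400), num_leaves=32))
```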
For Yobitel NeoCloud's standard 1,024-GPU training pod, the numbers land at 32 Quantum-2 leaves and 16 Quantum-2 spines (each leaf has 32 ports downstream to 32 GPU NICs and 32 ports upstream split across the 16 spines as two links each). The result is a 1:1 non-blocking InfiniBand NDR spine-leaf with a bisection of 512 × 400 Gb/s = 204.8 Tb/s, half of the 32 × 32 × 400 Gb/s = 409.6 Tb/s aggregate leaf-uplink capacity. The same arithmetic applied to 800G Ethernet (Tomahawk 5 or Spectrum-X SN5600) gives an equivalent 1,024-GPU pod with 16 leaves and 8 spines, provided each 800G leaf downlink is broken out into two 400G NIC ports.
- Yobitel NeoCloud's UK training-pod reference design lands on the Quantum-2 32-leaf × 16-spine non-blocking configuration as the standard sovereign-region building block; the Yobibyte training-pod abstraction the customer schedules into is one of these pods (or, for larger fine-tunes, an aggregation of several).
- Inference fabrics (Yobitel NeoCloud's inference-optimised pods, Yobibyte managed inference endpoints) typically use a 2:1 oversubscribed spine-leaf because inference traffic is mostly north-south to the gateway, not east-west between GPUs.
- Storage fabrics use a separate spine-leaf at 400 Gb/s Ethernet (Spectrum-3 or Tomahawk 4) — high enough for NVMe-oF and parallel filesystem traffic, dedicated so RDMA storage flows do not contend with training collectives.
- Beyond ~2,048 endpoints on a single Quantum-2 spine-leaf, the leaf-radix arithmetic forces a third tier. Use a fat-tree (see related entry) or a Quantum-3 Director-based collapsed two-tier design.
- Power: each Quantum-2 leaf draws ~750 W, each Quantum-3 leaf draws ~1.2 kW. Budget ~1-1.5 kW per spine-leaf switch when sizing PDUs.
| Fabric tech | Leaf SKU | Endpoints / leaf | Leaves × spines for 1,024 endpoints | Cross-leaf bisection |
|---|---|---|---|---|
| InfiniBand NDR | Quantum-2 (64 × 400 Gb/s) | 32 | 32 × 16 | 204.8 Tb/s |
| InfiniBand XDR | Quantum-3 (64 × 800 Gb/s) | 32 | 32 × 16 | 409.6 Tb/s |
| 800G Ethernet | Spectrum-X SN5600 (64 × 800 Gb/s) | 32 | 32 × 16 | 409.6 Tb/s |
| 800G Ethernet (compact) | Tomahawk 5 (64 × 800 Gb/s) | 64 (400G NICs via 2 × 400G breakout) | 16 × 8 | 204.8 Tb/s |
| 400G Ethernet | Spectrum-3 / Tomahawk 4 (32 × 400 Gb/s) | 16 | 32 × 16 (two-tier limit of 512 endpoints) | 102.4 Tb/s |
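A leaves × spines figure like those in the table can be derived from an endpoint target and a switch radix. The planner below is a sketch under stated assumptions: the same radix for leaves and spines, a 1:1 split, endpoints at the full port rate (no breakout), and an equal number of parallel links from every leaf to every spine. The function name is hypothetical.

```python
import math


def plan_non_blocking(endpoints: int, radix: int, port_gbps: float) -> dict:
    """Smallest 1:1 two-tier layout for `endpoints`, or ValueError if two tiers cannot fit."""
    down = radix // 2                 # D: endpoint-facing ports per leaf
    up = radix - down                 # L: spine-facing ports per leaf
    leaves = math.ceil(endpoints / down)
    min_spines = math.ceil(leaves * up / radix)  # spine ports must absorb every uplink
    for spines in range(min_spines, up + 1):
        if up % spines == 0:          # uplinks divide evenly across spines
            return {
                "leaves": leaves,
                "spines": spines,
                "links_per_leaf_spine_pair": up // spines,
                "bisection_tbps": (endpoints / 2) * port_gbps / 1000,
            }
    raise ValueError("does not fit in two tiers at this radix")


print(plan_non_blocking(endpoints=1024, radix=64, port_gbps=400))
```

For 1,024 endpoints on 64-port 400 Gb/s switches this yields 32 leaves, 16 spines and two links per leaf-spine pair.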
Versus alternatives
Spine-leaf is the right default for most data centre fabrics; the question is when something else wins.
- Versus fat-tree: a fat-tree is a spine-leaf with an extra tier above the spines. If two tiers fit your scale, use spine-leaf — fewer hops, fewer cables, lower latency, simpler ops.
- Versus dragonfly: dragonfly minimises long-haul optical cable cost by grouping endpoints and routing between groups, but pays for it with longer worst-case paths and more demanding adaptive routing. Worthwhile at DOE-exascale scale, almost never worthwhile in commercial AI.
- Versus 3D-torus and mesh: useful for nearest-neighbour HPC workloads but a bad fit for AllReduce-dominated AI collectives, which stress diagonal and long-range paths the same as nearest-neighbour ones.
- Yobitel NeoCloud's choice is spine-leaf up to 1,024 GPUs per pod and fat-tree from 2,048 to 16,000 GPUs per pod. Within that larger range, Director-based collapsed two-tier designs are the alternative to adding a third tier.
| Topology | Diameter | Cable scaling | Best at | Avoid when |
|---|---|---|---|---|
| Spine-leaf (2-tier) | 2 hops cross-leaf | O(N) | General data centre, inference fleets, training pods up to ~2k endpoints | Above ~2k endpoints on commodity radix |
| Fat-tree (3-tier) | 4 hops cross-pod | O(N), roughly twice the inter-switch links of two-tier | Training clusters from ~2k to ~16k endpoints with full bisection | Smaller scales where the extra tier adds cost without benefit |
| Dragonfly / Dragonfly+ | 3 hops typical, longer worst-case | O(N) with long-haul links | Supercomputing systems with strict optics-cost budgets at 10k+ endpoints | Commercial AI where adaptive-routing complexity is unjustified |
| 3D-torus | O(N^(1/3)) hops | O(N) | Legacy HPC; specific science workloads | Modern training where AllReduce dominates |
| Hypercube / mesh | O(log N) or O(N^(1/2)) | Mixed | Niche academic / chip-fabric use | Data centre AI |
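The diameter column can be made concrete for the torus row. The sketch below assumes a cubic k × k × k torus with wraparound links, where the worst case is floor(k/2) hops in each of the three dimensions, and compares it with the constant two inter-switch hops of spine-leaf. The names are illustrative.

```python
SPINE_LEAF_DIAMETER = 2  # inter-switch hops between any two leaves


def torus_3d_diameter(k: int) -> int:
    """Worst-case hop count in a k x k x k torus with wraparound links."""
    return 3 * (k // 2)


for k in (4, 8, 10):
    nodes = k ** 3
    print(f"{nodes:>4} nodes: 3D-torus {torus_3d_diameter(k)} hops, "
          f"spine-leaf {SPINE_LEAF_DIAMETER} hops")
```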
Trade-offs and known limitations
- Leaf radix bounds the design: with K-port switches split into L=K/2 northbound and D=K/2 southbound ports per leaf, each spine can reach at most K leaves, so a non-blocking fabric tops out at K × D = K^2/2 endpoints. For K=64 that is 2,048 endpoints; above that, add a tier or move to higher-radix switches.
- Oversubscription is a knife edge for AI: 1:1 is correct for training, 2:1 silently cuts cross-leaf AllReduce throughput by up to ~50%, and 4:1 is unusable for anything tightly-coupled. Document the ratio and audit it after every fabric change.
- ECMP hash collisions and polarisation on Ethernet: long-lived elephant flows can hash to the same spine and create persistent hotspots. Mitigate with dynamic load balancing (DLB) or packet-level spraying; InfiniBand fabrics address the same problem with adaptive routing.
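- Cable count: with one link per leaf-spine pair, a 32-leaf × 16-spine fabric has 32 × 16 = 512 inter-tier cables and a 64-leaf × 32-spine fabric has 2,048; running two links per pair, as in the 1,024-GPU pod above, doubles the count. Cable trays, patch panels and labelling discipline matter much more than the topology diagram suggests.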
- Mixing leaf and spine speeds (e.g. 400G leaves with 800G spines) is supported but creates a per-flow rate cap at the slower side; either run a uniform speed or accept the bottleneck explicitly.
- Multi-tenant Layer 2: a single broadcast domain across leaves is an anti-pattern. Use VXLAN/EVPN on Ethernet or PKeys on InfiniBand for tenant isolation.
- Rail-optimised cabling demands per-cable audit. A single mis-cabled rail link is invisible until NCCL collective performance drops on one rail.
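The cable-count and rail-audit points in the list above lend themselves to a simple scripted check. This is a sketch only: it assumes the cable plan or discovery data is available as (host, HCA index, rail of the connected leaf) tuples and that HCA k belongs on rail k; both function names are hypothetical.

```python
def inter_tier_cables(leaves: int, spines: int, links_per_pair: int = 1) -> int:
    """Leaf-to-spine cable count for a full bipartite fabric."""
    return leaves * spines * links_per_pair


def miscabled_links(links):
    """Return every (host, hca, rail) entry whose HCA index does not match its rail."""
    return [(host, hca, rail) for host, hca, rail in links if hca != rail]


# Hypothetical discovery data: host 7's HCA 2 was patched into rail 3 by mistake.
observed = [(7, 0, 0), (7, 1, 1), (7, 2, 3), (7, 3, 3)]

print(inter_tier_cables(32, 16))   # 512
print(miscabled_links(observed))   # [(7, 2, 3)]
```

Running a check like this after every cabling change catches the mismatch before it shows up as a slow rail.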
Oversubscription is the single biggest correctness trap. Size the spine count without checking that L ≥ D on every leaf and you will quietly ship a 2:1 fabric, and only notice when the first multi-node training run prints half the expected NCCL bandwidth.
Implementation notes
Practical building blocks and where they sit in the modern operator's toolchain. None of this is novel — the discipline is in choosing one stack per pod and not letting the operations team drift.
- InfiniBand spine-leaf: Quantum-2 (NDR) or Quantum-3 (XDR) leaves and spines, ConnectX-7 or ConnectX-8 HCAs at endpoints, OpenSM or NVIDIA UFM as Subnet Manager. Partition keys for tenant separation; adaptive routing and SHARPv3/SHARPv4 for collective acceleration.
- Ethernet spine-leaf: Tomahawk 5, Spectrum-X SN5600, Silicon One G200 or Teralynx 10 leaves and spines; ConnectX-7/8 or BlueField-3/4 endpoints; BGP unnumbered underlay with EVPN VXLAN overlay (FRR, SONiC, Arista EOS, NVIDIA Cumulus Linux).
- Rail-optimised AI fabric: per-rail planes wired independently; NCCL_IB_HCA set to the HCAs each job should use; GPU-to-NIC affinity verified with `nvidia-smi topo -m`, with a rail-aware NCCL topology file supplied through NCCL_TOPO_FILE where auto-detection falls short.
- Storage tier: separate 400G Ethernet spine-leaf with RoCEv2 + PFC/ECN/DCQCN tuning, dedicated VLAN/EVPN tenant.
- Observability: per-port telemetry (`ethtool -S`, UFM telemetry, sFlow, gNMI streaming), PFC pause counters and ECN marks for Ethernet RoCE fabrics, SHARP tree health from UFM for IB.
- Yobitel NeoCloud chooses spine-leaf and rail-optimised spine-leaf at the pod level and exposes the result to Yobibyte and InferenceBench customers as opaque regions; customers see latency and bandwidth, not the underlying topology.
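For the observability item above, pause counters can be pulled out of `ethtool -S` output with a few lines of parsing. This is a sketch: counter names differ between NIC drivers, so the sample text and the names `rx_prio3_pause` and `tx_prio3_pause` are illustrative rather than a guaranteed format, and the helper names are hypothetical.

```python
def parse_ethtool_stats(text: str) -> dict[str, int]:
    """Parse 'name: value' lines from `ethtool -S` output into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def pause_deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Pause-related counters that changed between two samples."""
    return {name: after[name] - before[name]
            for name in after
            if "pause" in name and name in before and after[name] != before[name]}


# Illustrative samples taken a few seconds apart on one port.
sample_1 = "NIC statistics:\n     rx_prio3_pause: 10\n     tx_prio3_pause: 0\n     rx_bytes: 1000\n"
sample_2 = "NIC statistics:\n     rx_prio3_pause: 25\n     tx_prio3_pause: 0\n     rx_bytes: 9000\n"

print(pause_deltas(parse_ethtool_stats(sample_1), parse_ethtool_stats(sample_2)))
```

A counter that keeps climbing between samples on a RoCE port is a sign that PFC is throttling traffic on that link.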
Where it fits in the Yobitel stack
Spine-leaf is the underlying shape of every Yobitel-operated fabric. Yobitel NeoCloud GPU pods are built on rail-optimised spine-leaf InfiniBand NDR (and XDR for the newest Blackwell pods); NeoCloud storage and management tiers ride classic Ethernet spine-leaf. The Yobibyte managed platform schedules workloads onto these pods without exposing the fabric — but its placement engine relies on rail-locality information so multi-node training jobs see uncontested NCCL bandwidth.
InferenceBench measurements run on the same NeoCloud pods, so the throughput-versus-batch curves published on the leaderboard reflect a real production rail-optimised spine-leaf fabric, not a synthetic benchmark rig. For customers planning their own non-Yobitel deployments, the entries linked below explain the Ethernet, InfiniBand, and fat-tree options that compose with spine-leaf.
References
- A Study of Non-Blocking Switching Networks (Clos, 1953) · Bell System Technical Journal
- Spine-and-Leaf Architecture (Cisco) · Cisco
- RFC 7938: Use of BGP for Routing in Large-Scale Data Centers · IETF
- A Scalable, Commodity Data Center Network Architecture (Al-Fares et al, 2008) · SIGCOMM 2008
- NVIDIA DGX SuperPOD Reference Architecture · NVIDIA