Spine-Leaf Topology

TL;DR

Spine-leaf is a two-tier folded-Clos topology: every leaf switch connects to every spine switch, no leaf connects to another leaf, and no spine connects to another spine — yielding predictable two-hop, uniform-latency paths.
Non-blocking when each leaf's uplink bandwidth to the spines equals or exceeds its endpoint-facing downlink bandwidth (1:1 subscription); 2:1 or 4:1 oversubscription is acceptable for inference and storage tiers, almost never for tightly-coupled training.
Variants matter: classic spine-leaf, collapsed-core (small pods), super-spine three-tier (for multi-pod scale), and rail-optimised spine-leaf (the AI-fabric form where each GPU's NICs map to independent rails).
Versus alternatives — simpler and lower diameter than fat-tree at small to mid scale, cheaper to operate than dragonfly under ~16k endpoints, and far more deterministic than mesh or 3D-torus for AI collectives.
Yobitel NeoCloud reference designs use rail-optimised spine-leaf with InfiniBand NDR or 800G Ethernet underneath; Yobibyte training pods schedule with rail awareness so that NCCL collective traffic stays rail-local.

Overview

The spine-leaf topology is a folded form of the Clos network introduced by Charles Clos in 1953 for telephone switching, then rediscovered for data centres in the late 2000s as east-west traffic grew. The shape is simple: two tiers of switches, fully bipartite. Every leaf (access switch) has a link to every spine (aggregation switch); no leaf is wired to another leaf and no spine is wired to another spine. Endpoints — servers, GPU baseboards, storage controllers, NICs — attach only to leaves.

The geometry produces three properties that make the design hard to beat at typical data centre scale. First, latency is uniform: any two endpoints on different leaves are exactly two inter-switch hops apart (leaf to spine, then spine to leaf). Second, paths are diverse: with one link per leaf-spine pair there are exactly N equal-cost paths between any pair of leaves, where N is the number of spines, so ECMP or InfiniBand adaptive routing can spread flows across them. Third, expansion is incremental: add spines for more bandwidth, add leaves for more endpoints, and the fabric remains a regular bipartite graph.

For AI infrastructure, spine-leaf is the practical anchor for everything below the very top end of training. Storage, management, multi-tenant inference, and small-to-mid training pods all sit on spine-leaf fabrics. Training clusters that outgrow a single two-tier fabric (roughly 2k endpoints on 64-port switches, and certainly the 16k+ GPU frontier) add a third tier and become fat-trees, which generalise spine-leaf, but the design idea is the same. Yobitel NeoCloud's pod-level building block is a rail-optimised spine-leaf at 1,024 GPUs; the Yobibyte training-pod abstraction the customer sees is just a region with a guaranteed bisection number behind it.

This entry helps you decide whether a spine-leaf fabric is the right shape for your AI cluster, how to size it without surprises, and how the rail-optimised variant Yobitel uses on NeoCloud differs from a generic spine-leaf you might run for a storage tier.

How it works

Mechanically, every leaf has its ports split into two sets: a southbound set that faces endpoints, and a northbound set that connects to spines. If a leaf has 64 ports of 400 Gb/s, a common 1:1 design uses 32 southbound (for endpoints) and 32 northbound (one to each of 32 spines). Spines have only southbound ports: every spine port goes down to a leaf.

Any path between two endpoints on different leaves traverses leaf-southbound-port, leaf-ASIC, leaf-northbound-port, optic, spine-port, spine-ASIC, spine-port, optic, destination-leaf-northbound-port, destination-leaf-ASIC, destination-leaf-southbound-port. That is three switch traversals (leaf, spine, leaf) joined by two inter-switch links. Modern cut-through switches add roughly 300-600 nanoseconds per switch traversed. Endpoints on the same leaf cross only that one leaf switch.
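The latency figures above can be turned into a rough budget. The sketch below is a minimal illustration, assuming the 300-600 ns figure applies to each switch traversed and ignoring cable and optic propagation delay; the function names are hypothetical.

```python
# Assumption: latency is dominated by switch traversals. Cable and optic
# propagation delay is ignored, so real paths will be somewhat slower.
SWITCH_LATENCY_NS = (300, 600)  # per-switch range quoted in the text


def switch_traversals(same_leaf: bool) -> int:
    """Number of switches crossed between two endpoints in a two-tier fabric."""
    return 1 if same_leaf else 3  # leaf only, or leaf -> spine -> leaf


def fabric_latency_ns(same_leaf: bool) -> tuple[int, int]:
    """(best, worst) one-way switch latency in nanoseconds."""
    n = switch_traversals(same_leaf)
    low, high = SWITCH_LATENCY_NS
    return n * low, n * high


print("same leaf :", fabric_latency_ns(True))
print("cross leaf:", fabric_latency_ns(False))
```

Every cross-leaf pair gets the same bounds, which is the uniformity property in numeric form.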

Routing is normally Layer 3 to every leaf. The underlay is usually eBGP in the RFC 7938 style, commonly run unnumbered over IPv6 link-local sessions, or OSPF with a BGP EVPN overlay; tenant Layer 2 is delivered through VXLAN, usually with an EVPN control plane, to keep broadcast domains off the fabric. ECMP at every layer spreads flows across spines; on InfiniBand spine-leaf fabrics the same role is played by Quantum-2/3 adaptive routing.
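To see how per-flow ECMP spreads traffic, and why a handful of long-lived flows can still pile onto one spine, here is a small sketch. It assumes a SHA-256 hash of the 5-tuple as a stand-in for the vendor-specific hardware hash a real switch uses; the addresses and ports are hypothetical.

```python
import hashlib
from collections import Counter


def pick_spine(flow: tuple, num_spines: int) -> int:
    """Map a flow 5-tuple to a spine index with a stable hash.

    SHA-256 is only a deterministic stand-in; real switches use their own
    hardware hash functions and seeds.
    """
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_spines


def spine_load(flows: list, num_spines: int) -> Counter:
    """Count how many flows land on each spine."""
    return Counter(pick_spine(f, num_spines) for f in flows)


# Hypothetical flows: (src_ip, dst_ip, protocol, src_port, dst_port)
flows = [("10.0.0.1", "10.0.1.1", 17, 49152 + i, 4791) for i in range(16)]
load = spine_load(flows, num_spines=16)
print("flows per spine:", dict(sorted(load.items())))
print("busiest spine carries", max(load.values()), "flow(s)")
```

Because each flow is hashed independently, 16 flows over 16 spines rarely land one per spine. Flowlet-based or per-packet schemes relax the one-flow-one-path constraint that this sketch models.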

The defining sizing question is the leaf-uplink to leaf-downlink bandwidth ratio. A 1:1 design is non-blocking: every endpoint can send at line rate to any other endpoint, even under a worst-case permutation, provided the load balancing spreads flows evenly across the spines. A 2:1 design halves the uplink bandwidth, which roughly halves the cost of spine ports and optics but caps cross-leaf throughput at half line rate when every endpoint on a leaf sends at once. Training fabrics demand 1:1 because AllReduce stresses every part of the bisection at once; inference and storage fabrics tolerate 2:1 because their traffic is more north-south.

Two tiers only: leaf and spine. Adding a third tier turns the topology into a super-spine fat-tree (see Variants below).
Fully bipartite: every leaf has a link to every spine (one link, or several parallel equal-cost links when port counts allow).
Uniform two-hop diameter: every cross-leaf flow sees the same number of hops, removing one source of tail-latency variance in distributed collectives.
Failure-graceful: losing one spine reduces capacity by 1/N but does not partition the fabric.
Bandwidth scales linearly with the number of spines and endpoint capacity scales linearly with the number of leaves; the two are independent levers.
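The properties listed above can be checked on a toy model. The following sketch builds the bipartite link list, counts the equal-cost paths between two leaves, and shows the effect of a spine failure; the naming scheme (`leaf0`, `spine0`) and helper functions are illustrative assumptions, not part of any vendor tool.

```python
from itertools import product


def build_fabric(num_leaves: int, num_spines: int, links_per_pair: int = 1):
    """Return (leaf, spine, link_index) tuples for a fully bipartite fabric."""
    return [
        (f"leaf{l}", f"spine{s}", k)
        for l, s in product(range(num_leaves), range(num_spines))
        for k in range(links_per_pair)
    ]


def equal_cost_paths(links, src_leaf: str, dst_leaf: str, failed=frozenset()):
    """Spines that still offer a leaf -> spine -> leaf path between two leaves."""
    up = {s for l, s, _ in links if l == src_leaf and s not in failed}
    down = {s for l, s, _ in links if l == dst_leaf and s not in failed}
    return sorted(up & down)


def remaining_capacity(num_spines: int, failed_spines: int) -> float:
    """Fraction of leaf uplink capacity left after spine failures."""
    return (num_spines - failed_spines) / num_spines


links = build_fabric(num_leaves=4, num_spines=4)
print(len(links), "inter-tier links")
print("paths leaf0 -> leaf3:", equal_cost_paths(links, "leaf0", "leaf3"))
print("after losing spine2:", equal_cost_paths(links, "leaf0", "leaf3", {"spine2"}))
print("capacity left:", remaining_capacity(4, 1))
```

Losing one of four spines removes one path and a quarter of the capacity, but every leaf pair stays connected.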

Variants and architectural choices

The basic spine-leaf shape has several common refinements. Pick the one that matches your scale, your fabric technology, and (for AI clusters) the locality model of your collectives. Yobitel NeoCloud uses rail-optimised spine-leaf for training pods and classic spine-leaf for storage and management.

Rail-optimised spine-leaf is the AI-fabric form. With R compute HCAs per host (typically 8 on an 8-GPU HGX H100/H200 baseboard, one per GPU), you build R parallel spine-leaf planes; HCA1 on every host shares a plane, HCA2 on every host shares the next plane, and so on. NCCL keeps each communication channel on a specific HCA, so collective traffic stays on its rail and does not contend with other rails inside the fabric.
Quantum-3 with the InfiniBand Director chassis can collapse a three-tier fat-tree back into a two-tier spine-leaf for clusters up to ~16k endpoints — fewer cables, fewer hops, same bisection.
Collapsed-core saves rack units and cable count at small scale but stops being useful above ~512 endpoints; expand to classic spine-leaf at that point.
On Ethernet, EVPN VXLAN over a BGP unnumbered underlay is the standard control plane. On InfiniBand, a Subnet Manager is mandatory (OpenSM on its own or as part of NVIDIA UFM) and there is no separate overlay; partitions (PKeys) provide tenant isolation.
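The rail-optimised wiring rule described above (HCA index i on every host lands on rail i) can be written down as a cabling plan generator. This is a minimal sketch under stated assumptions: the leaf naming scheme, the fill-leaves-in-host-order policy, and the demo values of 4 rails and 32 hosts per leaf are all hypothetical and chosen for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attachment:
    host: int
    hca: int
    rail: int
    leaf: str
    port: int


def rail_attachment(host: int, hca: int, hosts_per_leaf: int) -> Attachment:
    """Place HCA `hca` of `host` on rail `hca`, filling that rail's leaves in host order."""
    leaf_index, port = divmod(host, hosts_per_leaf)
    return Attachment(host, hca, rail=hca, leaf=f"rail{hca}-leaf{leaf_index}", port=port)


def cabling_plan(num_hosts: int, hcas_per_host: int, hosts_per_leaf: int):
    """One attachment per (host, HCA) pair."""
    return [
        rail_attachment(h, a, hosts_per_leaf)
        for h in range(num_hosts)
        for a in range(hcas_per_host)
    ]


plan = cabling_plan(num_hosts=64, hcas_per_host=4, hosts_per_leaf=32)
print(len(plan), "host links")
print(plan[0])
print(rail_attachment(host=33, hca=2, hosts_per_leaf=32))
```

A plan like this doubles as the expected state for a later cabling audit: any observed link that disagrees with it is a candidate mis-cable.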

Variant	What changes	When to use	Trade-off
Classic spine-leaf	Pure two-tier bipartite graph	General data centre, storage, inference fleets up to ~2k endpoints	Capped at spine radix × downlinks per leaf
Collapsed-core	Spine and core merged into a single tier of larger switches	Small pods of a few hundred endpoints	Less granular expansion; single-vendor lock-in to large chassis
Super-spine three-tier	Adds a third tier above pods of spine-leaf for multi-pod scale	Multi-pod sites; >10k endpoints	More cables, more hops, more operational complexity
Rail-optimised spine-leaf	Each GPU's NICs map to N independent rails, each its own spine-leaf plane	AI training fabrics where NCCL benefits from per-rail locality	Tight cabling discipline; rails are independent failure domains
Dual-stack (Ethernet + IB)	Separate spine-leaf fabrics for management/storage (Ethernet) and GPU collectives (InfiniBand)	Sites that want IB for AllReduce but Ethernet familiarity for ops	Two fabrics to operate; two telemetry pipelines

Sizing and capacity planning

The arithmetic for a spine-leaf fabric is straightforward once you fix the leaf radix, the uplink-to-downlink ratio, and the spine count. With K-port leaves split into L uplinks and D downlinks per leaf (L + D = K), you get: maximum spines = L (one link per spine), maximum leaves = spine radix, endpoints per leaf = D, maximum endpoints = (number of leaves) × D, and total leaf-to-spine capacity = (number of leaves) × L × per-port-rate. Bisection bandwidth, as quoted in the table below, is half of that capacity in a 1:1 fabric: half the endpoints sending to the other half at line rate. Non-blocking requires L × per-port-rate ≥ D × per-port-rate, i.e. L ≥ D.

For Yobitel NeoCloud's standard 1,024-GPU training pod, the numbers land at 32 Quantum-2 leaves and 16 Quantum-2 spines: each leaf has 32 ports downstream to 32 GPU NICs and 32 ports upstream, split across the 16 spines as two links each. That is a 1:1 non-blocking InfiniBand NDR spine-leaf with a bisection of 512 × 400 Gb/s = 204.8 Tb/s (half the endpoints sending to the other half at line rate). The same arithmetic applied to 800G Ethernet (Tomahawk 5 or Spectrum-X SN5600) gives an equivalent 1,024-GPU pod with 16 leaves and 8 spines, which only works if each 800G downlink port is broken out into two 400G endpoint links so that a leaf serves 64 endpoints.
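The pod arithmetic can be reproduced with a few lines of code. The sketch below is one way to do it, assuming uniform port speeds, a non-blocking target, and bisection defined as half the endpoints sending to the other half at line rate; `FabricPlan` and `size_fabric` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class FabricPlan:
    leaves: int
    spines: int
    links_per_leaf_spine_pair: int
    endpoints: int
    bisection_tbps: float


def size_fabric(endpoints, leaf_ports, spine_ports, downlinks, port_gbps, spines):
    """Size a non-blocking two-tier fabric, or raise if it cannot be built."""
    uplinks = leaf_ports - downlinks
    if uplinks < downlinks:
        raise ValueError("oversubscribed: fewer uplinks than downlinks")
    if uplinks % spines:
        raise ValueError("uplinks must divide evenly across spines")
    leaves = math.ceil(endpoints / downlinks)
    links_per_pair = uplinks // spines
    if leaves * links_per_pair > spine_ports:
        raise ValueError("spine radix exceeded: add a tier or use more spines")
    # Half the endpoints sending to the other half at line rate, in Tb/s.
    bisection_tbps = endpoints * port_gbps / 2 / 1000
    return FabricPlan(leaves, spines, links_per_pair, endpoints, bisection_tbps)


# 1,024 endpoints on 64-port 400 Gb/s leaves and spines, 32 down / 32 up.
print(size_fabric(1024, leaf_ports=64, spine_ports=64, downlinks=32,
                  port_gbps=400, spines=16))
```

Asking the same function for 2,049 endpoints with 32 spines raises an error, because the 65th leaf has no free spine port: that is the point at which a third tier becomes necessary on 64-port switches.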

Yobitel NeoCloud's UK training-pod reference design lands on the Quantum-2 32-leaf × 16-spine non-blocking configuration as the standard sovereign-region building block; the Yobibyte training-pod abstraction the customer schedules into is one of these pods (or, for larger fine-tunes, an aggregation of several).
Inference fabrics (Yobitel NeoCloud's inference-optimised pods, Yobibyte managed inference endpoints) typically use a 2:1 oversubscribed spine-leaf because inference traffic is mostly north-south to the gateway, not east-west between GPUs.
Storage fabrics use a separate spine-leaf at 400 Gb/s Ethernet (Spectrum-3 or Tomahawk 4) — high enough for NVMe-oF and parallel filesystem traffic, dedicated so RDMA storage flows do not contend with training collectives.
Beyond ~2,048 endpoints on a single Quantum-2 spine-leaf, the leaf-radix arithmetic forces a third tier. Use a fat-tree (see related entry) or a Quantum-3 Director-based collapsed two-tier design.
Power: each Quantum-2 leaf draws ~750 W and each Quantum-3 leaf draws ~1.2 kW. Budget ~1-1.5 kW per spine-leaf switch when sizing PDUs.

Fabric tech	Leaf SKU	Endpoints / leaf	Leaves × spines for 1,024 endpoints	Cross-leaf bisection
InfiniBand NDR	Quantum-2 (64 × 400 Gb/s)	32	32 × 16	204.8 Tb/s
InfiniBand XDR	Quantum-3 (64 × 800 Gb/s)	32	32 × 16	409.6 Tb/s
800G Ethernet	Spectrum-X SN5600 (64 × 800 Gb/s)	32	32 × 16	409.6 Tb/s
800G Ethernet (compact)	Tomahawk 5 (64 × 800 Gb/s, downlinks split 2 × 400 Gb/s)	64	16 × 8	204.8 Tb/s
400G Ethernet	Spectrum-3 (32 × 400 Gb/s)	16	32 × 16 (two-tier limit: 512 endpoints)	102.4 Tb/s

Versus alternatives

Spine-leaf is the right default for most data centre fabrics; the question is when something else wins.

Versus fat-tree: a fat-tree is a spine-leaf with an extra tier above the spines. If two tiers fit your scale, use spine-leaf — fewer hops, fewer cables, lower latency, simpler ops.
Versus dragonfly: dragonfly minimises long-haul optical cable cost by grouping endpoints and routing between groups, but pays for it with longer worst-case paths and more demanding adaptive routing. Worthwhile at DOE-exascale scale, almost never worthwhile in commercial AI.
Versus 3D-torus and mesh: useful for nearest-neighbour HPC workloads but a bad fit for AllReduce-dominated AI collectives, which stress diagonal and long-range paths the same as nearest-neighbour ones.
Yobitel NeoCloud's choice is spine-leaf up to 1,024 GPUs per pod and fat-tree from 2,048 to 16,000 GPUs per pod. Beyond that, Director-based collapsed two-tier designs come back into play.

Topology	Diameter	Cable scaling	Best at	Avoid when
Spine-leaf (2-tier)	2 hops cross-leaf	O(N)	General data centre, inference fleets, training pods up to ~2k endpoints	Above ~2k endpoints on commodity radix
Fat-tree (3-tier)	4 hops cross-pod	O(N), about twice the inter-switch cables of two tiers	Training clusters from ~2k to ~16k endpoints with full bisection	Smaller scales where the extra tier adds cost without benefit
Dragonfly / Dragonfly+	3 hops typical, longer worst-case	O(N) with long-haul links	Supercomputing systems with strict optics-cost budgets at 10k+ endpoints	Commercial AI where adaptive-routing complexity is unjustified
3D-torus	O(N^(1/3)) hops	O(N)	Legacy HPC; specific science workloads	Modern training where AllReduce dominates
Hypercube / mesh	O(log N) or O(N^(1/2))	Mixed	Niche academic / chip-fabric use	Data centre AI

Trade-offs and known limitations

Switch radix bounds the design: a fabric of K-port leaves and K-port spines, with L=K/2 northbound and D=K/2 southbound ports per leaf, tops out at K leaves (one per spine port) × D = K^2/2 endpoints per spine-leaf. For K=64 that is 2,048 endpoints; above that, add a tier or move to higher-radix switches.
Oversubscription is a knife edge for AI: 1:1 is correct for training, 2:1 can silently cut cross-leaf AllReduce throughput by up to ~50%, and 4:1 is unusable for anything tightly-coupled. Document the ratio and audit it after every fabric change.
ECMP polarisation on Ethernet: long-lived elephant flows can hash to the same spine and create persistent hotspots. Mitigated by dynamic load balancing (DLB), packet-level spraying, or InfiniBand adaptive routing.
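Cable count: with one link per leaf-spine pair, a 32-leaf × 16-spine fabric has 32 × 16 = 512 inter-tier cables and a 64-leaf × 32-spine fabric has 2,048; running two links per pair, as in the 1,024-GPU pod above, doubles the count. Cable trays, patch panels and labelling discipline matter much more than the topology diagram suggests.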
Mixing leaf and spine speeds (e.g. 400G leaves with 800G spines) is supported but creates a per-flow rate cap at the slower side; either run a uniform speed or accept the bottleneck explicitly.
Multi-tenant Layer 2: a single broadcast domain across leaves is an anti-pattern. Use VXLAN/EVPN on Ethernet or PKeys on InfiniBand for tenant isolation.
Rail-optimised cabling demands per-cable audit. A single mis-cabled rail link is invisible until NCCL collective performance drops on one rail.
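The per-cable audit mentioned in the last item can be automated once link data is available. The sketch below assumes you can assemble (host, HCA index, rail of the leaf reached) tuples from LLDP or subnet-manager neighbour data; how that data is collected is outside the sketch, and the host names are hypothetical.

```python
def find_miscabled(links):
    """Return links whose HCA index does not match the rail of the leaf they reach.

    `links` is an iterable of (host, hca_index, leaf_rail) tuples.
    """
    return [(host, hca, rail) for host, hca, rail in links if hca != rail]


observed = [
    ("gpu-node-01", 0, 0),
    ("gpu-node-01", 1, 1),
    ("gpu-node-02", 0, 1),  # swapped with the link below
    ("gpu-node-02", 1, 0),
]

for host, hca, rail in find_miscabled(observed):
    print(f"{host}: HCA {hca} is cabled to rail {rail}")
```

Running a check like this after every cabling change catches the swap before it shows up as a slow rail in NCCL.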

Oversubscription is the single biggest correctness trap. An operator who sizes the spine count from the number of leaves and the spine radix, but never checks that each leaf has L ≥ D, will quietly ship a 2:1 fabric and only notice when the first multi-node training run prints half the expected NCCL bandwidth.
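A ratio audit is cheap to script. The following is a minimal sketch, assuming a per-leaf inventory of (downlinks, downlink Gb/s, uplinks, uplink Gb/s); the inventory format and leaf names are hypothetical.

```python
from fractions import Fraction


def oversubscription(downlinks, down_gbps, uplinks, up_gbps) -> Fraction:
    """Downlink-to-uplink bandwidth ratio for one leaf (1 means 1:1)."""
    return Fraction(downlinks * down_gbps, uplinks * up_gbps)


def audit(leaves: dict, max_ratio=Fraction(1)) -> dict:
    """Return the leaves whose ratio exceeds `max_ratio`."""
    offenders = {}
    for name, (d, d_gbps, u, u_gbps) in leaves.items():
        ratio = oversubscription(d, d_gbps, u, u_gbps)
        if ratio > max_ratio:
            offenders[name] = ratio
    return offenders


inventory = {
    "leaf01": (32, 400, 32, 400),  # 1:1
    "leaf02": (32, 400, 16, 400),  # half the uplinks populated
}

for name, ratio in audit(inventory).items():
    print(f"{name} is {ratio.numerator}:{ratio.denominator} oversubscribed")
```

Using exact fractions avoids rounding surprises when speeds are mixed, for example 400G downlinks with 800G uplinks.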

Implementation notes

Practical building blocks and where they sit in the modern operator's toolchain. None of this is novel — the discipline is in choosing one stack per pod and not letting the operations team drift.

InfiniBand spine-leaf: Quantum-2 (NDR) or Quantum-3 (XDR) leaves and spines, ConnectX-7 or ConnectX-8 HCAs at endpoints, OpenSM or NVIDIA UFM as Subnet Manager. Partition keys for tenant separation; adaptive routing and SHARPv3/SHARPv4 for collective acceleration.
Ethernet spine-leaf: Tomahawk 5, Spectrum-X SN5600, Silicon One G200 or Teralynx 10 leaves and spines; ConnectX-7/8 or BlueField-3/4 endpoints; BGP unnumbered underlay with EVPN VXLAN overlay (FRR, SONiC, Arista EOS, NVIDIA Cumulus Linux).
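Rail-optimised AI fabric: per-rail planes wired independently; NCCL_IB_HCA restricted to the compute-fabric HCAs; GPU-to-NIC affinity verified with `nvidia-smi topo -m`, with a rail-aware topology XML supplied through NCCL_TOPO_FILE where NCCL's auto-detection falls short.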
Storage tier: separate 400G Ethernet spine-leaf with RoCEv2 + PFC/ECN/DCQCN tuning, dedicated VLAN/EVPN tenant.
Observability: per-port telemetry (`ethtool -S`, UFM telemetry, sFlow, gNMI streaming), PFC pause counters and ECN marks for Ethernet RoCE fabrics, SHARP tree health from UFM for IB.
Yobitel NeoCloud chooses spine-leaf and rail-optimised spine-leaf at the pod level and exposes the result to Yobibyte and InferenceBench customers as opaque regions; customers see latency and bandwidth, not the underlying topology.
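For the Ethernet underlay, a leaf's BGP unnumbered stanza is regular enough to template. The sketch below renders an FRR-style configuration fragment from Python; the ASN, router ID and `swp` interface names are hypothetical, the fragment covers only the IPv4 underlay (no EVPN address family), and it has not been validated against a live switch.

```python
def render_leaf_bgp(asn: int, router_id: str, uplinks: list[str], max_paths: int) -> str:
    """Render an FRR-style BGP unnumbered stanza for one leaf."""
    lines = [
        f"router bgp {asn}",
        f" bgp router-id {router_id}",
        " bgp bestpath as-path multipath-relax",
    ]
    # One interface-based eBGP session per spine-facing uplink.
    lines += [f" neighbor {ifname} interface remote-as external" for ifname in uplinks]
    lines += [
        " address-family ipv4 unicast",
        "  redistribute connected",
        f"  maximum-paths {max_paths}",
        " exit-address-family",
    ]
    return "\n".join(lines)


print(render_leaf_bgp(65101, "10.0.0.11", ["swp49", "swp50", "swp51", "swp52"], 16))
```

Generating the stanza from the same inventory that drives the cabling plan keeps the number of configured sessions in step with the number of uplinks actually wired.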

Where it fits in the Yobitel stack

Spine-leaf is the underlying shape of every Yobitel-operated fabric. Yobitel NeoCloud GPU pods are built on rail-optimised spine-leaf InfiniBand NDR (and XDR for the newest Blackwell pods); NeoCloud storage and management tiers ride classic Ethernet spine-leaf. The Yobibyte managed platform schedules workloads onto these pods without exposing the fabric — but its placement engine relies on rail-locality information so multi-node training jobs see uncontested NCCL bandwidth.

InferenceBench measurements run on the same NeoCloud pods, so the throughput-versus-batch curves published on the leaderboard reflect a real production rail-optimised spine-leaf fabric, not a synthetic benchmark rig. For customers planning their own non-Yobitel deployments, the entries linked below explain the Ethernet, InfiniBand, and fat-tree options that compose with spine-leaf.

References

A Study of Non-Blocking Switching Networks (Clos, 1953) · Bell System Technical Journal
Spine-and-Leaf Architecture (Cisco) · Cisco
RFC 7938: Use of BGP for Routing in Large-Scale Data Centers · IETF
A Scalable, Commodity Data Center Network Architecture (Al-Fares et al, 2008) · SIGCOMM 2008
NVIDIA DGX SuperPOD Reference Architecture · NVIDIA

