TL;DR
- A fat tree is a folded Clos topology, here built with three tiers (leaf, spine, and super-spine; also called edge, aggregation, and core), that provides full bisection bandwidth at scale.
- Introduced by Charles Leiserson in 1985, it became the canonical data-centre topology, and later the standard AI fabric topology, after Al-Fares et al.'s 2008 SIGCOMM paper on commodity Clos networks.
- Properties: any endpoint can communicate with any other at full link bandwidth, and the fabric has no choke points if dimensioned for full bisection.
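- Trade-off: cable count grows linearly with endpoint count N but with a large constant, roughly N links per tier or 3N in total for a non-blocking three-tier design; a 10k-endpoint fat tree at NDR speeds requires tens of thousands of optical links.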
Overview
A fat tree extends the spine-leaf idea to three tiers while preserving full bisection bandwidth: for any split of the endpoints into two equal halves, the fabric can carry traffic between the halves at the combined line rate of the endpoints in one half. For AI training, this matters because AllReduce and AllToAll collectives stress every part of the bisection at once.
The 'fat' in fat tree comes from Leiserson's original observation that, in a tree where leaves are processors, links closer to the root must be 'fatter' (higher bandwidth) than links near the leaves to avoid bottlenecks. In a commodity-switch implementation, the same property comes from adding more parallel links rather than wider individual links.
Three-Tier Structure
A typical three-tier fat tree with radix-K switches supports K³/4 endpoints. For K=64 (Quantum-2-class), that is 65,536 endpoints in principle, more than most single AI training clusters need.
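The capacity figure can be checked with a short calculation. The sketch below assumes the k-ary construction described by Al-Fares et al. (k pods, each with k/2 leaves and k/2 spines, plus (k/2)² super-spines). The function name `fat_tree_capacity` and the returned field names are illustrative and do not come from any library.

```python
def fat_tree_capacity(k: int) -> dict:
    """Size a non-blocking three-tier fat tree built from radix-k switches.

    Assumption: the k-ary construction with k pods, k/2 leaves and k/2
    spines per pod, and (k/2)^2 super-spines. k must be even.
    """
    if k < 2 or k % 2:
        raise ValueError("switch radix k must be an even integer >= 2")
    half = k // 2
    endpoints = k * half * half   # k pods x k/2 leaves x k/2 down ports = k^3/4
    leaves = k * half             # k/2 leaves in each of k pods
    spines = k * half             # k/2 spines in each of k pods
    super_spines = half * half    # (k/2)^2 core switches
    # At full bisection each tier boundary carries one link per endpoint:
    # endpoint-leaf, leaf-spine, spine-super-spine.
    links = 3 * endpoints
    return {
        "endpoints": endpoints,
        "leaves": leaves,
        "spines": spines,
        "super_spines": super_spines,
        "switches": leaves + spines + super_spines,
        "links": links,
    }


if __name__ == "__main__":
    print(fat_tree_capacity(64))
```

For K=64 this gives the 65,536 endpoints quoted above, built from 5,120 switches. Two of every three links run between switches, and those are the ones that usually need optics.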
Within the structure, endpoints attach to leaves; leaves attach to spines within a pod; spines attach to super-spines across pods. Each tier provides equal-cost paths that ECMP or InfiniBand adaptive routing exploits.
| Tier | Role | Typical count (1024-GPU fabric) |
|---|---|---|
| Leaf (edge) | Endpoint attachment | 32 leaves × 32 endpoints |
| Spine (aggregation) | Intra-pod aggregation | 16-32 spines |
| Super-spine (core) | Inter-pod connectivity | Optional for >1 pod |
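The single-pod counts in the table can be reproduced with a small sizing helper. This is a sketch under stated assumptions: every leaf splits its ports evenly between endpoints and uplinks, all switches share one radix, and spine ports are fully used. The name `size_two_tier` is hypothetical.

```python
import math


def size_two_tier(endpoints: int, radix: int) -> tuple[int, int]:
    """Return (leaves, spines) for a non-blocking leaf-spine pod.

    Assumption: each leaf uses radix/2 ports down and radix/2 ports up,
    and every spine must reach every leaf.
    """
    down = radix // 2
    leaves = math.ceil(endpoints / down)
    if leaves > radix:
        # A spine has only `radix` ports, so it cannot reach more leaves.
        raise ValueError("too many leaves for one pod; a third tier is needed")
    uplinks = leaves * down              # one uplink per endpoint port
    spines = math.ceil(uplinks / radix)  # spines with every port in use
    return leaves, spines


if __name__ == "__main__":
    print(size_two_tier(1024, 64))
```

With 1,024 endpoints and radix-64 switches this yields 32 leaves and 16 spines, the lower end of the range in the table. Using 32 spines instead leaves half of each spine's ports free for growth.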
AI-Specific Considerations
- Full bisection is non-negotiable for training fabrics; oversubscription anywhere along the path caps AllReduce throughput.
- Cable count: with roughly 3N links in a non-blocking three-tier design, 10k-endpoint fabrics need careful cabling plans; mistakes at that scale are hard to find and costly to fix.
- Adaptive routing (InfiniBand) or DLB/dynamic ECMP (Ethernet) is needed to spread flows across equal-cost paths.
- Rail-optimised cabling: in NVIDIA reference designs, each GPU's NIC is assigned to a specific 'rail' through the fabric, keeping intra-rail latency uniform.
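The full-bisection requirement in the list above can be made concrete by computing a leaf's oversubscription ratio. This is a minimal sketch: the function names are illustrative, the port counts are example values, and the rates use the nominal 400 Gb/s for NDR and 200 Gb/s for HDR.

```python
def oversubscription(down_ports: int, up_ports: int,
                     down_gbps: float, up_gbps: float) -> float:
    """Ratio of endpoint-facing to fabric-facing bandwidth on one leaf.

    1.0 means non-blocking; anything above 1.0 is oversubscribed.
    """
    return (down_ports * down_gbps) / (up_ports * up_gbps)


def worst_case_offleaf_gbps(down_gbps: float, ratio: float) -> float:
    """Per-endpoint bandwidth when every endpoint on the leaf sends off-leaf."""
    return down_gbps / max(ratio, 1.0)


if __name__ == "__main__":
    # Example values only: 32 NDR endpoints per leaf in each case.
    for label, up_ports, up_rate in [("32 NDR uplinks", 32, 400.0),
                                     ("16 NDR uplinks", 16, 400.0),
                                     ("32 HDR uplinks", 32, 200.0)]:
        r = oversubscription(32, up_ports, 400.0, up_rate)
        print(label, "ratio", r, "per-endpoint Gb/s", worst_case_offleaf_gbps(400.0, r))
```

A collective that drives every endpoint at once sees the worst-case figure, so a 2:1 leaf halves the achievable cross-leaf bandwidth.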
Pitfalls
- Static ECMP on Ethernet fat trees can cause flow polarisation — long-lived elephant flows pin to one path and create hotspots.
- Mixing tier bandwidth (e.g. NDR leaves with HDR spines) caps every flow that crosses the slower tier at that tier's link rate.
- Two-tier 'fat' trees are common in smaller pods; calling them fat trees is technically imprecise but operationally common.
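The static-ECMP hotspot problem comes from hash collisions, which a toy model can show. The sketch below assumes each flow is pinned to an uplink by a CRC32 hash of its 5-tuple. Real switches use vendor-specific hash functions, so the flow tuples and the hash here are illustrative only.

```python
import zlib


def static_ecmp_load(flows, n_paths):
    """Count flows per uplink when each flow is pinned by a hash of its 5-tuple."""
    load = [0] * n_paths
    for flow in flows:
        key = "|".join(str(field) for field in flow).encode()
        # The hash never changes for a given flow, so a long-lived flow
        # stays on the same uplink for its whole lifetime.
        load[zlib.crc32(key) % n_paths] += 1
    return load


# Hypothetical elephant flows: (src_ip, dst_ip, protocol, src_port, dst_port).
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 17, 49152 + i, 4791) for i in range(16)]
loads = static_ecmp_load(flows, n_paths=16)
print("flows per uplink:", loads)
print("busiest uplink:", max(loads), "idle uplinks:", loads.count(0))
```

With as many flows as uplinks, a perfect spread would put one flow on each. Any uplink carrying two or more flows is a hotspot, and any idle uplink is wasted bisection; adaptive routing or dynamic load balancing exists to remove both.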
References
- Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing (Leiserson, 1985) · IEEE Transactions on Computers
- A Scalable, Commodity Data Center Network Architecture (Al-Fares et al, 2008) · SIGCOMM 2008
- NVIDIA DGX SuperPOD Reference Architecture · NVIDIA