NVIDIA GB200 Grace Blackwell Superchip

TL;DR

Coherent compute module pairing one 72-core Grace ARMv9 CPU with two Blackwell B200 GPUs over NVLink-C2C at 900 GB/s — announced at GTC March 2024, in volume from mid-2025, the headline NVIDIA platform for frontier MoE training and trillion-parameter reasoning inference in 2026.
Per superchip: 2x B200 GPUs at 192 GB HBM3e each (384 GB total at 16 TB/s aggregate), 1x Grace CPU at 72 Neoverse V2 cores with up to 480 GB LPDDR5X (512 GB/s), full cache coherence across the C2C link. Per GPU: 1.8 TB/s NVLink 5.0 to fabric, ~9,000 TFLOPS FP4 sparse, ~4,500 TFLOPS FP8 sparse.
NVL72: 36 GB200 superchips (72 B200 GPUs + 36 Grace CPUs) in a single liquid-cooled rack with 9 fifth-generation NVSwitch trays, all wired over copper NVLink at 1.8 TB/s per GPU — the largest coherent NVLink domain NVIDIA ships, 13.8 TB HBM3e shared, 130 TB/s aggregate NVLink bisection.
Rack power envelope: 120-132 kW per NVL72 depending on configuration — well above the 30-40 kW limit of conventional air-cooled data centres. Full direct-to-chip liquid cooling is mandatory; rear-door heat exchangers cannot absorb NVL72 thermal load.
Unit of compute is the rack, not the card. GB200 is sold as full NVL72 systems or fractional NVL36 / NVL18 reservations; cloud rental pricing in 2026 typically resolves to $50,000-$70,000 per NVL72-equivalent-hour through long reservations and capacity blocks — there is no commodity on-demand $/GPU-hour market like there is for H100.

Overview#

GB200 is not a GPU. It is a coherent compute module — one Grace CPU and two B200 GPUs on a shared board, connected by NVLink-C2C at 900 GB/s in a cache-coherent topology that exposes a unified address space across all three dies. From the operating system's view, Grace is a normal ARMv9 server CPU with 72 cores and 480 GB of soldered LPDDR5X; from CUDA's view, the two Blackwell GPUs are ordinary devices that happen to be able to directly address host memory at NVLink bandwidths without PCIe overhead. What changes versus H100/H200 generation hosts is that the host-to-GPU and GPU-to-host data path stops being a bottleneck — Grace can stream optimiser state into HBM3e at 900 GB/s, and the GPU can spill activations back to LPDDR5X at the same rate.

The reason this matters in 2026 is the NVL72 rack. Thirty-six GB200 superchips slot into a single liquid-cooled rack alongside nine fifth-generation NVSwitch trays. The result is 72 Blackwell GPUs presented as a single 13.8 TB HBM3e pool with non-blocking all-to-all bandwidth at 1.8 TB/s per GPU and ~130 TB/s of aggregate NVLink bisection — the largest coherent NVLink domain NVIDIA has ever shipped, and the shared-memory abstraction that underpins production trillion-parameter MoE training and reasoning-model inference at frontier scale.

This entry is the 2026 reference for teams evaluating GB200 deployment: the architecture of the superchip itself, the NVL72 rack-scale system it builds, the per-superchip and per-rack spec sheets, the workloads that justify the move from H200/B200 to GB200, the operational realities (120 kW racks, mandatory liquid cooling, aarch64 host ISA), the commercial structure (you buy or reserve a rack, not a card), and the migration path from existing Hopper/Blackwell fleets. Yobitel NeoCloud will offer GB200 NVL72 and NVL36 partitions as preview committed-capacity reservations in UK and EU regions with NCSC OFFICIAL alignment, hosted in direct-to-chip liquid-cooled racks that Yobitel operates end-to-end. This entry helps you decide when GB200 is the right pick for your workload and how to size and price it on Yobitel NeoCloud or your own cluster.

How it works: Grace, Blackwell, and the C2C coherence link#

Blackwell is the GPU architecture; B200 is the discrete-die SKU (two reticle-limit dies bonded via a 10 TB/s die-to-die fabric, presenting as a single CUDA device); GB200 is the superchip integrating two B200 dies-pairs with a custom Grace CPU. The three components are distinct but the GB200 module is sold as one part — you cannot buy a Grace CPU separately, and the B200 GPUs inside a GB200 are physically and electrically different from discrete HGX-B200 modules (different power delivery, different cooling, no NVLink mezzanine — the fabric link goes through the carrier).

Grace itself is a co-processor designed for Blackwell. Seventy-two ARMv9 Neoverse V2 cores at ~3.5 GHz, 117 MB L3 cache, an SVE2-capable vector unit, full ECC throughout, and — critically — 480 GB of soldered LPDDR5X DRAM at ~512 GB/s. The LPDDR5X is on the package, not in DIMM slots. There are no upgrade paths and no per-customer memory configurations; every GB200 ships with the same 480 GB host memory.

The NVLink-C2C link is the architectural innovation that makes GB200 different from 'a Grace and two B200s on a board'. C2C runs at 900 GB/s (450 GB/s each direction) and implements full cache coherence — both the CPU and the GPUs see a single address space, with hardware-coherent caches and unified virtual memory. Pages allocated on the CPU side are addressable from CUDA kernels at LPDDR5X latencies; tensors allocated on the GPU side are visible to ARM threads through standard load/store. This eliminates the explicit cudaMemcpy / pinned-memory dance that dominated H100/H200 dataloader paths and unlocks a programming model where Grace becomes the natural home for things that previously had to live awkwardly on the GPU: MoE expert routing tables, KV-cache index structures, dataloader prefetch buffers, optimiser state for very large models.

The second-generation Transformer Engine on Blackwell adds FP4 (E2M1) Tensor Core support at twice the FP8 throughput, with microscaling (MX) formats that allow per-block scaling factors. Combined with the dual-die Blackwell silicon, peak FP4 throughput per GB200 superchip reaches ~36,000 TFLOPS sparse (~18,000 TFLOPS dense), and FP8 reaches ~18,000 TFLOPS sparse — roughly 2.5x H200 at the same precision.

GB200 superchip composition: 1x Grace CPU + 2x B200 GPUs on a single carrier, coherent across NVLink-C2C.
Grace die: 72 ARMv9 Neoverse V2 cores, 117 MB L3, 480 GB on-package LPDDR5X at ~512 GB/s, ECC throughout, SVE2 vector unit.
B200 die-pair: 2 dies bonded via 10 TB/s NV-HBI fabric presenting as one CUDA device; 192 GB HBM3e at 8 TB/s per GPU.
C2C coherent link: 900 GB/s aggregate, full cache coherence, unified virtual memory across CPU and both GPUs.
Per-GPU NVLink 5.0 to fabric: 1.8 TB/s bidirectional (twice H100/H200 NVLink 4.0).
Compute capability: sm_100 (Blackwell) for both GPUs in the superchip.
Module TDP: ~2,700 W typical (varies with rack configuration and workload).

Reference: per-superchip specification sheet#

Authoritative per-superchip figures. Rack-level numbers (NVL72) are listed in the next section at roughly 36x these values. Sparse Tensor figures assume 2:4 structured sparsity; dense throughput is half the listed sparse figure. The B200 die-pair inside a GB200 differs from discrete HGX-B200 SXM6 modules in power delivery, cooling integration and fabric attachment, but the silicon is identical.

Metric	Per GB200 superchip	Per B200 GPU inside GB200
Components	1x Grace CPU + 2x B200 GPUs	1 dual-die Blackwell GPU
Process	TSMC 4NP (GPU) + 4N (CPU)	TSMC 4NP
Transistors	~416 billion combined	208 billion per GPU
GPU memory	384 GB HBM3e total (192 GB per GPU)	192 GB HBM3e (8x 24 GB stacks)
GPU bandwidth	16 TB/s aggregate	8 TB/s per GPU
CPU memory	Up to 480 GB on-package LPDDR5X	n/a
CPU memory bandwidth	~512 GB/s	n/a
NVLink-C2C (CPU-to-GPU coherent)	900 GB/s aggregate	450 GB/s per GPU to Grace
NVLink 5.0 (GPU-to-fabric)	3.6 TB/s aggregate	1.8 TB/s per GPU bidirectional
FP4 (Tensor, sparse)	~36,000 TFLOPS	~18,000 TFLOPS per GPU
FP4 (Tensor, dense)	~18,000 TFLOPS	~9,000 TFLOPS per GPU
FP8 (Tensor, sparse)	~18,000 TFLOPS	~9,000 TFLOPS per GPU
BF16 / FP16 (Tensor, sparse)	~9,000 TFLOPS	~4,500 TFLOPS per GPU
INT8 (Tensor, sparse)	~18,000 TOPS	~9,000 TOPS per GPU
TDP (superchip)	~2,700 W typical	~1,000 W per GPU
Compute capability	sm_100	sm_100
Minimum driver	R550	R550
Recommended driver (2026)	R570 stable	R570
Minimum CUDA	12.4	12.4
Minimum NCCL	2.21 (NVLink5 topology)	2.21
Minimum Transformer Engine	1.7 (FP4 paths)	1.7
Confidential Compute	Yes (CC-on attested)	Yes
Form factor	Compute tray (NVL72 slot)	Integrated in superchip

Grace LPDDR5X is high-bandwidth (~512 GB/s) but is not HBM. Treat it as a near-tier for cold weights, optimiser state offload, dataloader prefetch buffers and MoE expert routing tables — never as an extension of HBM. Allocating activations or hot weights to LPDDR will silently halve throughput because the memory tier is exposed in the unified address space without explicit cost signalling.

The NVL72 rack-scale system#

GB200 is sold as part of NVL72, NVL36 or smaller fractional reservations — the unit of compute is the rack, not the superchip. NVL72 packs 18 compute trays (two GB200 superchips per tray), 9 NVSwitch trays (each carrying two fifth-generation NVSwitch ASICs), and the supporting power, cooling and management infrastructure into a single 42U liquid-cooled rack.

The fifth-generation NVSwitch fabric is what makes NVL72 distinct from 'a cluster of GB200 servers'. Each B200 GPU contributes 1.8 TB/s of NVLink 5.0 to the fabric, and the 72-GPU domain delivers ~130 TB/s of aggregate NVLink bisection bandwidth — non-blocking, full-bandwidth, copper-only (no optical transceivers, which would otherwise dominate the rack BoM). From a programming standpoint, NVL72 is the largest single coherent NVLink domain NVIDIA ships: tensor parallelism, expert parallelism and pipeline parallelism strategies that previously had to cross InfiniBand boundaries — paying a 5-10x latency penalty per collective — now stay inside one fabric.

The other piece of NVL72 architecture is the management plane. The 'Mission Control' rack-level orchestration introduced with NVL72 handles power budgeting, coolant flow telemetry, NVLink topology discovery, firmware updates, and the hardware lifecycle that the NVIDIA GPU Operator does not yet cover end-to-end for rack-scale systems. Mission Control is the API surface a Kubernetes operator or scheduler talks to when sizing or evicting work on an NVL72.

Metric	NVL72	NVL36 (half-rack)	HGX-B200 8-GPU (compare)
GB200 superchips	36	18	n/a
B200 GPUs	72	36	8
Grace CPUs	36	18	0 (x86 host)
NVSwitch trays	9 (gen 5)	9 (gen 5, partial)	4 onboard NVSwitch ASICs
GPU memory	13.8 TB HBM3e	6.9 TB HBM3e	1.5 TB HBM3e
CPU memory	17.3 TB LPDDR5X	8.6 TB LPDDR5X	host DDR5 (variable)
Aggregate FP4 (sparse)	~1,440 PFLOPS	~720 PFLOPS	~72 PFLOPS
NVLink bisection	~130 TB/s	~65 TB/s	~14.4 TB/s
Rack power	120-132 kW	~70 kW	~10 kW (per HGX node)
Cooling	Direct-to-chip liquid mandatory	Direct-to-chip liquid mandatory	Air or rear-door HX
Network egress (per rack)	Typically 8x or 16x InfiniBand NDR/XDR	4x-8x InfiniBand NDR	8x InfiniBand NDR per node
Smallest practical replica unit	1 rack (72 GPUs)	1 half-rack (36 GPUs)	1 node (8 GPUs)

NVL72 cannot be deployed in conventional data centres. The 120-132 kW per-rack envelope is 3-4x the 30-40 kW limit of standard air-cooled colo, and direct-to-chip liquid cooling requires a secondary loop with coolant distribution units (CDUs) sized for ~150 kW per rack. Treat the data centre buildout (or the colo provider's liquid-ready rack pool) as a 6-18 month lead-time dependency, not an afterthought.

Workload pattern A: Trillion-parameter MoE training on NVL72#

The flagship NVL72 workload. A 1T-parameter mixture-of-experts model with 64-128 experts per layer, trained at FP8 with Transformer Engine and FP4 attention scaling, spans the full 72-GPU domain as a single replica. Expert parallelism, tensor parallelism and pipeline parallelism all stay inside the NVLink-5 fabric — no InfiniBand hops for inter-expert dispatch — and the 17.3 TB LPDDR5X across the 36 Grace CPUs hosts the optimiser state and the expert routing tables. Megatron-Core 0.10+ and NeMo 24.07+ explicitly target this layout; recipes are published in the NVIDIA NeMo Framework Launcher repository.

The interesting operational metric is step time. On NVL72, a 1T MoE step at 4M token batch size lands at roughly 8-12 seconds wall-clock; on an equivalent H100 cluster of 256 GPUs over InfiniBand NDR, the same step typically takes 25-40 seconds because expert dispatch collectives cross fabric boundaries. The 30-40 % step-time reduction is what justifies the GB200 capex for frontier training — over a 3-month training run, it shaves weeks off time-to-trained-model.

Framework baseline: Megatron-Core 0.10+ or NeMo 24.07+ with `--expert-model-parallel-size 8 --tensor-model-parallel-size 8 --pipeline-model-parallel-size 4` typical for 1T MoE.
Transformer Engine 1.7+ required for FP4 attention paths; FP8 weights/gradients via Hopper-compatible recipes.
NCCL 2.21+ with NVLink5 topology discovery; verify with `NCCL_DEBUG=INFO NCCL_TOPO_DUMP_FILE=/tmp/topo.xml`.
Optimiser state offload to Grace LPDDR5X via DeepSpeed ZeRO-Offload-style hooks; saves ~30 % HBM headroom on adapters and momentum buffers.
Expert routing tables and gating decisions live on the CPU (coherent access from GPU), eliminating per-step memcpy bounce.
Sustained FP8 training MFU on NVL72: 38-46 % of peak — comparable to H100 BF16 MFU but on a vastly larger model.

Workload pattern B: Reasoning-model inference at trillion-parameter scale#

The 2026 inference workload that justifies NVL72 for inference (not just training): trillion-parameter reasoning models with long thought-trace contexts and pod-wide KV-cache pools. The 13.8 TB HBM3e shared across 72 B200 GPUs at 1.8 TB/s per-GPU NVLink means a single inference replica can host 1T+ weight footprints, 1M+ token context windows, and KV caches measured in terabytes rather than gigabytes — none of which fit on a single H200 or even an 8x B200 node.

vLLM, TensorRT-LLM and SGLang gained NVL72-aware sharding through 2025. The smallest practical inference replica for a 1T reasoning model is the full 72-GPU domain; for 405B-class dense models, replicas frequently split the rack into 2-4 partitions (36-GPU or 18-GPU sub-replicas) using NVLink-aware partition scheduling. Below 405B, NVL72 is over-provisioned and discrete HGX-B200 or HGX-H200 is the right tier.

Inference framework baseline: vLLM 0.7+ with NVL72-aware partition scheduling, or TensorRT-LLM 0.13+ with `--gpus_per_node 8 --num_nodes 9` (NVL72 maps as 9 logical nodes).
KV cache: pod-wide paged allocation across 13.8 TB HBM3e; prefix caching across replicas inside the rack.
FP4 attention paths for reasoning workloads with very long thought traces; FP8 weights for stable serving.
Latency target for 1T reasoning model on NVL72: 30-80 ms first-token, 80-180 tokens/sec sustained decode at moderate concurrency.
Throughput target: 12,000-22,000 output TPS aggregate across the rack at moderate concurrency on 405B dense models.
Smaller replicas (sub-rack NVL36/NVL18) suit 405B-class dense models without paying the full 72-GPU footprint.

Sizing and capacity planning#

Sizing tables for GB200 deployments. All figures assume NVL72 with full direct-to-chip liquid cooling, NCCL 2.21+ with NVLink5 topology, Transformer Engine 1.7+ for FP4/FP8 paths, and Megatron-Core 0.10+ / vLLM 0.7+ / TensorRT-LLM 0.13+ as the framework baselines. Throughput is given in aggregate TPS across the listed footprint; verify on InferenceBench before committing capacity. The smallest practical purchase unit is NVL18 (quarter-rack); below that, discrete HGX-B200 is the better fit.

Smallest practical NVL72 purchase: 1 rack. Smallest fractional reservation: NVL18 (18 GPUs, quarter-rack).
Below NVL18, the rack-scale fabric is wasted — use discrete HGX-B200 (8 GPUs per node) over InfiniBand instead.
Inter-rack scale-out from NVL72 to multi-rack pods: InfiniBand XDR (800 Gb/s) or NVLink Switch System chassis (third-party); expect a noticeable but manageable step-down in collective performance.
MFU (model FLOPs utilisation) on FP4 training: 28-36 % of peak today; expected to climb to 38-48 % as Transformer Engine and NeMo paths mature through 2026.
Power budget headroom: NVL72 draws 120-132 kW typical, with bursts to 145 kW. Size the rack circuit at 160-180 kW for safety margin.
Cooling budget: ~150 kW per rack of CDU capacity; secondary loop at 35 C supply, 45 C return is typical.

Workload	Precision	Footprint	Throughput / step time	Notes
1T MoE training (e.g. 1T/64-expert)	FP8 + FP4 attention	1x NVL72 (72 GPUs)	8-12s step at 4M-token batch	Expert parallelism stays in NVLink
405B dense training	FP8 + Transformer Engine	1x NVL36 (36 GPUs)	4-6s step at 2M-token batch	TP=8, PP=4, EP=1
1T reasoning inference (high concurrency)	FP4 attention + FP8 weights	1x NVL72	12,000-22,000 output TPS	Pod-wide KV cache
1T reasoning inference (latency-optimised)	FP8 + FP4 attention	1x NVL72	30-80 ms first-token	Small batch, low concurrency
405B reasoning inference	FP4 + FP8	1x NVL36 (36 GPUs)	4,500-7,000 output TPS	Half-rack replica
405B dense inference	FP4 weights	1x NVL18 (18 GPUs)	2,400-3,500 output TPS	Quarter-rack replica
70B fine-tune at 1M context	BF16 + Transformer Engine	1x NVL18 (18 GPUs)	1.8s step at 256K tokens	Context length unlocked by HBM3e pool
MoE inference (Mixtral 8x22B class)	FP8	1x NVL18 (18 GPUs)	9,000-13,000 output TPS	Over-provisioned — HGX-B200 cheaper

Cost and TCO#

GB200 commercial structure differs fundamentally from H100/H200. There is no hourly on-demand market like AWS p5 or GCP a3; cloud capacity is sold as capacity blocks, multi-month reservations, or full-rack term commitments. The relevant unit of FinOps is the NVL72-rack-hour, not the GPU-hour. Capex purchase from NVIDIA Partners (Dell, HPE, Supermicro, Lenovo, Wiwynn) lands at $3.0-$3.8 million per NVL72 system; multi-rack pod deployments with InfiniBand backbones add another 15-25 % for fabric, switches and structured cabling.

On-prem TCO: roughly $18k-$26k per NVL72-rack-hour equivalent over a 3-year amortisation at typical UK/EU power prices ($0.12-$0.20/kWh) and 78-85 % steady-state utilisation.
Power: 120-132 kW typical x 8,760 hours x $0.15/kWh = $158k-$174k per rack-year just for power; cooling adds another $30k-$45k.
Capex breakeven versus 3-year cloud committed-use: roughly 24-30 months for organisations with steady > 60 % utilisation and the data centre to host the rack.
Cost-per-million-output-tokens on 1T reasoning model inference at $26k/rack-hour and 18,000 TPS aggregate: roughly $0.40 per million tokens — a fraction of API list prices for equivalent commercial reasoning models.
Multi-rack pod scaling: typically 4-16 NVL72 racks per pod interconnected over InfiniBand XDR; pod-level pricing is bespoke and dominated by reservation term and committed throughput.

Channel	Unit	Approx 2026 price	Notes
NVIDIA Partner capex	1x NVL72 system	$3.0M-$3.8M	Plus install, commissioning, 1y support
NVIDIA Partner capex	1x NVL36 system	$1.6M-$2.0M	Half-rack variant
Cloud capacity block (hyperscaler)	1x NVL72-rack-hour equivalent	$50k-$70k	Typically 1-90 day windows, prepaid
Cloud committed-use (3-year)	1x NVL72-rack-hour equivalent	$22k-$32k	Effective rate at 75 %+ utilisation
Tier-1 neocloud reservation	1x NVL36-rack-hour equivalent	$18k-$28k	Quarter-rack reservations more flexible
Yobitel NeoCloud (UK + EU, preview)	1x NVL72-rack-hour equivalent	$24k-$32k (3y committed)	NCSC OFFICIAL-aligned, liquid-cooled racks; NVL18+ reservations.
Yobitel Omniscient Compute	Per-replica (NVL18+)	Market-clearing	Cross-provider arbitrage on top of NeoCloud + partner capacity; FOCUS-conformant billing.
TCO opex (3-year, capex amortised)	1x NVL72-rack-hour equivalent	$18k-$26k	Includes power, cooling, ops, replacement

Cost figures land on the FinOps Foundation FOCUS billing spec when consumed via Yobitel Omniscient Compute: ServiceName=`AcceleratorCompute`, ChargeCategory=`Usage`, SkuId=`gpu.gb200.nvl72.rack` or `gpu.gb200.nvl18.partition`. The rack/partition is the billed unit, not the individual GPU.

Migration and alternatives#

GB200 is the right unit of compute for a specific class of workloads — frontier MoE training, trillion-parameter reasoning inference, and any job that genuinely needs more than 8 NVLink-coherent GPUs. For everything else, discrete B200 (HGX-B200) or H200 is the better fit at materially lower cost and operational complexity.

Heuristic 1: if the unit of work is 8 GPUs or fewer, GB200 is overprovisioned — use discrete HGX-B200 at a fraction of the operational complexity.
Heuristic 2: if you have not yet committed to direct-to-chip liquid cooling and 120+ kW racks in your data centre, the lead time for the buildout exceeds the lead time for the GPUs.
Heuristic 3: if any part of your stack has x86-only precompiled binaries that cannot be rebuilt for aarch64 (closed-source profilers, vendor security agents, some commercial training frameworks), GB200 will block deployment until those dependencies are resolved.
Heuristic 4: if your workload uses Megatron-Core, NeMo, vLLM, TensorRT-LLM or SGLang as the framework baseline, the NVL72-aware paths are mature enough for production; older frameworks may not benefit from the rack-scale fabric without reworking.

From / to	When it pays	Migration effort	Key incompatibility
H100/H200 -> discrete B200	Need FP4 or 8 TB/s HBM3e per GPU	Medium (CUDA 12.4+, FP4 calibration)	MX formats; some kernels need rework
Discrete B200 (HGX) -> GB200 NVL72	Workload exceeds 8-GPU NVLink domain	High (rack-scale data centre buildout)	120 kW racks, liquid cooling, ARM host
H100/H200 cluster -> GB200 NVL72	Frontier MoE training or trillion-param inference	High (full stack change)	ARM host (aarch64), Mission Control, rack-scale ops
GB200 NVL72 -> GB300 NVL72	Mid-cycle refresh, more HBM3e or higher clocks	Trivial (drop-in for new procurement)	Refresh-only — no live migration path
GB200 -> x86 + InfiniBand cluster	Stuck on x86-only legacy stack components	n/a (decision against GB200)	ARM CPU is non-negotiable on GB200

Pitfalls and operational notes#

ARM ISA — Grace runs aarch64. Most CUDA-Python stacks port cleanly, but any precompiled x86-only binaries (closed-source profilers, vendor agents, some commercial training frameworks) must be rebuilt or replaced. Run an aarch64 compatibility audit before committing to GB200.
Rack power — NVL72 lands at 120-132 kW typical with 145 kW bursts. This is well above conventional data centre rack budgets (30-40 kW). Bespoke power distribution required; 415V 3-phase typical, with 240V variants in North American deployments.
Cooling — full direct-to-chip liquid is non-negotiable. Rear-door heat exchangers cannot absorb NVL72 thermal load. Secondary loop sized at ~150 kW per rack of CDU capacity. Coolant supply at 35-40 C, return at 45-50 C typical.
Software maturity — the full NVL72 shared-memory programming model is most useful with Megatron-Core, NeMo, vLLM and TensorRT-LLM paths that explicitly target it. Naive DDP or FSDP-1 workloads gain little over discrete B200 because they do not exploit the rack-scale fabric.
Coherent does not mean uniform — LPDDR5X (~512 GB/s) and HBM3e (8 TB/s per GPU) are tiered. Allocating activations or hot weights to LPDDR will silently halve throughput because the cost is not exposed in the unified memory API.
Mission Control is the rack-level control plane — the NVIDIA GPU Operator does not yet cover rack-scale lifecycle end-to-end. Kubernetes integrations talk to Mission Control through its API for power budgeting, NVLink topology and firmware updates.
NVLink5 topology errors at the rack scale are operationally distinct from H100 NVLink errors. A single NVSwitch tray failure can drop the rack from 130 TB/s to 100 TB/s bisection; alerting must monitor per-switch fabric health, not just per-GPU.
Lead time — GB200 systems in 2026 still have 12-24 week procurement lead times. Data centre buildout (liquid cooling, power distribution, structural floor loading for ~1,400 kg per rack) typically takes 6-18 months for greenfield space.
Spot/preemptible market — does not exist for GB200. The shortest commercial commitment is typically a 24-hour capacity block on the largest hyperscalers; everything else is multi-week reservation or capex.

Where this fits in the Yobitel stack#

GB200 is the frontier tier in the Yobitel stack from 2026 onward. Yobibyte — our AI-native platform — exposes NVL72 (and NVL36/NVL18 partitions) as a first-class scheduling target for training and inference workspaces that genuinely need rack-scale fabric. The platform's placement layer is aware of NVLink5 topology, Grace coherence and the difference between rack-local and pod-spanning collectives; it will refuse to place a workload that does not benefit from NVL72 onto NVL72 capacity, defaulting it instead to discrete HGX-B200 or H200 nodes.

Omniscient Compute — our cross-cloud capacity broker — indexes GB200 capacity blocks and committed-use reservations across every connected provider that offers NVL72 (the major hyperscalers and Tier-1 neoclouds that have completed liquid-ready data centre buildouts). Because GB200 capacity is reservation-priced rather than spot-priced, the broker's role is matching workspace utilisation patterns to reservation tranches rather than per-hour arbitrage — the FinOps calculus differs fundamentally from H100/L40S.

InferenceBench — our public, reproducible benchmarking harness — publishes NVL72 throughput and latency figures for trillion-parameter reasoning models and 405B dense models on vLLM, TensorRT-LLM and SGLang. The GB200 sizing tables in this entry are anchored on InferenceBench runs; if you are evaluating an NVL72 deployment, start with InferenceBench to verify the model and framework actually exploit the rack-scale fabric before committing to the capex.

References

NVIDIA GB200 NVL72 Product Page · NVIDIA
Grace CPU Architecture Whitepaper · NVIDIA
Blackwell Architecture Whitepaper · NVIDIA
NVLink Switch System (gen 5) · NVIDIA
Transformer Engine FP4 user guide · NVIDIA
NeMo Framework Launcher (NVL72 recipes) · NVIDIA
FinOps Foundation FOCUS billing specification · FinOps Foundation

TL;DR

Coherent compute module pairing one 72-core Grace ARMv9 CPU with two Blackwell B200 GPUs over NVLink-C2C at 900 GB/s — announced at GTC March 2024, in volume from mid-2025, the headline NVIDIA platform for frontier MoE training and trillion-parameter reasoning inference in 2026.
Per superchip: 2x B200 GPUs at 192 GB HBM3e each (384 GB total at 16 TB/s aggregate), 1x Grace CPU at 72 Neoverse V2 cores with up to 480 GB LPDDR5X (512 GB/s), full cache coherence across the C2C link. Per GPU: 1.8 TB/s NVLink 5.0 to fabric, ~9,000 TFLOPS FP4 sparse, ~4,500 TFLOPS FP8 sparse.
NVL72: 36 GB200 superchips (72 B200 GPUs + 36 Grace CPUs) in a single liquid-cooled rack with 9 fifth-generation NVSwitch trays, all wired over copper NVLink at 1.8 TB/s per GPU — the largest coherent NVLink domain NVIDIA ships, 13.8 TB HBM3e shared, 130 TB/s aggregate NVLink bisection.
Rack power envelope: 120-132 kW per NVL72 depending on configuration — well above the 30-40 kW limit of conventional air-cooled data centres. Full direct-to-chip liquid cooling is mandatory; rear-door heat exchangers cannot absorb NVL72 thermal load.
Unit of compute is the rack, not the card. GB200 is sold as full NVL72 systems or fractional NVL36 / NVL18 reservations; cloud rental pricing in 2026 typically resolves to $50,000-$70,000 per NVL72-equivalent-hour through long reservations and capacity blocks — there is no commodity on-demand $/GPU-hour market like there is for H100.

Overview#

How it works: Grace, Blackwell, and the C2C coherence link#

GB200 superchip composition: 1x Grace CPU + 2x B200 GPUs on a single carrier, coherent across NVLink-C2C.
Grace die: 72 ARMv9 Neoverse V2 cores, 117 MB L3, 480 GB on-package LPDDR5X at ~512 GB/s, ECC throughout, SVE2 vector unit.
B200 die-pair: 2 dies bonded via 10 TB/s NV-HBI fabric presenting as one CUDA device; 192 GB HBM3e at 8 TB/s per GPU.
C2C coherent link: 900 GB/s aggregate, full cache coherence, unified virtual memory across CPU and both GPUs.
Per-GPU NVLink 5.0 to fabric: 1.8 TB/s bidirectional (twice H100/H200 NVLink 4.0).
Compute capability: sm_100 (Blackwell) for both GPUs in the superchip.
Module TDP: ~2,700 W typical (varies with rack configuration and workload).

Reference: per-superchip specification sheet#

Metric	Per GB200 superchip	Per B200 GPU inside GB200
Components	1x Grace CPU + 2x B200 GPUs	1 dual-die Blackwell GPU
Process	TSMC 4NP (GPU) + 4N (CPU)	TSMC 4NP
Transistors	~416 billion combined	208 billion per GPU
GPU memory	384 GB HBM3e total (192 GB per GPU)	192 GB HBM3e (8x 24 GB stacks)
GPU bandwidth	16 TB/s aggregate	8 TB/s per GPU
CPU memory	Up to 480 GB on-package LPDDR5X	n/a
CPU memory bandwidth	~512 GB/s	n/a
NVLink-C2C (CPU-to-GPU coherent)	900 GB/s aggregate	450 GB/s per GPU to Grace
NVLink 5.0 (GPU-to-fabric)	3.6 TB/s aggregate	1.8 TB/s per GPU bidirectional
FP4 (Tensor, sparse)	~36,000 TFLOPS	~18,000 TFLOPS per GPU
FP4 (Tensor, dense)	~18,000 TFLOPS	~9,000 TFLOPS per GPU
FP8 (Tensor, sparse)	~18,000 TFLOPS	~9,000 TFLOPS per GPU
BF16 / FP16 (Tensor, sparse)	~9,000 TFLOPS	~4,500 TFLOPS per GPU
INT8 (Tensor, sparse)	~18,000 TOPS	~9,000 TOPS per GPU
TDP (superchip)	~2,700 W typical	~1,000 W per GPU
Compute capability	sm_100	sm_100
Minimum driver	R550	R550
Recommended driver (2026)	R570 stable	R570
Minimum CUDA	12.4	12.4
Minimum NCCL	2.21 (NVLink5 topology)	2.21
Minimum Transformer Engine	1.7 (FP4 paths)	1.7
Confidential Compute	Yes (CC-on attested)	Yes
Form factor	Compute tray (NVL72 slot)	Integrated in superchip

The NVL72 rack-scale system#

Metric	NVL72	NVL36 (half-rack)	HGX-B200 8-GPU (compare)
GB200 superchips	36	18	n/a
B200 GPUs	72	36	8
Grace CPUs	36	18	0 (x86 host)
NVSwitch trays	9 (gen 5)	9 (gen 5, partial)	4 onboard NVSwitch ASICs
GPU memory	13.8 TB HBM3e	6.9 TB HBM3e	1.5 TB HBM3e
CPU memory	17.3 TB LPDDR5X	8.6 TB LPDDR5X	host DDR5 (variable)
Aggregate FP4 (sparse)	~1,440 PFLOPS	~720 PFLOPS	~72 PFLOPS
NVLink bisection	~130 TB/s	~65 TB/s	~14.4 TB/s
Rack power	120-132 kW	~70 kW	~10 kW (per HGX node)
Cooling	Direct-to-chip liquid mandatory	Direct-to-chip liquid mandatory	Air or rear-door HX
Network egress (per rack)	Typically 8x or 16x InfiniBand NDR/XDR	4x-8x InfiniBand NDR	8x InfiniBand NDR per node
Smallest practical replica unit	1 rack (72 GPUs)	1 half-rack (36 GPUs)	1 node (8 GPUs)

Workload pattern A: Trillion-parameter MoE training on NVL72#

Framework baseline: Megatron-Core 0.10+ or NeMo 24.07+ with `--expert-model-parallel-size 8 --tensor-model-parallel-size 8 --pipeline-model-parallel-size 4` typical for 1T MoE.
Transformer Engine 1.7+ required for FP4 attention paths; FP8 weights/gradients via Hopper-compatible recipes.
NCCL 2.21+ with NVLink5 topology discovery; verify with `NCCL_DEBUG=INFO NCCL_TOPO_DUMP_FILE=/tmp/topo.xml`.
Optimiser state offload to Grace LPDDR5X via DeepSpeed ZeRO-Offload-style hooks; saves ~30 % HBM headroom on adapters and momentum buffers.
Expert routing tables and gating decisions live on the CPU (coherent access from GPU), eliminating per-step memcpy bounce.
Sustained FP8 training MFU on NVL72: 38-46 % of peak — comparable to H100 BF16 MFU but on a vastly larger model.

Workload pattern B: Reasoning-model inference at trillion-parameter scale#

Inference framework baseline: vLLM 0.7+ with NVL72-aware partition scheduling, or TensorRT-LLM 0.13+ with `--gpus_per_node 8 --num_nodes 9` (NVL72 maps as 9 logical nodes).
KV cache: pod-wide paged allocation across 13.8 TB HBM3e; prefix caching across replicas inside the rack.
FP4 attention paths for reasoning workloads with very long thought traces; FP8 weights for stable serving.
Latency target for 1T reasoning model on NVL72: 30-80 ms first-token, 80-180 tokens/sec sustained decode at moderate concurrency.
Throughput target: 12,000-22,000 output TPS aggregate across the rack at moderate concurrency on 405B dense models.
Smaller replicas (sub-rack NVL36/NVL18) suit 405B-class dense models without paying the full 72-GPU footprint.

Sizing and capacity planning#

Smallest practical NVL72 purchase: 1 rack. Smallest fractional reservation: NVL18 (18 GPUs, quarter-rack).
Below NVL18, the rack-scale fabric is wasted — use discrete HGX-B200 (8 GPUs per node) over InfiniBand instead.
Inter-rack scale-out from NVL72 to multi-rack pods: InfiniBand XDR (800 Gb/s) or NVLink Switch System chassis (third-party); expect a noticeable but manageable step-down in collective performance.
MFU (model FLOPs utilisation) on FP4 training: 28-36 % of peak today; expected to climb to 38-48 % as Transformer Engine and NeMo paths mature through 2026.
Power budget headroom: NVL72 draws 120-132 kW typical, with bursts to 145 kW. Size the rack circuit at 160-180 kW for safety margin.
Cooling budget: ~150 kW per rack of CDU capacity; secondary loop at 35 C supply, 45 C return is typical.

Workload	Precision	Footprint	Throughput / step time	Notes
1T MoE training (e.g. 1T/64-expert)	FP8 + FP4 attention	1x NVL72 (72 GPUs)	8-12s step at 4M-token batch	Expert parallelism stays in NVLink
405B dense training	FP8 + Transformer Engine	1x NVL36 (36 GPUs)	4-6s step at 2M-token batch	TP=8, PP=4, EP=1
1T reasoning inference (high concurrency)	FP4 attention + FP8 weights	1x NVL72	12,000-22,000 output TPS	Pod-wide KV cache
1T reasoning inference (latency-optimised)	FP8 + FP4 attention	1x NVL72	30-80 ms first-token	Small batch, low concurrency
405B reasoning inference	FP4 + FP8	1x NVL36 (36 GPUs)	4,500-7,000 output TPS	Half-rack replica
405B dense inference	FP4 weights	1x NVL18 (18 GPUs)	2,400-3,500 output TPS	Quarter-rack replica
70B fine-tune at 1M context	BF16 + Transformer Engine	1x NVL18 (18 GPUs)	1.8s step at 256K tokens	Context length unlocked by HBM3e pool
MoE inference (Mixtral 8x22B class)	FP8	1x NVL18 (18 GPUs)	9,000-13,000 output TPS	Over-provisioned — HGX-B200 cheaper

Cost and TCO#

On-prem TCO: roughly $18k-$26k per NVL72-rack-hour equivalent over a 3-year amortisation at typical UK/EU power prices ($0.12-$0.20/kWh) and 78-85 % steady-state utilisation.
Power: 120-132 kW typical x 8,760 hours x $0.15/kWh = $158k-$174k per rack-year just for power; cooling adds another $30k-$45k.
Capex breakeven versus 3-year cloud committed-use: roughly 24-30 months for organisations with steady > 60 % utilisation and the data centre to host the rack.
Cost-per-million-output-tokens on 1T reasoning model inference at $26k/rack-hour and 18,000 TPS aggregate: roughly $0.40 per million tokens — a fraction of API list prices for equivalent commercial reasoning models.
Multi-rack pod scaling: typically 4-16 NVL72 racks per pod interconnected over InfiniBand XDR; pod-level pricing is bespoke and dominated by reservation term and committed throughput.

Channel	Unit	Approx 2026 price	Notes
NVIDIA Partner capex	1x NVL72 system	$3.0M-$3.8M	Plus install, commissioning, 1y support
NVIDIA Partner capex	1x NVL36 system	$1.6M-$2.0M	Half-rack variant
Cloud capacity block (hyperscaler)	1x NVL72-rack-hour equivalent	$50k-$70k	Typically 1-90 day windows, prepaid
Cloud committed-use (3-year)	1x NVL72-rack-hour equivalent	$22k-$32k	Effective rate at 75 %+ utilisation
Tier-1 neocloud reservation	1x NVL36-rack-hour equivalent	$18k-$28k	Quarter-rack reservations more flexible
Yobitel NeoCloud (UK + EU, preview)	1x NVL72-rack-hour equivalent	$24k-$32k (3y committed)	NCSC OFFICIAL-aligned, liquid-cooled racks; NVL18+ reservations.
Yobitel Omniscient Compute	Per-replica (NVL18+)	Market-clearing	Cross-provider arbitrage on top of NeoCloud + partner capacity; FOCUS-conformant billing.
TCO opex (3-year, capex amortised)	1x NVL72-rack-hour equivalent	$18k-$26k	Includes power, cooling, ops, replacement

Migration and alternatives#

Heuristic 1: if the unit of work is 8 GPUs or fewer, GB200 is overprovisioned — use discrete HGX-B200 at a fraction of the operational complexity.
Heuristic 2: if you have not yet committed to direct-to-chip liquid cooling and 120+ kW racks in your data centre, the lead time for the buildout exceeds the lead time for the GPUs.
Heuristic 3: if any part of your stack has x86-only precompiled binaries that cannot be rebuilt for aarch64 (closed-source profilers, vendor security agents, some commercial training frameworks), GB200 will block deployment until those dependencies are resolved.
Heuristic 4: if your workload uses Megatron-Core, NeMo, vLLM, TensorRT-LLM or SGLang as the framework baseline, the NVL72-aware paths are mature enough for production; older frameworks may not benefit from the rack-scale fabric without reworking.

From / to	When it pays	Migration effort	Key incompatibility
H100/H200 -> discrete B200	Need FP4 or 8 TB/s HBM3e per GPU	Medium (CUDA 12.4+, FP4 calibration)	MX formats; some kernels need rework
Discrete B200 (HGX) -> GB200 NVL72	Workload exceeds 8-GPU NVLink domain	High (rack-scale data centre buildout)	120 kW racks, liquid cooling, ARM host
H100/H200 cluster -> GB200 NVL72	Frontier MoE training or trillion-param inference	High (full stack change)	ARM host (aarch64), Mission Control, rack-scale ops
GB200 NVL72 -> GB300 NVL72	Mid-cycle refresh, more HBM3e or higher clocks	Trivial (drop-in for new procurement)	Refresh-only — no live migration path
GB200 -> x86 + InfiniBand cluster	Stuck on x86-only legacy stack components	n/a (decision against GB200)	ARM CPU is non-negotiable on GB200

Pitfalls and operational notes#

ARM ISA — Grace runs aarch64. Most CUDA-Python stacks port cleanly, but any precompiled x86-only binaries (closed-source profilers, vendor agents, some commercial training frameworks) must be rebuilt or replaced. Run an aarch64 compatibility audit before committing to GB200.
Rack power — NVL72 lands at 120-132 kW typical with 145 kW bursts. This is well above conventional data centre rack budgets (30-40 kW). Bespoke power distribution required; 415V 3-phase typical, with 240V variants in North American deployments.
Cooling — full direct-to-chip liquid is non-negotiable. Rear-door heat exchangers cannot absorb NVL72 thermal load. Secondary loop sized at ~150 kW per rack of CDU capacity. Coolant supply at 35-40 C, return at 45-50 C typical.
Software maturity — the full NVL72 shared-memory programming model is most useful with Megatron-Core, NeMo, vLLM and TensorRT-LLM paths that explicitly target it. Naive DDP or FSDP-1 workloads gain little over discrete B200 because they do not exploit the rack-scale fabric.
Coherent does not mean uniform — LPDDR5X (~512 GB/s) and HBM3e (8 TB/s per GPU) are tiered. Allocating activations or hot weights to LPDDR will silently halve throughput because the cost is not exposed in the unified memory API.
Mission Control is the rack-level control plane — the NVIDIA GPU Operator does not yet cover rack-scale lifecycle end-to-end. Kubernetes integrations talk to Mission Control through its API for power budgeting, NVLink topology and firmware updates.
NVLink5 topology errors at the rack scale are operationally distinct from H100 NVLink errors. A single NVSwitch tray failure can drop the rack from 130 TB/s to 100 TB/s bisection; alerting must monitor per-switch fabric health, not just per-GPU.
Lead time — GB200 systems in 2026 still have 12-24 week procurement lead times. Data centre buildout (liquid cooling, power distribution, structural floor loading for ~1,400 kg per rack) typically takes 6-18 months for greenfield space.
Spot/preemptible market — does not exist for GB200. The shortest commercial commitment is typically a 24-hour capacity block on the largest hyperscalers; everything else is multi-week reservation or capex.

Where this fits in the Yobitel stack#

References

NVIDIA GB200 NVL72 Product Page · NVIDIA
Grace CPU Architecture Whitepaper · NVIDIA
Blackwell Architecture Whitepaper · NVIDIA
NVLink Switch System (gen 5) · NVIDIA
Transformer Engine FP4 user guide · NVIDIA
NeMo Framework Launcher (NVL72 recipes) · NVIDIA
FinOps Foundation FOCUS billing specification · FinOps Foundation

NVIDIA GB200 Grace Blackwell Superchip

Overview#

How it works: Grace, Blackwell, and the C2C coherence link#

Reference: per-superchip specification sheet#

The NVL72 rack-scale system#

Workload pattern A: Trillion-parameter MoE training on NVL72#

Workload pattern B: Reasoning-model inference at trillion-parameter scale#

Sizing and capacity planning#

Cost and TCO#

Migration and alternatives#

Pitfalls and operational notes#

Where this fits in the Yobitel stack#

References

Browse all entries

Deploy on Yobitel

NVIDIA GB200 Grace Blackwell Superchip

Overview#

How it works: Grace, Blackwell, and the C2C coherence link#

Reference: per-superchip specification sheet#

The NVL72 rack-scale system#

Workload pattern A: Trillion-parameter MoE training on NVL72#

Workload pattern B: Reasoning-model inference at trillion-parameter scale#

Sizing and capacity planning#

Cost and TCO#

Migration and alternatives#

Pitfalls and operational notes#

Where this fits in the Yobitel stack#

References

Browse all entries

Deploy on Yobitel