TL;DR
- First-generation Blackwell SKU positioned as a 700 W drop-in for HGX chassis that previously hosted H100 or H200.
- Uses the same dual-die Blackwell package as B200 but with reduced clocks and a lower TDP ceiling, trading peak throughput for thermal compatibility.
- Carries the second-generation Transformer Engine with native FP4 support and a fifth-generation NVLink at 1.8 TB/s per GPU.
- Largely superseded by B200 in new deployments; B100 sees most use in upgrade paths where rack-level cooling cannot absorb 1,000 W per GPU.
Overview
The B100 is the Blackwell variant aimed at air-cooled chassis. Announced at GTC 2024 alongside B200, it shares the same dual-reticle Blackwell silicon (two GPU dies connected by a 10 TB/s NV-HBI link) but operates within a 700 W envelope, so existing HGX-H100 baseboards and chassis can be upgraded without redesigning rack cooling.
In practice, most new Blackwell deployments specified B200 or GB200 directly. B100 occupies a narrow niche: customers with substantial investment in 700 W-class air-cooled infrastructure who want Blackwell's FP4 capability and the larger HBM3e capacity without rebuilding their data centre.
Specifications
| Metric | B100 SXM |
|---|---|
| Architecture | Blackwell (dual-die) |
| Process | TSMC 4NP |
| Memory | 192 GB HBM3e |
| Memory bandwidth | 8 TB/s |
| FP8 (Tensor, sparse) | ~7,000 TFLOPS |
| FP4 (Tensor, sparse) | ~14,000 TFLOPS |
| NVLink (fifth generation) | 1.8 TB/s per GPU |
| TDP | 700 W |
| Form factor | SXM (HGX-compatible) |
Published FP8 and FP4 figures for B100 vary by source, so treat absolute numbers as approximate. The qualitative picture is robust: B100 delivers roughly 70 to 80 % of B200 throughput at about 70 % of the power.
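The throughput-versus-power comparison can be checked with a few lines of arithmetic. The sketch below takes the B100 values from the table; the B200 values (9,000 and 18,000 sparse TFLOPS at 1,000 W) are assumptions based on NVIDIA's HGX B200 launch figures and are not stated in this article, so substitute whichever source you trust.

```python
# B100 figures come from the specification table above.
B100 = {"fp8_tflops": 7_000, "fp4_tflops": 14_000, "tdp_w": 700}
# ASSUMPTION: B200 figures are launch-era HGX B200 numbers, not from this article.
B200 = {"fp8_tflops": 9_000, "fp4_tflops": 18_000, "tdp_w": 1_000}


def ratio(metric: str) -> float:
    """Return the B100 value divided by the B200 value for one metric."""
    return B100[metric] / B200[metric]


def tflops_per_watt(sku: dict, metric: str) -> float:
    """Return peak TFLOPS per watt of TDP for one SKU and one metric."""
    return sku[metric] / sku["tdp_w"]


for metric in ("fp8_tflops", "fp4_tflops", "tdp_w"):
    print(f"{metric}: B100/B200 = {ratio(metric):.2f}")

print(f"FP4 TFLOPS/W: B100 = {tflops_per_watt(B100, 'fp4_tflops'):.1f}, "
      f"B200 = {tflops_per_watt(B200, 'fp4_tflops'):.1f}")
```

Under these assumed inputs, peak efficiency per watt comes out slightly in B100's favour, which is what lower clocks would suggest; it says nothing about delivered throughput on a real model.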
Blackwell Innovations Carried Forward
Even at reduced clocks, B100 inherits the full Blackwell feature set. The second-generation Transformer Engine adds FP4 (E2M1) and microscaling MX formats, the dual-die package presents a single CUDA device with a coherent HBM pool, and the decompression engine accelerates LZ4 and Snappy paths used in data-loading pipelines.
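To make the E2M1 layout concrete, the following sketch decodes all 4-bit codes by hand: one sign bit, two exponent bits (bias 1) and one mantissa bit, with no infinities or NaNs. The function name `decode_e2m1` is hypothetical and exists only for illustration; it is not part of any NVIDIA library.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (sign | 2-bit exponent | 1-bit mantissa)."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        # Subnormal range: no implicit leading one, so only 0 and 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal range: implicit leading one, exponent bias of 1.
        magnitude = 2.0 ** (exponent - 1) * (1 + 0.5 * mantissa)
    return sign * magnitude


positive = [decode_e2m1(code) for code in range(8)]
print(positive)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

With only eight magnitudes available, the shared scale factors supplied by the microscaling formats do most of the work of covering a tensor's dynamic range.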
The fifth-generation NVLink at 1.8 TB/s per GPU is twice the H100 rate. Combined with a refreshed NVSwitch ASIC, NVL72-class racks scale to 72 GPUs at full bisection — a step change in pod-level fabric headroom over Hopper.
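The fabric figures above reduce to simple arithmetic. The sketch below derives the implied H100 rate and the aggregate NVLink bandwidth of a 72-GPU domain, then estimates an idealised transfer time. Treating the headline 1.8 TB/s as usable one-way bandwidth is an optimistic assumption made only for illustration, so read the result as a lower bound.

```python
NVLINK5_TB_PER_S = 1.8   # per-GPU figure quoted in the text
GPUS_PER_DOMAIN = 72     # NVL72-class rack

# "Twice the H100 rate" implies the Hopper per-GPU figure.
h100_tb_per_s = NVLINK5_TB_PER_S / 2

# Sum of per-GPU NVLink bandwidth across the whole domain.
aggregate_tb_per_s = NVLINK5_TB_PER_S * GPUS_PER_DOMAIN


def ideal_transfer_seconds(gigabytes: float, tb_per_s: float) -> float:
    """Lower-bound transfer time, ignoring latency and protocol overhead."""
    return gigabytes / (tb_per_s * 1000)


print(f"Implied H100 NVLink rate: {h100_tb_per_s:.1f} TB/s")
print(f"Aggregate over {GPUS_PER_DOMAIN} GPUs: {aggregate_tb_per_s:.1f} TB/s")
print(f"Moving 192 GB at 1.8 TB/s: "
      f"{ideal_transfer_seconds(192, NVLINK5_TB_PER_S) * 1000:.0f} ms at best")
```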
When B100 Makes Sense
- Brownfield upgrades of HGX-H100 racks where rack power and cooling are already provisioned at 700 W per GPU.
- Workloads that benefit from FP4 inference or the dual-die memory pool but are not throughput-limited at the per-GPU level.
- Hybrid clusters mixing Hopper and Blackwell where matching the H100 thermal envelope simplifies operations.
- Counter-case: if your facility supports liquid cooling and 1,000+ W per GPU, B200 is the better choice on almost every axis.
- Counter-case: if you need the maximum-density 'Grace + Blackwell' shared-memory super-pod, GB200 NVL72 is the only option.
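The selection criteria in the list above can be written down as a small decision helper. This is a sketch of the article's rules of thumb only: the `Site` fields and the `suggest_sku` function are hypothetical names, and a real procurement decision would weigh price, availability and workload as well.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical description of a deployment site."""
    watts_per_gpu: int                  # provisioned power and cooling per GPU
    liquid_cooling: bool                # whether liquid cooling is available
    needs_grace_superpod: bool = False  # needs 'Grace + Blackwell' shared memory


def suggest_sku(site: Site) -> str:
    """Apply the article's rules of thumb in priority order."""
    if site.needs_grace_superpod:
        return "GB200 NVL72"
    if site.liquid_cooling and site.watts_per_gpu >= 1000:
        return "B200"
    if site.watts_per_gpu >= 700:
        return "B100"
    return "no Blackwell SXM fit"


brownfield = Site(watts_per_gpu=700, liquid_cooling=False)
print(suggest_sku(brownfield))  # B100
```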
Pitfalls
- Treating B100 as 'just a slower B200' understates the inference cost gap: compared with H200 at equal cost, B100 usually delivers a smaller win than headline FP4 numbers imply.
- FP4 weight quantisation requires careful per-tensor or per-channel scaling; naively casting BF16 weights to FP4 silently regresses accuracy on most production models.
- Software stack maturity for Blackwell lagged Hopper through 2024-2025; check kernel coverage in vLLM, TensorRT-LLM and SGLang for your specific model before committing.
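The FP4 scaling pitfall is easy to reproduce without a GPU. The sketch below snaps synthetic weights onto the E2M1 value grid twice: once by casting directly, and once with one scale per row (a stand-in for per-channel scaling). The weight distribution, the helper names and the max-absolute-value scaling rule are all assumptions chosen for illustration; this is not how TensorRT-LLM or any specific library quantises, and hardware MX formats apply scales per small block instead of per row.

```python
import random

# Magnitudes representable in FP4 E2M1.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def snap(x: float) -> float:
    """Round to the nearest E2M1 magnitude, saturating at 6, keeping the sign."""
    magnitude = min(abs(x), 6.0)
    nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
    return -nearest if x < 0 else nearest


def quantize_naive(row: list) -> list:
    """Cast each weight straight onto the FP4 grid with no scaling."""
    return [snap(w) for w in row]


def quantize_scaled(row: list) -> list:
    """Scale the row so its largest weight maps to 6, snap, then rescale."""
    amax = max(abs(w) for w in row)
    if amax == 0.0:
        return list(row)
    scale = amax / 6.0
    return [snap(w / scale) * scale for w in row]


def mse(a: list, b: list) -> float:
    """Mean squared error between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# ASSUMPTION: small Gaussian weights, loosely typical of trained linear layers.
weights = [[random.gauss(0.0, 0.02) for _ in range(256)] for _ in range(4)]

naive_err = sum(mse(row, quantize_naive(row)) for row in weights)
scaled_err = sum(mse(row, quantize_scaled(row)) for row in weights)
print(f"naive MSE = {naive_err:.2e}, per-row scaled MSE = {scaled_err:.2e}")
```

Because every synthetic weight here is far below 0.25, the naive cast rounds the entire matrix to zero, while the scaled version keeps a usable approximation. Real models fail less dramatically, but the mechanism is the same.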
Software Notes
Native Blackwell compilation targets arrived with CUDA 12.8 and the R570 driver branch, so treat those as the practical minimum and confirm against NVIDIA's release notes for your toolkit. TensorRT-LLM, vLLM, SGLang and Triton all gained Blackwell support through 2024-2025, and Megatron-Core added FP4 training support in early 2026. Most Hopper-tuned kernels recompile cleanly but do not exploit FP4 or the new MX formats without explicit changes.