NVIDIA B100 Tensor Core GPU

TL;DR

First-generation Blackwell SKU positioned as a 700 W drop-in for HGX chassis that previously hosted H100 or H200.
Uses the same dual-die Blackwell package as B200 but with reduced clocks and a lower TDP ceiling, trading peak throughput for thermal compatibility.
Carries the second-generation Transformer Engine with native FP4 support and a fifth-generation NVLink at 1.8 TB/s per GPU.
Largely superseded by B200 in new deployments; B100 sees most use in upgrade paths where rack-level cooling cannot absorb 1,000 W per GPU.

Overview

The B100 is the air-cooled-friendly Blackwell variant. Announced at GTC 2024 alongside B200, it shares the same dual-reticle Blackwell silicon — two GPU dies connected by a 10 TB/s NV-HBI link — but operates within a 700 W envelope so that existing HGX-H100 baseboards and chassis can be upgraded without redesigning rack cooling.

In practice, most new Blackwell deployments specified B200 or GB200 directly. B100 occupies a narrow niche: customers with substantial investment in 700 W-class air-cooled infrastructure who want Blackwell's FP4 capability and the larger HBM3e capacity without rebuilding their data centre.

Specifications

Metric	B100 SXM
Architecture	Blackwell (dual-die)
Process	TSMC 4NP
Memory	192 GB HBM3e
Memory bandwidth	8 TB/s
FP8 (Tensor, sparse)	~7,000 TFLOPS
FP4 (Tensor, sparse)	~14,000 TFLOPS
NVLink (fifth generation)	1.8 TB/s per GPU
TDP	700 W
Form factor	SXM (HGX-compatible)

Exact FP8/FP4 figures for B100 vary by published source. The qualitative picture — ~70 % of B200 throughput at 70 % of the power — is robust; treat absolute numbers as approximate.
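The "70 % of the throughput at 70 % of the power" picture implies that B100 and B200 land at roughly the same throughput per watt. A minimal sketch of that arithmetic, using the approximate table values above; the B200 throughput here is not a published figure but is back-derived from the 70 % ratio, and the 1,000 W B200 envelope is taken from the text:

```python
# Approximate B100 values from the specification table (sparse Tensor throughput).
B100_FP4_TFLOPS = 14_000
B100_TDP_W = 700

# Assumptions: B200 runs at 1,000 W and B100 delivers ~70 % of its throughput.
B200_TDP_W = 1_000
B100_RELATIVE_THROUGHPUT = 0.70


def tflops_per_watt(tflops: float, watts: float) -> float:
    """Throughput efficiency in TFLOPS per watt of TDP."""
    return tflops / watts


# Efficiency of B100 straight from the table.
b100_fp4_eff = tflops_per_watt(B100_FP4_TFLOPS, B100_TDP_W)

# B200 throughput implied by the 70 % ratio, then its efficiency.
implied_b200_fp4 = B100_FP4_TFLOPS / B100_RELATIVE_THROUGHPUT
b200_fp4_eff = tflops_per_watt(implied_b200_fp4, B200_TDP_W)

print(f"B100: {b100_fp4_eff:.1f} TFLOPS/W, implied B200: {b200_fp4_eff:.1f} TFLOPS/W")
```

Under these assumptions the two parts come out at about the same efficiency, so the choice between them is driven by facility power and cooling rather than by performance per watt.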

Blackwell Innovations Carried Forward

Even at reduced clocks, B100 inherits the full Blackwell feature set. The second-generation Transformer Engine adds FP4 (E2M1) and microscaling MX formats, the dual-die package presents a single CUDA device with a coherent HBM pool, and the decompression engine accelerates LZ4 and Snappy paths used in data-loading pipelines.
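To make the E2M1 layout concrete, the sketch below decodes all 4-bit codes in pure Python. It assumes the standard E2M1 encoding (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1, no infinities or NaNs); the function name `decode_e2m1` is illustrative and not part of any NVIDIA API.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign, 2 exponent, 1 mantissa bit, bias 1."""
    if not 0 <= code <= 0b1111:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        # Subnormal: no implicit leading one, so the only nonzero value is 0.5.
        magnitude = mantissa * 0.5
    else:
        # Normal: implicit leading one, value = (1 + m/2) * 2^(e - bias).
        magnitude = (1.0 + mantissa * 0.5) * 2.0 ** (exponent - 1)
    return sign * magnitude


# The eight non-negative codes cover the whole positive range of the format.
positive_values = [decode_e2m1(code) for code in range(8)]
print(positive_values)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

With only eight magnitudes and a maximum of 6.0, the format depends on a separate scale factor to cover the dynamic range of real weights and activations, which is the role the microscaling formats play.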

The fifth-generation NVLink at 1.8 TB/s per GPU is twice the H100 rate. Combined with a refreshed NVSwitch ASIC, NVL72-class racks scale to 72 GPUs at full bisection — a step change in pod-level fabric headroom over Hopper.
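The fabric figures above reduce to simple arithmetic. A rough sketch, using only the per-GPU rate and the 72-GPU domain size from the text; the 900 GB/s H100 figure follows from the "twice the H100 rate" statement, and the aggregate is a naive sum of per-GPU bandwidth, not a measured bisection number:

```python
NVLINK5_GB_PER_S = 1_800   # 1.8 TB/s per GPU (fifth-generation NVLink)
NVLINK4_GB_PER_S = 900     # H100 rate, i.e. half of the figure above
GPUS_PER_NVL72 = 72

# Generation-over-generation speed-up per GPU.
speedup = NVLINK5_GB_PER_S / NVLINK4_GB_PER_S

# Naive aggregate NVLink bandwidth across one 72-GPU domain.
aggregate_gb_per_s = NVLINK5_GB_PER_S * GPUS_PER_NVL72

print(speedup)                     # 2.0
print(aggregate_gb_per_s / 1_000)  # 129.6 (TB/s)
```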

When B100 Makes Sense

Brownfield upgrades of HGX-H100 racks where rack power and cooling are already provisioned at 700 W per GPU.
Workloads that benefit from FP4 inference or the dual-die memory pool but are not throughput-limited at the per-GPU level.
Hybrid clusters mixing Hopper and Blackwell where matching the H100 thermal envelope simplifies operations.
If your facility supports liquid cooling and 1,000+ W per GPU, B200 is the better choice on almost every axis.
If you need the maximum density 'Grace + Blackwell' shared-memory super-pod, GB200 NVL72 is the only option.
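The brownfield argument comes down to a per-node power check. The sketch below assumes an 8-GPU HGX baseboard and counts GPU TDP only (CPUs, NICs, fans and power-supply losses are ignored); the helper names are hypothetical.

```python
GPUS_PER_NODE = 8  # assumption: 8-GPU HGX baseboard


def node_gpu_power_w(tdp_w: int, gpus: int = GPUS_PER_NODE) -> int:
    """Total GPU power for one node, counting GPU TDP only."""
    return tdp_w * gpus


def fits_existing_budget(tdp_w: int, provisioned_w: int) -> bool:
    """True if a node of GPUs at tdp_w stays within the provisioned GPU power."""
    return node_gpu_power_w(tdp_w) <= provisioned_w


# Budget sized for a node of 700 W GPUs (H100/H200 class).
provisioned = node_gpu_power_w(700)

print(fits_existing_budget(700, provisioned))    # True: B100 fits as a drop-in
print(fits_existing_budget(1_000, provisioned))  # False: B200 exceeds the budget
print(node_gpu_power_w(1_000) - provisioned)     # 2400 W extra per node for B200
```

Under these assumptions a B200 node needs 2.4 kW more GPU power than the rack was provisioned for, which is the gap that pushes such sites toward B100.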

Pitfalls

Treating B100 as 'just a slower B200' understates the inference cost gap. Compared against H200 at equal cost, B100 usually delivers a smaller win than the headline FP4 numbers imply.
FP4 weight quantisation requires careful per-tensor or per-channel scaling; naively casting BF16 weights to FP4 silently regresses accuracy on most production models.
Software stack maturity for Blackwell lagged Hopper through 2024-2025; check kernel coverage in vLLM, TensorRT-LLM and SGLang for your specific model before committing.
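The FP4 scaling pitfall can be illustrated in a few lines of pure Python. This is a simplified sketch, not NVIDIA's quantisation recipe: it uses plain absmax scaling over one group of weights (standing in for a channel), rounds to the E2M1 value grid, and compares the reconstruction error against a naive cast. The weight values and all function names are made up for illustration.

```python
# Magnitudes representable in FP4 E2M1.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_e2m1(x: float) -> float:
    """Round to the nearest E2M1 magnitude, saturating at 6.0."""
    sign = -1.0 if x < 0 else 1.0
    nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) - g))
    return sign * nearest


def naive_cast(weights):
    """Cast each weight straight to the FP4 grid with no scaling."""
    return [round_to_e2m1(w) for w in weights]


def scaled_quantise(weights):
    """Absmax scaling: map the largest magnitude in the group onto 6.0."""
    absmax = max(abs(w) for w in weights)
    if absmax == 0.0:
        return [0.0 for _ in weights]
    scale = absmax / 6.0
    # Quantise in the scaled domain, then dequantise back for comparison.
    return [round_to_e2m1(w / scale) * scale for w in weights]


def mean_abs_error(original, reconstructed):
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)


# Hypothetical weights with small magnitudes, as is typical for trained layers.
weights = [0.02, -0.11, 0.07, 0.31, -0.26, 0.18]

naive_err = mean_abs_error(weights, naive_cast(weights))
scaled_err = mean_abs_error(weights, scaled_quantise(weights))
print(f"naive cast error:  {naive_err:.4f}")
print(f"absmax scaled error: {scaled_err:.4f}")
```

Without scaling, most of these weights collapse to 0 or ±0.5 because the grid has no values between them; with a per-group scale the same 4-bit codes spread across the actual weight range. Production schemes use finer-grained block scales, so treat this only as a picture of why the scale factor matters.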

Software Notes

CUDA 12.8+ and driver R570+ are the minimum for Blackwell data-centre GPUs; earlier toolkits cannot target compute capability 10.0. TensorRT-LLM, vLLM, SGLang and Triton all gained Blackwell support through 2024-2025, and Megatron-Core added FP4 training support in early 2026. Most Hopper-tuned kernels recompile cleanly but do not exploit FP4 or the new MX formats without explicit changes.
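Because recompiled Hopper kernels do not pick up FP4 automatically, mixed Hopper and Blackwell fleets typically need an explicit precision choice per device. A minimal sketch of such a dispatch, keyed on CUDA compute capability (9.0 for Hopper, 10.0 for B100/B200-class parts, 8.9 for Ada); the function name and the precision labels are hypothetical and do not correspond to any framework's API.

```python
from typing import Tuple


def pick_precision(compute_capability: Tuple[int, int]) -> str:
    """Choose the lowest-precision Tensor Core path a device can run."""
    major, minor = compute_capability
    if major >= 10:
        # Blackwell-class devices: FP4 Tensor Cores are available.
        return "fp4"
    if (major, minor) >= (8, 9):
        # Ada (8.9) and Hopper (9.0): FP8 is the lowest supported format.
        return "fp8"
    # Older architectures: fall back to 16-bit.
    return "bf16"


print(pick_precision((10, 0)))  # fp4  (B100 / B200)
print(pick_precision((9, 0)))   # fp8  (H100 / H200)
print(pick_precision((8, 0)))   # bf16 (A100)
```

In a PyTorch environment the tuple could come from `torch.cuda.get_device_capability()`; whether a given model actually has FP4 kernels in the serving stack still has to be checked separately.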
