NVIDIA B300 Tensor Core GPU

TL;DR

Mid-cycle Blackwell Ultra SKU announced at GTC 2024 and shipping from late 2025, lifting HBM3e to 288 GB and FP4 throughput by roughly 1.5×.
Same dual-die Blackwell package as B200 with refreshed memory stacks (12-high HBM3e) and revised power binning.
Targeted at reasoning-model inference where long chains-of-thought consume enormous KV caches and FP4 throughput dictates cost.
Drop-in compatible with HGX-B200 baseboards in most OEM designs, easing the upgrade path from B200-era clusters.

Overview#

B300 — marketed as 'Blackwell Ultra' — is the mid-cycle Blackwell refresh, much as H200 was for Hopper. The compute architecture is the same dual-die Blackwell silicon as B200, but the HBM3e stacks step up to 12-high, raising per-GPU capacity to 288 GB. NVIDIA also rebinned the part for higher sustained FP4 throughput.

Positioning is squarely at reasoning-model inference and long-context workloads. The 50 % memory uplift over B200 directly translates into longer contexts, larger batches, and the ability to host frontier-MoE inference replicas on fewer GPUs.

Specifications vs B200#

Metric	B300	B200
Architecture	Blackwell Ultra (dual-die)	Blackwell (dual-die)
Memory	288 GB HBM3e (12-high)	192 GB HBM3e (8-high)
Memory bandwidth	~8 TB/s	8 TB/s
FP4 (Tensor, sparse)	~27,000 TFLOPS	18,000 TFLOPS
NVLink	1.8 TB/s (5.0)	1.8 TB/s (5.0)
TDP	~1,400 W	1,000 W
Form factor	SXM / NVL	SXM / NVL

Public B300 specifications were still being finalised at launch; figures here reflect NVIDIA's GTC 2024 announcements and may be revised. Memory capacity (288 GB) and the general 1.5× FP4 uplift are the load-bearing claims.

Why a Mid-Cycle Refresh#

The B300 refresh exists for the same reason H200 did: HBM density and bandwidth were the binding constraint on inference economics, not raw compute. Through 2024-2025, SK hynix and Micron 12-high HBM3e stacks reached production; B300 is the first SKU to package them at scale.

Reasoning models — models that emit long internal chains-of-thought before producing a user-visible answer — amplify the memory pressure. A frontier reasoning model serving 100K-token contexts at batch 64 can consume hundreds of gigabytes of KV cache per replica; the 288 GB-per-GPU envelope is the practical difference between one replica and four.

When to Pick B300#

Inference of reasoning models with long internal chains-of-thought, where KV-cache pressure dominates.
Frontier MoE inference where activated-parameter memory plus expert KV state exceeds the B200 envelope.
Single-GPU replicas of 200B+ parameter dense models where avoiding tensor parallelism is worth the price premium.
Pick B200 when supply or cost dominate and your KV-cache budget fits in 192 GB.
Pick GB300 NVL72 when the unit of work is a full pod and the larger Grace coupling helps.

Operational Notes#

Higher TDP than B200 (estimated ~1,400 W) — rack power budget should be revisited even on existing liquid-cooled designs.
Drop-in compatibility with HGX-B200 baseboards is OEM-specific; confirm with your platform vendor before assuming a clean swap.
Software stack is identical to B200 — CUDA 12.4+, driver R550+, TensorRT-LLM, vLLM Blackwell backend.
HBM3e 12-high supply remains the primary availability constraint through 2026.

Software Ecosystem#

B300 inherits the B200 software stack without changes. Kernels tuned for Blackwell SMs run identically; the only consideration is making sure inference servers (vLLM, TensorRT-LLM, SGLang) are configured to use the full 288 GB rather than defaulting to 192 GB budgets carried over from B200 deployments.

References

NVIDIA Blackwell Ultra Announcement · NVIDIA
GTC 2024 Keynote — Blackwell Ultra · NVIDIA

Overview#

Specifications vs B200#

Metric	B300	B200
Architecture	Blackwell Ultra (dual-die)	Blackwell (dual-die)
Memory	288 GB HBM3e (12-high)	192 GB HBM3e (8-high)
Memory bandwidth	~8 TB/s	8 TB/s
FP4 (Tensor, sparse)	~27,000 TFLOPS	18,000 TFLOPS
NVLink	1.8 TB/s (5.0)	1.8 TB/s (5.0)
TDP	~1,400 W	1,000 W
Form factor	SXM / NVL	SXM / NVL

Why a Mid-Cycle Refresh#

When to Pick B300#

Inference of reasoning models with long internal chains-of-thought, where KV-cache pressure dominates.

Frontier MoE inference where activated-parameter memory plus expert KV state exceeds the B200 envelope.

Single-GPU replicas of 200B+ parameter dense models where avoiding tensor parallelism is worth the price premium.

Pick B200 when supply or cost dominate and your KV-cache budget fits in 192 GB.

Pick GB300 NVL72 when the unit of work is a full pod and the larger Grace coupling helps.

Operational Notes#

Higher TDP than B200 (estimated ~1,400 W) — rack power budget should be revisited even on existing liquid-cooled designs.

Drop-in compatibility with HGX-B200 baseboards is OEM-specific; confirm with your platform vendor before assuming a clean swap.

Software stack is identical to B200 — CUDA 12.4+, driver R550+, TensorRT-LLM, vLLM Blackwell backend.

HBM3e 12-high supply remains the primary availability constraint through 2026.

Software Ecosystem#

NVIDIA B300 Tensor Core GPU

Overview#

Specifications vs B200#

Why a Mid-Cycle Refresh#

When to Pick B300#

Operational Notes#

Software Ecosystem#

References

Browse all entries

Deploy on Yobitel

NVIDIA B300 Tensor Core GPU

Overview#

Specifications vs B200#

Why a Mid-Cycle Refresh#

When to Pick B300#

Operational Notes#

Software Ecosystem#

References

Browse all entries

Deploy on Yobitel