TL;DR
- Mid-cycle Blackwell Ultra SKU announced at GTC 2025 and shipping from late 2025, lifting HBM3e capacity to 288 GB and FP4 throughput by roughly 1.5× over B200.
- Same dual-die Blackwell package as B200 with refreshed memory stacks (12-high HBM3e) and revised power binning.
- Targeted at reasoning-model inference where long chains-of-thought consume enormous KV caches and FP4 throughput dictates cost.
- Intended as a drop-in upgrade for HGX-B200 baseboards in many OEM designs, easing the path from B200-era clusters; compatibility is OEM-specific and should be confirmed per platform.
Overview
B300 — marketed as 'Blackwell Ultra' — is the mid-cycle Blackwell refresh, much as H200 was for Hopper. The compute architecture is the same dual-die Blackwell silicon as B200, but the HBM3e stacks step up to 12-high, raising per-GPU capacity to 288 GB. NVIDIA also rebinned the part for higher sustained FP4 throughput.
B300 is positioned squarely at reasoning-model inference and long-context workloads. The 50% memory uplift over B200 (288 GB versus 192 GB) translates directly into longer contexts, larger batches, and the ability to host frontier-MoE inference replicas on fewer GPUs.
Specifications vs B200
| Metric | B300 | B200 |
|---|---|---|
| Architecture | Blackwell Ultra (dual-die) | Blackwell (dual-die) |
| Memory | 288 GB HBM3e (12-high) | 192 GB HBM3e (8-high) |
| Memory bandwidth | ~8 TB/s | 8 TB/s |
| FP4 (Tensor, sparse) | ~27,000 TFLOPS | 18,000 TFLOPS |
| NVLink | 1.8 TB/s (5.0) | 1.8 TB/s (5.0) |
| TDP | ~1,400 W | 1,000 W |
| Form factor | SXM / NVL | SXM / NVL |
Public B300 specifications were still being finalised at launch; figures here reflect NVIDIA's GTC 2025 announcements and may be revised. Memory capacity (288 GB) and the general 1.5× FP4 uplift are the load-bearing claims.
Why a Mid-Cycle Refresh
The B300 refresh exists for the same reason H200 did: HBM density and bandwidth, not raw compute, were the binding constraints on inference economics. Through 2024 and 2025, 12-high HBM3e stacks from SK hynix and Micron reached production; B300 is the first SKU to package them at scale.
Reasoning models, which emit long internal chains-of-thought before producing a user-visible answer, amplify the memory pressure. A frontier reasoning model serving 100K-token contexts at batch 64 can consume hundreds of gigabytes of KV cache per replica. At that scale, per-GPU capacity decides how many GPUs each replica must span, and therefore how many replicas a fixed pool of GPUs can host.
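To make the KV-cache pressure concrete, here is a minimal sizing sketch. It assumes standard multi-head or grouped-query attention, where each layer stores one key and one value vector per token. The model shape (48 layers, 8 KV heads, head dimension 128, 1-byte cache elements) is hypothetical and chosen only for illustration, and the GPU counts cover the cache alone, ignoring weights and activations.

```python
import math

GB = 10**9  # decimal gigabytes, matching how HBM capacity is quoted


def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_tokens, batch_size, bytes_per_element):
    """KV-cache size for standard multi-head or grouped-query attention."""
    # Factor of 2: each layer stores one key and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens * batch_size


def gpus_for_cache(cache_bytes, hbm_gb):
    """Minimum GPUs whose combined HBM holds the cache (weights ignored)."""
    return math.ceil(cache_bytes / (hbm_gb * GB))


# Hypothetical model shape, chosen for illustration only.
cache = kv_cache_bytes(num_layers=48, num_kv_heads=8, head_dim=128,
                       context_tokens=100_000, batch_size=64,
                       bytes_per_element=1)  # 1 byte per element, e.g. FP8

print(f"KV cache: {cache / GB:.1f} GB")                 # KV cache: 629.1 GB
print("GPUs at 192 GB:", gpus_for_cache(cache, 192))    # 4
print("GPUs at 288 GB:", gpus_for_cache(cache, 288))    # 3
```

Under these assumed numbers the cache alone is about 629 GB, so the larger HBM envelope saves one GPU per replica before weights are even counted. Real deployments should substitute the actual model shape and cache precision.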
When to Pick B300
- Inference of reasoning models with long internal chains-of-thought, where KV-cache pressure dominates.
- Frontier MoE inference where the resident expert weights plus KV cache exceed the B200 envelope.
- Single-GPU replicas of 200B+ parameter dense models where avoiding tensor parallelism is worth the price premium.
- Pick B200 when supply or cost dominate and your KV-cache budget fits in 192 GB.
- Pick GB300 NVL72 when the unit of work is a full rack-scale pod and the Grace CPU coupling and 72-GPU NVLink domain help.
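The single-GPU criterion above can be checked with simple arithmetic. The following sketch estimates the weight footprint of a dense model at a given precision and tests whether it fits on one GPU. The 20% reserve for KV cache and runtime overhead is an assumption for illustration, not a vendor figure.

```python
# Bytes per parameter at common inference precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_gb(params_billion, precision):
    """Weight footprint in decimal GB: 1e9 params * bytes/param / 1e9."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits_single_gpu(params_billion, precision, hbm_gb, reserve_fraction=0.2):
    """True if weights fit while leaving reserve_fraction of HBM free.

    reserve_fraction is an assumed allowance for KV cache and runtime
    overhead; tune it for the actual workload.
    """
    usable_gb = hbm_gb * (1.0 - reserve_fraction)
    return weight_gb(params_billion, precision) <= usable_gb


# A 200B-parameter dense model on B200 (192 GB) versus B300 (288 GB).
for precision in ("fp16", "fp8", "fp4"):
    print(precision,
          "B200:", fits_single_gpu(200, precision, 192),
          "B300:", fits_single_gpu(200, precision, 288))
```

With these assumptions, a 200B dense model at FP8 (200 GB of weights) fits on one B300 but not on one B200, while at FP4 it fits on either and at FP16 on neither.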
Operational Notes
- Higher TDP than B200 (an estimated 1,400 W versus 1,000 W): revisit the rack power budget even on existing liquid-cooled designs.
- Drop-in compatibility with HGX-B200 baseboards is OEM-specific; confirm with your platform vendor before assuming a clean swap.
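- Software stack is the same as for B200: a Blackwell-capable CUDA toolkit (Blackwell support first shipped in CUDA 12.8 with R570-series drivers, and B300's own compute target may need a newer release), TensorRT-LLM, and vLLM with Blackwell support. Confirm exact minimum versions against NVIDIA's release notes.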
- HBM3e 12-high supply remains the primary availability constraint through 2026.
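As a rough illustration of the power note above, the sketch below compares GPU-only rack power for the two TDP figures in the specification table. The rack layout (8 GPUs per node, 4 nodes per rack) is a hypothetical example, and the totals exclude CPUs, NICs, switches, and cooling overhead.

```python
def rack_gpu_power_w(gpus_per_node, nodes_per_rack, gpu_tdp_w):
    """GPU-only power draw for one rack, in watts, at full TDP."""
    return gpus_per_node * nodes_per_rack * gpu_tdp_w


# Hypothetical rack layout: 4 nodes of 8 GPUs each.
GPUS_PER_NODE = 8
NODES_PER_RACK = 4

b200_w = rack_gpu_power_w(GPUS_PER_NODE, NODES_PER_RACK, 1000)
b300_w = rack_gpu_power_w(GPUS_PER_NODE, NODES_PER_RACK, 1400)  # estimated TDP

print("B200 rack GPU power:", b200_w, "W")      # 32000 W
print("B300 rack GPU power:", b300_w, "W")      # 44800 W
print("Additional draw:", b300_w - b200_w, "W")  # 12800 W
```

Under these assumptions the same rack needs about 12.8 kW more for GPUs alone, which is why a swap that fits mechanically may still exceed the existing power and cooling envelope.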
Software Ecosystem
B300 inherits the B200 software stack without changes. Kernels tuned for Blackwell SMs run identically; the only consideration is making sure inference servers (vLLM, TensorRT-LLM, SGLang) are configured to use the full 288 GB rather than defaulting to 192 GB budgets carried over from B200 deployments.
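The cost of a stale memory budget can be sketched as follows. This is a minimal illustration, not any server's actual configuration logic: the function name, the 0.9 utilization fraction, the 100 GB weight footprint, and the per-token KV size are all assumptions. Servers that size memory as a fraction of detected HBM (vLLM's `gpu_memory_utilization` option works this way) pick up the extra capacity automatically, so the risk lies mainly in absolute budgets hard-coded for 192 GB.

```python
GB = 10**9  # decimal gigabytes


def kv_token_capacity(total_hbm_gb, weights_gb, kv_bytes_per_token,
                      utilization=0.9):
    """Tokens of KV cache that fit after weights are loaded.

    utilization is the assumed fraction of HBM the server may claim.
    """
    budget_bytes = (total_hbm_gb * utilization - weights_gb) * GB
    if budget_bytes < 0:
        raise ValueError("weights do not fit in the configured budget")
    return int(budget_bytes // kv_bytes_per_token)


# Hypothetical workload: 100 GB of weights, 98,304 bytes of KV per token.
WEIGHTS_GB = 100
KV_BYTES_PER_TOKEN = 98_304

stale = kv_token_capacity(192, WEIGHTS_GB, KV_BYTES_PER_TOKEN)  # B200-era budget
full = kv_token_capacity(288, WEIGHTS_GB, KV_BYTES_PER_TOKEN)   # full B300 HBM

print("Tokens with 192 GB budget:", stale)
print("Tokens with 288 GB budget:", full)
print(f"Capacity ratio: {full / stale:.2f}x")
```

Because the weights are a fixed cost, the KV capacity grows by more than the 1.5× raw memory ratio in this example, so leaving the old budget in place forfeits a disproportionate share of the upgrade.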
References
- NVIDIA Blackwell Ultra Announcement · NVIDIA
- GTC 2025 Keynote: Blackwell Ultra · NVIDIA