TL;DR
- Launched in 2024 on a 5 nm process with a refreshed MME design; positioned against H100 on training and inference economics.
- 128 GB HBM2e at 3.7 TB/s; 1.8 PFLOPS BF16, 1.8 PFLOPS FP8 (matrix engine throughput).
- 24 integrated 200 GbE RoCE ports — direct successor to Gaudi 2's networking story.
- Production deployments at IBM Cloud and on-prem partners; Intel's roadmap for a Gaudi successor is unclear post-Falcon Shores consolidation.
Overview
Gaudi 3 is Intel's most recent Habana-lineage accelerator, launched in 2024 with the headline claim of H100-class performance at a meaningful price discount. The architectural pattern is unchanged from Gaudi 2 (TPCs plus MMEs, SynapseAI compiler, integrated RoCE networking), but the silicon moves from TSMC 7 nm to TSMC 5 nm and memory grows from 96 GB to 128 GB HBM2e.
The competitive position is reasonable: published benchmarks show Gaudi 3 trading wins with H100 on transformer training and Llama-class inference. The software gap remains the main barrier to adoption; teams not already invested in SynapseAI face a non-trivial onboarding cost.
Specifications
| Metric | Gaudi 3 |
|---|---|
| Architecture | Habana custom (refreshed TPC + MME) |
| Process | TSMC 5 nm |
| BF16 (MME) | 1,835 TFLOPS |
| FP8 (MME) | 1,835 TFLOPS |
| Memory | 128 GB HBM2e |
| Memory bandwidth | 3.7 TB/s |
| TDP | 900 W |
| Integrated networking | 24× 200 GbE RoCE |
| Form factor | OAM 2.0 |
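The peak-compute and memory-bandwidth rows above imply a roofline "ridge point": the arithmetic intensity (FLOPs per byte moved from HBM) below which a kernel is bandwidth-bound rather than compute-bound. The sketch below only divides the two table figures; the function names are illustrative, and real kernels will sit below these peaks.

```python
# Roofline ridge point from the spec table (peak figures, not measured ones).
PEAK_MME_FLOPS = 1835e12   # BF16 / FP8 MME peak, FLOP/s (from the table)
HBM_BANDWIDTH = 3.7e12     # bytes/s (from the table)


def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """Arithmetic intensity (FLOPs per byte) where compute and bandwidth limits meet."""
    return peak_flops / mem_bw


def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_MME_FLOPS,
                     mem_bw: float = HBM_BANDWIDTH) -> float:
    """Classic roofline bound: min(peak compute, intensity * bandwidth)."""
    return min(peak_flops, intensity * mem_bw)


if __name__ == "__main__":
    rp = ridge_point(PEAK_MME_FLOPS, HBM_BANDWIDTH)
    print(f"ridge point: {rp:.0f} FLOPs/byte")
    for intensity in (1, 64, 512, 2048):
        bound = attainable_flops(intensity) / 1e12
        print(f"intensity {intensity:>5} FLOPs/byte -> at most {bound:.0f} TFLOPS")
```

A kernel needs roughly 500 FLOPs per byte of HBM traffic before the MME peak becomes the limiting factor, which is why small-batch decode tends to be bandwidth-bound on this class of device.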
Architecture and Networking
Gaudi 3 roughly doubles FP8 MME throughput per chip versus Gaudi 2 (and roughly quadruples BF16, per Intel's figures) and refreshes the TPC ISA. The split between TPCs (general compute) and MMEs (dense matmul) remains the central programming abstraction. SynapseAI handles the scheduling and graph lowering.
The integrated 200 GbE RoCE story is the operational highlight. Each card exposes 24 × 200 Gb/s = 4.8 Tb/s of accelerator-attached Ethernet without separate NICs; in a standard 8-card server most of those ports carry card-to-card traffic inside the chassis and the rest are brought out for scale-out. For sovereign or budget-sensitive builds where InfiniBand procurement is awkward, this remains genuinely useful.
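The port arithmetic behind the 4.8 Tb/s figure can be sketched as follows. The port count and speed come from the spec table; the split of 3 scale-out ports per card (with the remaining 21 used for links to the other cards in the chassis) is an assumption based on the commonly described 8-card reference baseboard and is exposed as a parameter so it can be changed.

```python
# Ethernet bandwidth arithmetic for a Gaudi 3 card and an 8-card server.
PORTS_PER_CARD = 24            # from the spec table
PORT_GBPS = 200                # from the spec table
CARDS_PER_SERVER = 8           # standard server size mentioned in the text
SCALE_OUT_PORTS_PER_CARD = 3   # ASSUMPTION: reference-baseboard style split


def card_total_gbps(ports: int = PORTS_PER_CARD, port_gbps: int = PORT_GBPS) -> int:
    """Aggregate one-direction Ethernet bandwidth of a single card."""
    return ports * port_gbps


def intra_node_gbps_per_card(scale_out_ports: int = SCALE_OUT_PORTS_PER_CARD) -> int:
    """Bandwidth left for card-to-card links inside the server."""
    return (PORTS_PER_CARD - scale_out_ports) * PORT_GBPS


def server_scale_out_gbps(cards: int = CARDS_PER_SERVER,
                          scale_out_ports: int = SCALE_OUT_PORTS_PER_CARD) -> int:
    """Bandwidth leaving the server toward the cluster fabric."""
    return cards * scale_out_ports * PORT_GBPS


if __name__ == "__main__":
    print("per card total      :", card_total_gbps() / 1000, "Tb/s")
    print("per card intra-node :", intra_node_gbps_per_card() / 1000, "Tb/s")
    print("server scale-out    :", server_scale_out_gbps() / 1000, "Tb/s")
```

Under that assumed split, both the per-card aggregate and the per-server scale-out bandwidth come out at 4.8 Tb/s, so it is worth checking which of the two a given datasheet or quote is referring to.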
When to Pick Gaudi 3
- Cost-sensitive training of 7B-70B models where SynapseAI tooling is acceptable.
- Clusters where integrated Ethernet fabric simplifies the scale-out story.
- Sovereign and supply-diversified deployments seeking a credible non-NVIDIA / non-AMD path.
- Pick H100 / H200 if CUDA ecosystem reach is required.
- Pick MI300X / MI325X for larger HBM pools per device.
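The 7B-70B sizing guidance above can be sanity-checked with a back-of-envelope memory estimate. This is a rough sketch, not a capacity planner: it assumes decimal gigabytes, about 16 bytes per parameter for mixed-precision Adam training (a common rule of thumb covering weights, gradients and optimizer state), 90% of HBM being usable, and it ignores activations and KV cache entirely. All names and the usable fraction are illustrative.

```python
import math

HBM_GB = 128                 # per card, from the spec table
USABLE_FRACTION = 0.9        # ASSUMPTION: headroom for runtime and fragmentation

BYTES_PER_PARAM = {
    "fp8_inference": 1,
    "bf16_inference": 2,
    "adam_mixed_precision_training": 16,   # ASSUMPTION: rule-of-thumb figure
}


def state_gb(params_billion: float, bytes_per_param: float) -> float:
    """Model state size in decimal GB (1e9 params * bytes/param = GB)."""
    return params_billion * bytes_per_param


def cards_needed(params_billion: float, bytes_per_param: float,
                 hbm_gb: float = HBM_GB, usable: float = USABLE_FRACTION) -> int:
    """Minimum card count so the model state fits in usable HBM."""
    return math.ceil(state_gb(params_billion, bytes_per_param) / (hbm_gb * usable))


if __name__ == "__main__":
    for size in (7, 13, 70):
        for mode, bpp in BYTES_PER_PARAM.items():
            print(f"{size:>3}B {mode:<30} {state_gb(size, bpp):>6.0f} GB "
                  f"-> {cards_needed(size, bpp)} card(s)")
```

Under these assumptions a 70B model fits on one card for inference only in FP8, while BF16 inference or any full training run needs the weights and optimizer state sharded across several cards.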
Pitfalls
- Software ecosystem is narrower; many post-2024 LLM optimisations land on CUDA first.
- Roadmap uncertainty — Intel's Falcon Shores plans were repeatedly revised through 2024-2025.
- Compiler-first workflow can produce surprising performance cliffs.
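- HBM2e (not HBM3 or HBM3e) caps memory bandwidth at 3.7 TB/s, which limits decode-bound inference throughput relative to H200 (4.8 TB/s); H100 SXM, at 3.35 TB/s, is slightly behind Gaudi 3 on this metric.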
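The decode-bound point can be made concrete with a simple upper bound: at batch size 1, every generated token has to stream the full set of weights from HBM once, so tokens per second cannot exceed bandwidth divided by weight size. The sketch below uses the 3.7 TB/s figure from the spec table; the 4.8 TB/s comparison value for H200 is an assumption taken from public datasheets rather than from this article, and KV-cache traffic and kernel overheads are ignored.

```python
def decode_tokens_per_s_upper_bound(mem_bw_tb_s: float, weight_gb: float) -> float:
    """Bandwidth-only ceiling on single-sequence decode speed.

    mem_bw_tb_s: HBM bandwidth in TB/s (decimal).
    weight_gb:   bytes of weights read per token, in GB (decimal).
    """
    return mem_bw_tb_s * 1000.0 / weight_gb


if __name__ == "__main__":
    weights_gb = 70.0  # e.g. a 70B-parameter model stored in FP8 (1 byte/param)
    devices = {
        "Gaudi 3 (3.7 TB/s, from the spec table)": 3.7,
        "H200 (4.8 TB/s, assumed from datasheet)": 4.8,
    }
    for name, bw in devices.items():
        bound = decode_tokens_per_s_upper_bound(bw, weights_gb)
        print(f"{name}: at most {bound:.1f} tokens/s per sequence")
```

Batching raises aggregate throughput because the same weight read is shared across sequences, but the per-sequence ceiling scales directly with memory bandwidth.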
Software Notes
SynapseAI 1.x and Habana PyTorch remain the production paths. Optimum-Habana provides ready-made recipes for Llama, Mistral, Mixtral and other common models. vLLM has a maintained Habana backend; TensorRT-LLM remains NVIDIA-specific, and SGLang is developed primarily for NVIDIA and AMD GPUs.
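As a minimal sketch of what the Habana PyTorch path looks like in practice, the snippet below selects the `hpu` device when the `habana_frameworks` package is installed and falls back to CPU otherwise. The helper names (`gaudi_bridge_available`, `pick_device`, `tiny_matmul`) are illustrative, not part of any library; `habana_frameworks.torch.core` and `mark_step()` are the bridge's lazy-mode entry points as commonly documented, and whether `mark_step()` is needed depends on the execution mode in use.

```python
import importlib.util


def gaudi_bridge_available() -> bool:
    """True when the Gaudi PyTorch bridge package can be located."""
    return importlib.util.find_spec("habana_frameworks") is not None


def pick_device() -> str:
    """Prefer the Gaudi device name, fall back to CPU."""
    return "hpu" if gaudi_bridge_available() else "cpu"


def tiny_matmul(device: str):
    """Run a small matmul on the chosen device and return the result on CPU."""
    import torch
    htcore = None
    if device == "hpu":
        # Importing the bridge registers the "hpu" device with PyTorch.
        import habana_frameworks.torch.core as htcore
    x = torch.randn(4, 8, device=device)
    w = torch.randn(8, 2, device=device)
    y = x @ w
    if htcore is not None:
        htcore.mark_step()  # in lazy mode, triggers execution of the queued graph
    return y.cpu()


if __name__ == "__main__":
    device = pick_device()
    print("selected device:", device)
    if importlib.util.find_spec("torch") is not None:
        print("result shape:", tuple(tiny_matmul(device).shape))
```

Keeping the device choice in one helper makes it easier to run the same script on a workstation and on a Gaudi node while evaluating the onboarding cost.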