TL;DR
- Intel/Habana training accelerator launched May 2022 — 96 GB HBM2e, 24 integrated 100 GbE ports.
- Distinctive feature: per-card RoCE networking eliminates the need for separate InfiniBand cards in clusters.
- Peak BF16 throughput of ~432 TFLOPS; roughly comparable to A100 80 GB on price/performance in 2023-2024 benchmarks.
- Largely displaced in new buys by Gaudi 3 and NVIDIA H200; remains available on Intel Tiber AI Cloud. (AWS DL1 instances run first-generation Gaudi, not Gaudi 2.)
Overview
Gaudi 2 is the second generation of Habana Labs' training accelerator family (Habana was acquired by Intel in 2019). Launched in May 2022, it pairs a custom TPC + Matrix Multiplication Engine architecture with 96 GB of HBM2e and — uniquely — 24 integrated 100 GbE RoCE ports. The on-chip networking is the architectural differentiator: a Gaudi 2 server needs no separate NICs to participate in a scale-out fabric.
Commercially, Gaudi 2's main public cloud deployment was Intel's Tiber AI Cloud (formerly Intel Developer Cloud); AWS DL1 instances, often mentioned alongside it, run first-generation Gaudi rather than Gaudi 2. The hardware was technically competitive with A100 on price-performance for transformer training, but the limited reach of its software ecosystem held back adoption.
Specifications
| Metric | Gaudi 2 |
|---|---|
| Architecture | Habana custom (TPC + MME) |
| Process | TSMC 7 nm |
| BF16 (MME) | ~432 TFLOPS |
| FP8 | Supported (E4M3 / E5M2) |
| Memory | 96 GB HBM2e |
| Memory bandwidth | 2.45 TB/s |
| TDP | 600 W |
| Integrated networking | 24× 100 GbE RoCE |
| Form factor | OAM |
Gaudi 2's integrated RoCE is genuinely distinctive: a 64-card cluster can scale out without InfiniBand HCAs, simplifying procurement and BoM.
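The headline figures in the table can be combined into two rough planning numbers: the aggregate Ethernet line rate per card, and the arithmetic intensity at which a kernel stops being limited by memory bandwidth. The sketch below is back-of-the-envelope arithmetic on the peak values listed above, not measured performance; the constant and function names are illustrative.

```python
# Peak figures copied from the specification table above.
BF16_TFLOPS = 432.0         # peak BF16 matmul throughput, TFLOPS
HBM_BANDWIDTH_TBPS = 2.45   # peak memory bandwidth, TB/s
PORTS = 24                  # integrated RoCE ports per card
PORT_GBITS = 100.0          # line rate per port, Gbit/s


def aggregate_network_gbytes_per_s(ports: int, gbits_per_port: float) -> float:
    """Line-rate bandwidth summed over all ports, per direction, in GB/s."""
    return ports * gbits_per_port / 8.0


def ridge_point_flops_per_byte(tflops: float, tbytes_per_s: float) -> float:
    """Roofline ridge point: FLOPs per byte of HBM traffic needed to be compute-bound."""
    return tflops / tbytes_per_s


if __name__ == "__main__":
    net = aggregate_network_gbytes_per_s(PORTS, PORT_GBITS)
    ridge = ridge_point_flops_per_byte(BF16_TFLOPS, HBM_BANDWIDTH_TBPS)
    print(f"Aggregate RoCE line rate: {net:.0f} GB/s per direction")
    print(f"BF16 ridge point: {ridge:.0f} FLOP/byte")
```

On these peak numbers, a kernel needs on the order of 176 BF16 FLOPs per byte read from HBM before the matrix engines, rather than memory bandwidth, become the limit. Real kernels will land below both peaks.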
Architecture Notes
Gaudi 2 uses two compute primitives. The Tensor Processor Cores (TPCs) — 24 of them — handle vector and elementwise operations. The Matrix Multiplication Engines (MMEs) handle dense matmul. The split is conceptually similar to NVIDIA's CUDA cores + Tensor Cores but with different scheduling primitives and a custom ISA.
Programming targets Habana's SynapseAI graph compiler and Tensor Processor Core compiler. PyTorch integration goes through Habana's plugin, which lowers PyTorch graphs onto SynapseAI; the developer experience is closer to a JIT compiler than a runtime device API.
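To make the "JIT compiler rather than runtime API" comparison concrete, the toy model below records operations instead of executing them, then compiles and caches the recorded graph when a step boundary is reached. It is a conceptual sketch in plain Python, not SynapseAI or the Habana PyTorch plugin: `LazyDevice`, `record`, `flush`, and `run_step` are hypothetical names, and "compilation" is simulated by a counter.

```python
from typing import Callable, Dict, List, Tuple

Fn = Callable[[float], float]
GraphKey = Tuple[Tuple[str, ...], int]


class LazyDevice:
    """Toy graph-mode device: ops are queued, then compiled once per unique graph."""

    def __init__(self) -> None:
        self.pending: List[Tuple[str, Fn]] = []
        self.cache: Dict[GraphKey, List[Fn]] = {}
        self.compile_count = 0

    def record(self, name: str, fn: Fn) -> None:
        """Queue an op; nothing executes yet."""
        self.pending.append((name, fn))

    def flush(self, inputs: List[float]) -> List[float]:
        """Step boundary: compile the queued graph (or reuse a cached one), then run it."""
        # The key includes the input length, standing in for tensor shape.
        key: GraphKey = (tuple(name for name, _ in self.pending), len(inputs))
        if key not in self.cache:
            self.compile_count += 1  # simulated, expensive graph compilation
            self.cache[key] = [fn for _, fn in self.pending]
        self.pending = []
        out = list(inputs)
        for fn in self.cache[key]:
            out = [fn(x) for x in out]
        return out


def run_step(dev: LazyDevice, batch: List[float]) -> List[float]:
    """One 'training step': two elementwise ops followed by a step boundary."""
    dev.record("scale", lambda x: x * 2.0)
    dev.record("shift", lambda x: x + 1.0)
    return dev.flush(batch)


if __name__ == "__main__":
    dev = LazyDevice()
    run_step(dev, [1.0, 2.0])       # first graph: compiles
    run_step(dev, [3.0, 4.0])       # same ops, same shape: cache hit
    run_step(dev, [5.0, 6.0, 7.0])  # new shape: compiles again
    print(dev.compile_count)        # 2 compilations for 3 steps
```

In the toy, changing the input length forces a second compilation even though the operations are identical. Graph-compiled stacks in general can behave this way when shapes vary between steps, which is one reason compile cost is worth monitoring separately from step time.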
When Gaudi 2 Made Sense
- Cost-sensitive training where A100-class performance at a discount mattered.
- Clusters that benefit from integrated networking and BoM simplification.
- Workloads already developed against SynapseAI / Habana PyTorch.
- As of 2026, new purchases are better served by Gaudi 3, or by NVIDIA/AMD where ecosystem matters more.
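The cost-sensitivity argument in the list above comes down to cost per unit of training work rather than raw throughput. A minimal sketch of that comparison follows; the function name is illustrative, and the throughput and hourly prices in the usage example are placeholder inputs, not measured or quoted figures.

```python
def cost_per_million_tokens(tokens_per_second: float, price_per_hour: float) -> float:
    """Cost to process one million training tokens on one accelerator."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600.0
    return price_per_hour / tokens_per_hour * 1_000_000.0


if __name__ == "__main__":
    # Placeholder inputs for illustration only; substitute your own benchmarks and quotes.
    candidate = cost_per_million_tokens(tokens_per_second=2500.0, price_per_hour=2.0)
    baseline = cost_per_million_tokens(tokens_per_second=2500.0, price_per_hour=3.0)
    print(f"candidate: ${candidate:.3f} per 1M tokens")
    print(f"baseline:  ${baseline:.3f} per 1M tokens")
```

With equal throughput, the cheaper hourly rate wins outright; with unequal throughput, the same function shows how large a discount is needed to offset a slower part.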
Pitfalls
- The software ecosystem is narrower than CUDA's: many specialised libraries (FlashAttention, custom Triton kernels) require Habana-specific equivalents.
- Multi-vendor portability: code targeting Gaudi 2 needs significant adaptation to run on CUDA or ROCm.
- Compiler-driven workflow can hide performance cliffs that are obvious in a runtime-API model.
- Long-term Intel commitment to Gaudi has been publicly questioned through 2024-2025.
Software Ecosystem
SynapseAI + Habana PyTorch is the production path. Hugging Face Optimum-Habana offers turnkey training and inference recipes for popular models. vLLM gained limited Gaudi support in 2024; TensorRT-LLM is NVIDIA-only, and SGLang is developed mainly for NVIDIA and AMD GPUs.