Intel Gaudi 2 Accelerator

TL;DR

Intel/Habana training accelerator launched May 2022 — 96 GB HBM2e, 24 integrated 100 GbE ports.
Distinctive feature: per-card RoCE networking eliminates the need for separate InfiniBand cards in clusters.
Peak BF16 throughput is ~432 TFLOPS; 2023-2024 benchmarks put it on par with the A100 80 GB on a price/performance basis.
Largely displaced in new buys by Gaudi 3 and NVIDIA H200; remains in production at Intel Tiber AI Cloud (AWS DL1 instances run first-generation Gaudi, not Gaudi 2).

Overview

Gaudi 2 is the second generation of Habana Labs' training accelerator family (Habana was acquired by Intel in 2019). Launched in May 2022, it pairs a custom TPC + Matrix Multiplication Engine architecture with 96 GB of HBM2e and — uniquely — 24 integrated 100 GbE RoCE ports. The on-chip networking is the architectural differentiator: a Gaudi 2 server needs no separate NICs to participate in a scale-out fabric.

Commercially, Gaudi 2's main public cloud deployment was Intel Tiber AI Cloud (formerly Intel Developer Cloud); AWS DL1 instances are built on the first-generation Gaudi rather than Gaudi 2. The hardware was competitive with the A100 on price-performance for transformer training, but the limited reach of its software ecosystem held back adoption.

Specifications

Metric	Gaudi 2
Architecture	Habana custom (TPC + MME)
Process	TSMC 7 nm
BF16 (MME)	~432 TFLOPS
FP8	Supported (E4M3 / E5M2)
Memory	96 GB HBM2e
Memory bandwidth	2.45 TB/s
TDP	600 W
Integrated networking	24× 100 GbE RoCE
Form factor	OAM
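
One way to read the compute and bandwidth rows together is a roofline estimate: dividing peak BF16 throughput by memory bandwidth gives the arithmetic intensity above which a kernel is compute-bound rather than memory-bound. The sketch below is a back-of-envelope model that uses only those two figures from the table; the function names are illustrative, and the matmul traffic model (each operand read once, the result written once) is an idealised assumption that ignores tiling and on-chip caching.

```python
# Figures taken from the specification table.
PEAK_BF16_FLOPS = 432e12     # ~432 TFLOPS on the MME, BF16
HBM_BANDWIDTH_BPS = 2.45e12  # 2.45 TB/s, expressed in bytes per second


def ridge_point() -> float:
    """Arithmetic intensity (FLOP/byte) where compute and memory limits meet."""
    return PEAK_BF16_FLOPS / HBM_BANDWIDTH_BPS


def attainable_tflops(flops_per_byte: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * intensity), in TFLOPS."""
    return min(PEAK_BF16_FLOPS, HBM_BANDWIDTH_BPS * flops_per_byte) / 1e12


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOP/byte of an (m x k) @ (k x n) matmul under the idealised traffic model."""
    flops = 2 * m * n * k
    traffic_bytes = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic_bytes


if __name__ == "__main__":
    print(f"ridge point: {ridge_point():.0f} FLOP/byte")
    # A large square matmul sits far above the ridge point (compute-bound).
    print(attainable_tflops(matmul_intensity(4096, 4096, 4096)))
    # A batch-1 matrix-vector product sits far below it (memory-bound).
    print(attainable_tflops(matmul_intensity(1, 4096, 4096)))
```

Under these assumptions the ridge point is roughly 176 FLOP/byte, so large training matmuls can approach peak while batch-1 matrix-vector work is limited by HBM bandwidth to a few TFLOPS.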

Because RoCE is integrated on every card, a 64-card cluster can scale out over standard Ethernet switches without InfiniBand HCAs, which simplifies procurement and the bill of materials (BoM).
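
The port arithmetic behind the 64-card example can be sketched as follows. The per-card figures come from the table; the server layout (8 cards per server, 21 ports per card wired to peers inside the server, 3 left for scale-out) is an assumption based on the common reference server design rather than something this entry states, and the helper names are illustrative.

```python
PORTS_PER_CARD = 24  # from the specification table
PORT_GBPS = 100      # 100 GbE per port, per direction

# Assumptions, not stated in this entry: the server layout below.
CARDS_PER_SERVER = 8
SCALE_OUT_PORTS_PER_CARD = 3  # the remaining 21 ports stay inside the server


def card_bandwidth_gbytes() -> float:
    """Aggregate Ethernet bandwidth of one card, in GB/s per direction."""
    return PORTS_PER_CARD * PORT_GBPS / 8


def cluster_summary(cards: int) -> dict:
    """Server count and scale-out capacity for a cluster of `cards` accelerators."""
    if cards <= 0 or cards % CARDS_PER_SERVER != 0:
        raise ValueError("cards must be a positive multiple of CARDS_PER_SERVER")
    return {
        "servers": cards // CARDS_PER_SERVER,
        "scale_out_ports": cards * SCALE_OUT_PORTS_PER_CARD,
        "scale_out_gbps_per_server": (
            CARDS_PER_SERVER * SCALE_OUT_PORTS_PER_CARD * PORT_GBPS
        ),
    }


if __name__ == "__main__":
    print(card_bandwidth_gbytes())  # 300.0 GB/s per card, per direction
    print(cluster_summary(64))
```

With these assumptions, 64 cards means 8 servers and 192 Ethernet switch ports for the scale-out fabric, with no HCA line items in the BoM.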

Architecture Notes

Gaudi 2 uses two compute primitives. The Tensor Processor Cores (TPCs) — 24 of them — handle vector and elementwise operations. The Matrix Multiplication Engines (MMEs) handle dense matmul. The split is conceptually similar to NVIDIA's CUDA cores + Tensor Cores but with different scheduling primitives and a custom ISA.

Programs target Habana's SynapseAI software stack: a graph compiler for whole models and a Tensor Processor Core (TPC) compiler for custom kernels. PyTorch integration goes through Habana's PyTorch bridge (the habana_frameworks.torch package), which lowers PyTorch graphs onto SynapseAI; the developer experience is closer to a JIT compiler than to a runtime device API.
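
A minimal sketch of what this looks like from PyTorch, assuming Habana's PyTorch integration is installed and running in lazy mode. `pick_device` and `train_step` are hypothetical helper names and the toy linear model is only a placeholder; `habana_frameworks.torch.core` and its `mark_step()` call come from Habana's package. On a machine without that package the code falls back to CPU.

```python
import importlib.util


def pick_device() -> str:
    """Return "hpu" when the Habana PyTorch package is installed, else "cpu"."""
    # find_spec on a top-level package returns None instead of raising.
    has_bridge = importlib.util.find_spec("habana_frameworks") is not None
    return "hpu" if has_bridge else "cpu"


def train_step(device: str) -> float:
    """Run one SGD step on a toy linear model and return the loss value."""
    import torch  # imported lazily so pick_device() works without PyTorch

    htcore = None
    if device == "hpu":
        # Importing the bridge registers the "hpu" device with PyTorch.
        import habana_frameworks.torch.core as htcore

    model = torch.nn.Linear(16, 4).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(8, 16, device=device)
    y = torch.randn(8, 4, device=device)

    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    if htcore is not None:
        # Lazy mode (assumed): ops are only recorded until mark_step()
        # hands the accumulated graph to the compiler for execution.
        htcore.mark_step()
    opt.step()
    if htcore is not None:
        htcore.mark_step()
    return loss.item()


if __name__ == "__main__":
    print(train_step(pick_device()))
```

The explicit `mark_step()` calls are where the JIT-like behaviour shows up: work is queued as a graph and compiled at those boundaries instead of being dispatched op by op.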

When Gaudi 2 Made Sense

Cost-sensitive training where A100-class performance at a discount mattered.
Clusters that benefit from integrated networking and BoM simplification.
Workloads already developed against SynapseAI / Habana PyTorch.
As of 2026, prefer Gaudi 3 for new purchases, or NVIDIA/AMD where ecosystem breadth matters more.

Pitfalls

Software ecosystem narrower than CUDA — many specialised libraries (Flash Attention, custom Triton kernels) require Habana-specific equivalents.
Multi-vendor portability: code targeting Gaudi 2 needs significant adaptation to run on CUDA or ROCm.
Compiler-driven workflow can hide performance cliffs that are obvious in a runtime-API model.
Long-term Intel commitment to Gaudi has been publicly questioned through 2024-2025.
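
One example of a hidden performance cliff in a compiler-driven workflow is input shape: a graph compiler typically has to recompile when it sees a tensor shape it has not compiled before, so variable-length batches can trigger repeated compilation. A common mitigation is to pad inputs up to a small set of fixed lengths. The sketch below illustrates that idea in plain Python; the bucket sizes and helper names are hypothetical and not taken from Habana's tooling.

```python
from bisect import bisect_left

# Hypothetical bucket sizes; real values depend on the workload.
BUCKETS = (128, 256, 512, 1024, 2048)


def bucket_length(seq_len: int, buckets: tuple = BUCKETS) -> int:
    """Return the smallest bucket that can hold a sequence of seq_len tokens."""
    if seq_len < 1:
        raise ValueError("seq_len must be positive")
    i = bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError("sequence is longer than the largest bucket")
    return buckets[i]


def pad_to_bucket(token_ids: list, pad_id: int = 0,
                  buckets: tuple = BUCKETS) -> list:
    """Right-pad token_ids with pad_id up to its bucket length."""
    target = bucket_length(len(token_ids), buckets)
    return token_ids + [pad_id] * (target - len(token_ids))


if __name__ == "__main__":
    padded = pad_to_bucket([7] * 300)
    print(len(padded))  # 512
    # Every length from 1 to 2048 maps onto one of five shapes.
    print(len({bucket_length(n) for n in range(1, 2049)}))  # 5
```

The trade-off is some wasted compute on padding tokens in exchange for a bounded number of compiled graphs.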

Software Ecosystem

SynapseAI + Habana PyTorch is the production path. Hugging Face Optimum-Habana offers turnkey training and inference recipes for popular models. vLLM gained limited Gaudi support in 2024; TensorRT-LLM is NVIDIA-only, and SGLang has focused on NVIDIA and AMD GPUs.
