TL;DR
- Second-generation AWS inference chip, launched in 2023: 32 GB HBM and ~190 TFLOPS BF16 per chip.
- Inf2 instances scale to 12 chips with 384 GB total HBM and high-bandwidth NeuronLink between chips.
- Targets LLM inference at 7B-70B scale on AWS where cost per token dominates.
- Shares the NeuronCore v2 architecture with Trainium 1, which simplifies software portability between the two.
Overview
Inferentia 2 brings HBM and modern NeuronCore v2 silicon to AWS inference. Launched in 2023 and rolled out across EC2 inf2 instance sizes, it gives AWS customers a credible non-GPU path for LLM-class inference workloads.
Architecturally it shares NeuronCore v2 with Trainium 1, so code that runs on one usually runs on the other with modest changes. The two differ mainly in memory configuration and instance shape.
Specifications
| Metric | Inferentia 2 (per chip) |
|---|---|
| NeuronCores | 2 (NeuronCore v2) |
| BF16 | ~190 TFLOPS |
| FP8 | Supported |
| Memory | 32 GB HBM |
| Memory bandwidth | 820 GB/s |
| Inter-chip link | NeuronLink v2 |
| Inf2 instance (whole instance, not per chip) | 1, 6, or 12 chips; up to 384 GB HBM |
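
The per-chip figures above support a quick back-of-envelope sizing check. The sketch below is not a benchmark. It assumes weights dominate memory, reserves a hypothetical 30% of HBM for KV cache and activations (the `headroom` parameter is an assumption, not an AWS figure), and treats batch-1 decode as purely bandwidth-bound, so real throughput will be lower than the ceiling it prints.

```python
# Back-of-envelope sizing from the per-chip figures in the table above.
HBM_PER_CHIP_GB = 32             # from the specifications table
BANDWIDTH_PER_CHIP_GB_S = 820    # from the specifications table
INF2_CHIP_COUNTS = (1, 6, 12)    # chip counts offered by Inf2 instance sizes


def weight_footprint_gb(params_billion, bytes_per_param=2):
    """Weight memory in GB; 2 bytes per parameter corresponds to BF16."""
    return params_billion * bytes_per_param


def min_chips(params_billion, bytes_per_param=2, headroom=0.7):
    """Smallest Inf2 chip count whose usable HBM holds the weights.

    `headroom` is the assumed fraction of HBM available for weights;
    the rest is left for KV cache and activations. Returns None if
    even a 12-chip instance is too small.
    """
    need = weight_footprint_gb(params_billion, bytes_per_param)
    for chips in INF2_CHIP_COUNTS:
        if need <= chips * HBM_PER_CHIP_GB * headroom:
            return chips
    return None


def decode_tokens_per_s_ceiling(params_billion, chips, bytes_per_param=2):
    """Upper bound for batch-1 decode if every token reads all weights once."""
    total_bandwidth = chips * BANDWIDTH_PER_CHIP_GB_S
    return total_bandwidth / weight_footprint_gb(params_billion, bytes_per_param)


if __name__ == "__main__":
    for size in (7, 13, 70):
        chips = min_chips(size)
        ceiling = decode_tokens_per_s_ceiling(size, chips)
        print(f"{size}B BF16: {chips} chip(s), <= {ceiling:.0f} tokens/s at batch 1")
```

Under these assumptions a 7B BF16 model fits on a single chip, while a 70B BF16 model (140 GB of weights) needs the 12-chip instance.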
When to Pick Inferentia 2
- LLM inference on AWS where the Neuron SDK is already integrated and cost per token matters.
- 7B-70B inference where the 384 GB of aggregate HBM on a 12-chip instance is sufficient.
- Workloads that benefit from AWS-native integration with Bedrock and SageMaker.
- Pick L40S / H100 if CUDA ecosystem reach (vLLM features, custom kernels) is required.
- Consider Inferentia 3 / Trainium 2 as those newer chips become available.
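
Since several of the criteria above come down to cost per token, a small comparison helper can make the trade-off concrete. This is a minimal sketch: the option names, hourly prices and throughputs in the example are hypothetical placeholders, not AWS pricing or measured results, and should be replaced with your own on-demand rates and benchmark numbers.

```python
def cost_per_million_tokens(hourly_price_usd, tokens_per_second):
    """USD per one million generated tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


def cheapest_option(options):
    """Return (name, cost) of the lowest cost-per-million-token option.

    `options` maps a label to (hourly_price_usd, tokens_per_second).
    """
    costs = {
        name: cost_per_million_tokens(price, tps)
        for name, (price, tps) in options.items()
    }
    best = min(costs, key=costs.get)
    return best, costs[best]


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    candidates = {
        "accelerator_a": (12.0, 1500.0),
        "gpu_b": (30.0, 3000.0),
    }
    name, cost = cheapest_option(candidates)
    print(f"{name}: ${cost:.2f} per 1M tokens")
```

The comparison only holds if both throughput figures are measured at the same latency target, batch size and sequence length.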
Pitfalls
- Neuron SDK operator coverage lags PyTorch+CUDA for cutting-edge model architectures.
- Long-context attention paths can underperform GPU equivalents.
- Available only on AWS, with no on-premises or multi-cloud option.
- Calibration and quantisation tooling is less mature than NVIDIA TensorRT-LLM.
Software Notes
The software stack is the AWS Neuron SDK, with NxD Inference, Optimum-Neuron and PyTorch/XLA as the main entry points. Hugging Face provides Inferentia 2 recipes for common models. SageMaker and Bedrock support Inf2-backed endpoints natively.
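
As an illustration of the Optimum-Neuron path, the sketch below compiles a causal language model ahead of time for a given number of chips. It is a hedged example, not a reference recipe: the keyword names (`export`, `batch_size`, `sequence_length`, `num_cores`, `auto_cast_type`) follow Optimum-Neuron examples published around the Inf2 launch and are an assumption here, since later releases may use different names. The helper functions `build_export_kwargs` and `export_for_inf2` are hypothetical, and the export step only runs on a machine with the Neuron SDK installed.

```python
NEURON_CORES_PER_CHIP = 2  # from the specifications table


def build_export_kwargs(num_chips, batch_size=1, sequence_length=2048):
    """Assemble export arguments for a static-shape Neuron compilation.

    Neuron compiles for fixed shapes, so batch size and sequence length
    are chosen up front. Argument names are assumed from older
    Optimum-Neuron examples; verify against the installed version.
    """
    if num_chips < 1:
        raise ValueError("num_chips must be at least 1")
    return {
        "export": True,
        "batch_size": batch_size,
        "sequence_length": sequence_length,
        "num_cores": num_chips * NEURON_CORES_PER_CHIP,
        "auto_cast_type": "bf16",
    }


def export_for_inf2(model_id, output_dir, num_chips=1):
    """Compile `model_id` for Inf2 and save the artifacts to `output_dir`."""
    # Imported lazily so the helper above works without the Neuron SDK.
    from optimum.neuron import NeuronModelForCausalLM

    model = NeuronModelForCausalLM.from_pretrained(
        model_id, **build_export_kwargs(num_chips)
    )
    model.save_pretrained(output_dir)
    return model


if __name__ == "__main__":
    # Prints the arguments only; no compilation happens here.
    print(build_export_kwargs(num_chips=6))
```

Keeping the argument assembly separate from the export call makes it easy to inspect the compile configuration on a development machine before launching it on an Inf2 instance.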