AMD Instinct MI300X Accelerator

TL;DR

CDNA 3-based data centre GPU announced December 2023 (Advancing AI event), volume shipping Q1 2024. Eight XCD compute dies (304 Compute Units total) plus four IOD memory dies on a single 3D-stacked CoWoS package, with eight HBM3 stacks totalling 192 GB at 5.3 TB/s — the largest single-package HBM pool of its generation.
Native FP8 support via the Matrix Core engine: 1,307 TFLOPS BF16/FP16, 2,615 TFLOPS FP8 and 2,615 TOPS INT8 dense, each doubling with 2:4 structured sparsity. FLOPS sit between H100 SXM5 and H100 PCIe Gen5 in practice; serving wins come from the 192 GB pool, not raw throughput.
OAM form factor at 750 W TDP. The 8-GPU baseboard uses AMD Infinity Fabric (xGMI) at 896 GB/s aggregate per GPU — not equivalent to NVLink Switch System but enough for 8-GPU non-blocking topologies; multi-node fabrics use 400 Gb/s Ethernet or InfiniBand NDR.
Software stack: ROCm 6.x (6.3+ recommended), PyTorch ROCm backend, vLLM with first-class MI300X paths, SGLang MI300X engines added 2025, hipBLASLt + AITER (AMD's TensorRT-LLM equivalent), RCCL as the NCCL replacement. TensorRT-LLM does NOT run on AMD; CUDA kernels are not portable.
Pricing through 2026: roughly $4.00/GPU-hr on-demand, $3.00 one-year reserved, $2.45 three-year reserved. Materially cheaper than H100 SXM5 at the same commitment tier and the multi-vendor signal that anchors enterprise procurement strategy.

Overview

MI300X is the GPU AMD bet the data centre on, and the part that put ROCm into hyperscaler production. Announced at AMD's Advancing AI event in December 2023 and shipping in volume from Q1 2024, it packages eight CDNA 3 compute chiplets (XCDs, 38 Compute Units each, 304 CUs total) and four memory I/O dies (IODs) on a single 3D-stacked CoWoS package, with eight HBM3 stacks around the perimeter totalling 192 GB at 5.3 TB/s. No 2024-era NVIDIA part — H100, H200 (141 GB) or B100 (180 GB) — matched that per-GPU memory ceiling.

The pitch is straightforward and the procurement reality matches it. A dense 70B-class LLM in BF16 fits on a single MI300X, and larger models fit once quantised to FP8, where the same model would need tensor parallelism across two to four H100s. That simplifies serving topology, removes an entire class of NCCL collective overhead and reduces tail latency. AMD priced the part as a credible second source for large-model inference, and through 2024-2025 it shipped in volume to Microsoft (Azure ND MI300X), Meta (production Llama serving), Oracle (OCI BM.GPU.MI300X.8) and a long tail of Tier-1/Tier-2 neoclouds. By 2026, MI300X is the established multi-vendor anchor in serious enterprise GPU procurement: the part that lets a CIO say 'not single-vendor' without buying a science project.

The honest trade-off is software. ROCm has closed most of the CUDA gap through 2024-2026 — PyTorch, vLLM, SGLang, Hugging Face Transformers, Flash Attention 2/3, paged-KV attention and FP8 GEMM all run with parity or near-parity to CUDA — but specific paths (TensorRT-LLM, certain custom Triton kernels, some bleeding-edge quantisation schemes) remain CUDA-exclusive. The MI300X case is strongest for inference of widely-supported open-weight models on ROCm-compatible serving stacks; it weakens for teams whose production path is tied to TensorRT-LLM engines or NVIDIA's TAO / ModelOpt toolchain.

This entry is the reference for teams operating MI300X alongside or instead of NVIDIA Hopper/Blackwell: full spec sheet, the sizing tables we use on InferenceBench, the ROCm software-stack baseline, the operational issues that are MI300X-specific (driver pinning, RCCL replacing NCCL, xGMI topology vs NVLink-fabric expectations), and where the part fits in the Yobitel multi-vendor compute strategy. Yobitel NeoCloud offers MI300X OAM capacity in UK and EU regions with NCSC OFFICIAL alignment as the deliberate multi-vendor anchor alongside the NVIDIA H-series fleet, and Yobibyte schedules ROCm-compatible serving workloads onto MI300X pools whenever residency and software stack permit. This entry helps you decide when MI300X is the right pick for your workload and how to size and price it on Yobitel NeoCloud or your own cluster.

How it works: CDNA 3 chiplet architecture and the Matrix Core engine

MI300X breaks the GPU into chiplets. Eight XCD dies (each containing 38 CDNA 3 Compute Units, 304 CUs in total) sit on top of four IO dies (IODs) that provide the HBM3 memory controllers, the 256 MB AMD Infinity Cache and the inter-chiplet Infinity Fabric. The XCDs are 3D-stacked on the IODs, and the IODs share a CoWoS silicon interposer with eight HBM3 stacks (24 GB each, 192 GB total) around the perimeter. The chiplet approach is what gave AMD time-to-market in 2023-2024: smaller dies yield better than monolithic 800 mm² parts, and the same XCDs are reused in MI300A (the APU variant pairing GPU chiplets with Zen 4 CPU chiplets) without redesigning the silicon.

Each CU contains 64 stream processors organised across SIMD units, four AMD Matrix Cores (the equivalent of NVIDIA Tensor cores), 64 KB of LDS (Local Data Share, the AMD equivalent of CUDA shared memory) and dedicated wavefront schedulers. Matrix Cores execute mixed-precision dot products natively for FP64, FP32, TF32, BF16, FP16, FP8 (E4M3 and E5M2), INT8 and INT4. The FP8 types on CDNA 3 are the FNUZ variants, not the OCP encodings NVIDIA uses on Hopper, so FP8 checkpoints and scales need conversion when moving between vendors. The Matrix Core engine on CDNA 3 supports 2:4 structured sparsity for inference workloads, with the same throughput-doubling pattern as Tensor cores.

The 256 MB Infinity Cache on the IODs is the secret sauce for memory-bound serving. It absorbs reused KV-cache reads and weight-tensor tiles, dramatically reducing HBM bandwidth pressure on long-context decode. Combined with the 5.3 TB/s HBM3 ceiling, MI300X frequently sustains higher effective bandwidth on real LLM serving than the headline number suggests — the 'serving win' on memory-bound shapes is wider than a pure bandwidth comparison would predict.

The trade-off is software complexity at chiplet boundaries. Crossing XCD-to-XCD boundaries adds latency and bandwidth constraints that the runtime must be aware of; ROCm libraries were tuned through 2024-2025 to hide most of this, but specific kernel patterns (small-batch, irregular memory access, cross-XCD reductions) can still hit chiplet-edge bottlenecks. The MI300X case is strongest on regular, batchable workloads (LLM serving with paged KV, batched training) and weakest on small-batch latency-critical inference.

Eight XCDs (CDNA 3 compute dies) + four IODs (memory/IO dies) on a 3D-stacked CoWoS package.
304 Compute Units total (38 CUs per XCD); 1,216 Matrix Cores (4 per CU); 19,456 stream processors.
Eight HBM3 stacks (24 GB each) totalling 192 GB at 5.3 TB/s.
256 MB AMD Infinity Cache on the IODs; bandwidth amplifier for memory-bound serving.
Native FP8 E4M3 / E5M2, BF16, FP16, TF32, FP64; 2:4 structured sparsity supported.
xGMI (Infinity Fabric over PCIe): 128 GB/s per link, 7 links per GPU on an OAM baseboard = 896 GB/s aggregate.
OAM form factor at 750 W TDP; air-cooled chassis viable up to 8 GPUs per node.
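
The headline totals in the list above follow directly from the per-unit figures. A minimal sketch of that arithmetic in Python, using only numbers quoted in this entry (the constant names are illustrative and do not come from any AMD tool):

```python
# Per-unit figures as quoted in this entry; names are illustrative only.
XCDS = 8                  # compute dies per package
CUS_PER_XCD = 38          # Compute Units per XCD
MATRIX_CORES_PER_CU = 4
STREAM_PROCS_PER_CU = 64
HBM_STACKS = 8
GB_PER_STACK = 24
XGMI_LINKS = 7            # peer links per GPU on an 8-GPU OAM baseboard
GBPS_PER_LINK = 128       # GB/s per xGMI link

compute_units = XCDS * CUS_PER_XCD
matrix_cores = compute_units * MATRIX_CORES_PER_CU
stream_processors = compute_units * STREAM_PROCS_PER_CU
hbm_gb = HBM_STACKS * GB_PER_STACK
xgmi_aggregate_gbps = XGMI_LINKS * GBPS_PER_LINK

print(compute_units, matrix_cores, stream_processors, hbm_gb, xgmi_aggregate_gbps)
# prints: 304 1216 19456 192 896
```

Keeping the per-unit figures as the source of truth makes it easy to sanity-check a spec sheet: if a quoted total does not equal the product of its parts, one of the two is wrong.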

Subsystem	CDNA 3 detail	Practical consequence
Matrix Core (CDNA 3)	FP8 E4M3/E5M2, BF16, FP16, TF32, INT8, INT4	Inference parity with H100 Tensor cores on supported precisions.
Infinity Cache (256 MB on IODs)	Coherent L3-equivalent; KV-cache and weight reuse	Bandwidth amplifier; long-context decode sustains higher effective bandwidth than HBM3 raw.
Chiplet design	8 XCDs + 4 IODs 3D-stacked on CoWoS	Yield headroom; smaller dies; ships more transistors per dollar.
xGMI fabric	896 GB/s aggregate per GPU on 8-GPU OAM baseboard	Comparable to NVLink 4.0 inside one node; NOT equivalent to NVLink Switch System across nodes.
HBM3 stack count	8 stacks x 24 GB = 192 GB at 5.3 TB/s	Largest single-package HBM pool of 2024-2025; 70B-class BF16 serving on 1 card.

Reference: full specification sheet

Authoritative figures for the MI300X OAM SKU at standard 750 W TDP. AMD's 'Matrix' throughput numbers are roughly comparable to NVIDIA's 'Tensor' numbers but use different sparsity conventions — confirm whether published figures are dense or sparse before comparing.

Metric	MI300X (OAM)
Architecture	CDNA 3
Process	TSMC 5 nm (XCDs) + 6 nm (IODs)
Transistors	153 billion
Chiplets	8 XCDs + 4 IODs + 8 HBM3 stacks on CoWoS
Compute Units	304 (38 per XCD)
Matrix Cores	1,216 (4 per CU)
Stream processors	19,456
Infinity Cache (L3-equivalent)	256 MB
LDS (per CU)	64 KB
FP64 (Matrix)	163.4 TFLOPS
FP64 (Vector)	81.7 TFLOPS
FP32 (Matrix)	163.4 TFLOPS
FP32 (Vector)	163.4 TFLOPS
TF32 (Matrix, dense)	653.7 TFLOPS
TF32 (Matrix, sparse)	1,307.4 TFLOPS
BF16 / FP16 (Matrix, dense)	1,307.4 TFLOPS
BF16 / FP16 (Matrix, sparse)	2,614.9 TFLOPS
FP8 (Matrix, dense)	2,614.9 TFLOPS
FP8 (Matrix, sparse)	5,229.8 TFLOPS
INT8 (Matrix, dense)	2,614.9 TOPS
INT8 (Matrix, sparse)	5,229.8 TOPS
Memory	192 GB HBM3 (8 stacks x 24 GB)
Memory bandwidth	5.3 TB/s
TDP	750 W
xGMI (Infinity Fabric)	896 GB/s aggregate (7 links x 128 GB/s)
PCIe	Gen5 x16 (128 GB/s)
Form factor	OAM 1.5
Cooling	Air-cooled OAM chassis up to 8 GPUs / node
Software baseline	ROCm 6.3+ recommended
Compute capability (gfx target)	gfx942
Multi-tenant partitioning	SR-IOV (single root I/O virtualisation), compute-partition modes

AMD publishes separate 'Matrix' (Matrix Core) and 'Vector' (stream-processor) figures, and quotes the Matrix figures both dense and with 2:4 structured sparsity. The like-for-like comparison with NVIDIA's 'Tensor' figures is Matrix-to-Tensor at the same sparsity setting. Quote dense Matrix figures in capacity plans (half the sparse number) and treat sparse as the marketing ceiling, the same convention as the H100 entry.

Workload pattern A: Llama 3.1 70B BF16 on a single MI300X

The pattern that defined MI300X's serving value proposition. Llama 3.1 70B in BF16 occupies roughly 140 GB of weight memory, which fits on one MI300X with 52 GB of headroom for KV cache, working activations and scratch buffers. On H100 SXM5 the same model is typically served at TP=4: two 80 GB cards hold the weights (160 GB) but leave almost nothing for KV cache. On MI300X it runs as a single-card replica, which removes NCCL/RCCL collectives and inter-GPU traffic from the serving path entirely. Sustained throughput of 1,200-1,500 output TPS is competitive with H100 TP=4 at materially lower cost.

bash

# 70B BF16 on 1x MI300X with vLLM + ROCm 6.3
HIP_VISIBLE_DEVICES=0 vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --max-model-len 131072 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --host 0.0.0.0 --port 8000

# Optional: pre-quantise to FP8 with AMD Quark for ~1.6x throughput
# (command name and flags as given in this entry; check the Quark docs for your installed release)
quark-quantize \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --output ./Llama-3.1-70B-Instruct-FP8-AMD \
  --quant-format fp8 --calib-dataset wikitext-2

# Serve the FP8 checkpoint. On ROCm, vLLM selects the FP8 KV cache with `fp8` (E4M3);
# `fp8_e5m2` is a CUDA-only option.
HIP_VISIBLE_DEVICES=0 vllm serve ./Llama-3.1-70B-Instruct-FP8-AMD \
  --tensor-parallel-size 1 \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 131072 --max-num-seqs 64 \
  --gpu-memory-utilization 0.92

Pattern A is the strongest case for MI300X. A single-card 70B serving replica eliminates inter-GPU collectives and reduces tail latency. At $4.00/GPU-hr on-demand (against $10.00/replica-hr for 4x H100 SXM5) and 1,200-1,500 output TPS, cost-per-token works out to roughly $0.74-$0.93 per million output tokens: about 40 % cheaper than H100 BF16, with the FP8 path via Quark needed to approach H100 FP8.
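
The memory arithmetic behind Pattern A can be sketched in a few lines. This is a rough upper bound, not a vLLM calculation: it assumes 1 GB = 1e9 bytes, uses the published Llama 3.1 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128), and ignores activations, scratch buffers and the `--gpu-memory-utilization` cap. The function names are illustrative.

```python
def weights_gb(params_billion, bytes_per_param):
    """Weight memory in GB, taking 1 GB = 1e9 bytes."""
    return params_billion * bytes_per_param


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """KV-cache bytes for one token: one K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


HBM_GB = 192
w = weights_gb(70, 2)                           # 70B parameters at BF16 (2 bytes each)
headroom_gb = HBM_GB - w                        # memory left after the weights
per_token = kv_bytes_per_token(80, 8, 128, 2)   # BF16 KV cache
max_tokens = int(headroom_gb * 1e9 // per_token)

print(f"weights {w} GB, headroom {headroom_gb} GB, "
      f"{per_token} KV bytes/token, at most {max_tokens} cached tokens")
```

The cached-token ceiling is shared across all concurrent sequences, which is why long `--max-model-len` values and high `--max-num-seqs` trade off against each other on a single card; switching the KV cache to FP8 halves `per_token` and doubles the ceiling.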

Workload pattern B: 8x MI300X training cluster

Multi-GPU training on an 8x MI300X OAM baseboard with RCCL (the ROCm replacement for NCCL) handling collectives. Same PyTorch / Megatron-LM patterns as on NVIDIA — the code changes are minimal — but with `HIP_VISIBLE_DEVICES` instead of `CUDA_VISIBLE_DEVICES`, RCCL instead of NCCL, and Flash Attention's ROCm port (`flash-attn` with the `ROCM=1` build flag). Training a 13B model on 100B tokens completes in roughly 6-9 days on 8x MI300X — between H100 SXM5 and H100 PCIe Gen5 in practice on FLOPS-bound runs.

RCCL is API-compatible with NCCL; PyTorch keeps the backend name 'nccl' for source compatibility.
Flash Attention 2 is fully ported to ROCm; Flash Attention 3 ports landed in ROCm 6.3 (production-acceptable from late 2025).
Megatron-Core supports ROCm with the same TP/PP/DP semantics as on NVIDIA; rebuild from source against ROCm 6.3.
MFU on training: 50-60 % typical on MI300X for 7B-13B BF16, slightly below H100's 60-65 % at iso-precision.

python

# train_13b.py: 13B BF16 training skeleton on 8x MI300X with PyTorch DDP + Hugging Face Transformers
# Deps: pip install "torch==2.4.0+rocm6.3" transformers datasets flash-attn

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

os.environ.setdefault("NCCL_DEBUG", "INFO")  # RCCL honours the NCCL_* variables
# RCCL is API-compatible with NCCL; PyTorch uses 'nccl' as the backend name on ROCm too.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)  # On ROCm, torch.cuda maps to HIP

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # ROCm port of FA2
).to(local_rank)
model = DDP(model, device_ids=[local_rank])

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
ds = load_dataset("EleutherAI/pile", split="train", streaming=True)
# ... standard PyTorch training loop with gradient accumulation ...

dist.destroy_process_group()  # clean shutdown once training finishes

# Launch with:
#   torchrun --nproc_per_node=8 --nnodes=1 train_13b.py

Workload pattern C: Mixtral 8x7B serving with expert parallelism

MoE serving on 2x MI300X with vLLM's expert-parallel scheduler. Mixtral 8x7B in BF16 (roughly 90 GB total weights) fits comfortably on one MI300X with KV-cache room to spare; the 2-GPU topology gives expert sharding for higher concurrent throughput at the cost of additional xGMI traffic on every routing decision.

bash

# Mixtral 8x7B BF16 on 2x MI300X with vLLM expert parallelism
# vLLM enables expert parallelism with --enable-expert-parallel (experts are sharded
# across the tensor-parallel ranks); it needs a vLLM release that ships the flag.
HIP_VISIBLE_DEVICES=0,1 \
NCCL_P2P_LEVEL=NVL \
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --host 0.0.0.0 --port 8000

Sizing and capacity planning

Sizing tables we use on InferenceBench for MI300X. All figures assume ROCm 6.3, vLLM 0.6+ with the ROCm backend, BF16 weights (or FP8 where flagged via AMD Quark) and xGMI-local placement. The headline against H100 SXM5: most memory-pressured rows collapse to fewer GPUs per replica thanks to the 192 GB pool; FLOPS-bound rows show MI300X sitting between H100 SXM5 and H100 PCIe Gen5.

Training rule of thumb: 1 trillion tokens x 70B parameters at BF16 needs roughly 350-450 MI300X-days on 64-GPU clusters — between H100 SXM5 (250-350) and H100 PCIe Gen5 (450-550).
Memory ceiling for a single MI300X: weights + KV cache + activations + hipBLASLt workspace must stay under roughly 188 GB usable; above that, expect OOMs.
For 500 RPS at 4K-token output, Llama 3.1 70B BF16 needs roughly 4-5 MI300X replicas vs 6-8 H100 SXM5 FP8 replicas — typical fleet compression ratio of 1.5x.
xGMI overhead at TP=8 inside one OAM baseboard: 10-14 % of step time for 70B BF16 — slightly higher than NVLink 4.0's 6-9 % on H100, reflecting the lower aggregate fabric bandwidth.
Spot/preemptible MI300X capacity is limited in 2026 — most providers reserve MI300X for committed customers.

Model size	Precision	Context	GPUs per replica	TP / EP	Approx output TPS	Approx VRAM headroom
7B (Mistral, Qwen)	BF16	8K	1x MI300X	1 / 1	4,500-5,800	175 GB free
7B (Mistral, Qwen)	FP8 (Quark)	8K	1x MI300X	1 / 1	7,000-8,500	180 GB free
13B	BF16	8K	1x MI300X	1 / 1	3,000-3,800	165 GB free
34B (Yi, Codestral)	BF16	8K	1x MI300X	1 / 1	1,500-1,900	120 GB free
70B (Llama 3.1)	BF16	8K	1x MI300X	1 / 1	1,200-1,500	50 GB free
70B (Llama 3.1)	FP8 (Quark)	32K	1x MI300X	1 / 1	1,800-2,200	100 GB free
70B (Llama 3.1)	FP8 (Quark)	128K	1x MI300X	1 / 1	1,900-2,400	60 GB free
140B MoE (Mixtral 8x22B)	BF16	32K	2x MI300X	2 / 2	1,400-1,800	60 GB free per rank
180B (Falcon, Bloom)	BF16	8K	4x MI300X	4 / 1	650-850	100 GB free per rank
405B (Llama 3.1)	BF16	32K	8x MI300X	8 / 1	300-400	60 GB free per rank
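
The "GPUs per replica" column can be approximated from weight size alone. The sketch below is a first-pass estimate, not the InferenceBench method: it takes the 188 GB usable ceiling quoted above, reserves a per-GPU budget for KV cache and activations (the 20 GB default is an assumption chosen for illustration), and rounds the GPU count up to a power of two because tensor-parallel degrees are normally 1, 2, 4 or 8. The function name is hypothetical.

```python
import math


def gpus_per_replica(params_billion, bytes_per_param,
                     usable_gb=188.0, reserve_gb_per_gpu=20.0):
    """Smallest power-of-two GPU count whose weight budget holds the model."""
    weights_gb = params_billion * bytes_per_param      # 1 GB = 1e9 bytes
    budget_per_gpu = usable_gb - reserve_gb_per_gpu    # room left for weights
    needed = max(1, math.ceil(weights_gb / budget_per_gpu))
    return 1 << (needed - 1).bit_length()              # round up to a power of two


for name, params, width in [("Llama 3.1 70B BF16", 70, 2),
                            ("Llama 3.1 70B FP8", 70, 1),
                            ("Mixtral 8x22B BF16", 141, 2)]:
    print(name, "->", gpus_per_replica(params, width), "x MI300X")
```

Raising `reserve_gb_per_gpu` models longer contexts or higher concurrency; the estimate only bounds the GPU count from below, so measured throughput still decides the final topology.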

Cost & TCO

MI300X pricing clears materially below H100 SXM5 at every commitment tier in 2026 — typically 15-25 % cheaper per GPU-hour on like-for-like hyperscaler / neocloud SKUs. The combination of 192 GB HBM3 (collapsing TP topologies) and lower hourly cost makes cost-per-token competitive with or better than H100 BF16 on memory-bound serving, and broadly competitive with H100 FP8 on FP8-quantised models via AMD Quark.

Cost-per-million-output-tokens on Llama 3.1 70B FP8 (Quark) at 32K context, 1x MI300X at $4.00/GPU-hr and 2,000 TPS sustained: roughly $0.55 per million tokens — competitive with H100 SXM5 FP8 ($0.50) and ~25 % cheaper than H200 FP8 ($0.74).
Commitment savings: 1y reserved ~= 25 % off on-demand, 3y reserved ~= 38 % off, tracking H100 discount tiers closely.
FP8 via AMD Quark yields ~1.5-1.6x throughput vs BF16 at iso-quality on most chat models — same lever as Transformer Engine FP8 on NVIDIA.
Egress and inter-region data movement frequently exceed 10 % of total MI300X bill at hyperscalers — collocate model artefacts with compute.
Multi-vendor procurement value: an MI300X commitment alongside an H100 commitment hedges single-vendor supply risk and gives FinOps leverage at renewal time.
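
The cost-per-token and commitment-discount figures in the list above come from simple arithmetic, sketched here so the numbers can be re-derived for other prices and throughputs. The function names are illustrative; the inputs are the $4.00/GPU-hr, 2,000 TPS and reserved rates quoted in this section.

```python
def cost_per_million_tokens(gpu_hour_usd, gpus, tokens_per_second):
    """USD per one million output tokens for a replica running flat out."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd * gpus / tokens_per_hour * 1_000_000


def discount(on_demand_usd, reserved_usd):
    """Fractional saving of a reserved rate against on-demand."""
    return 1 - reserved_usd / on_demand_usd


fp8_70b = cost_per_million_tokens(4.00, gpus=1, tokens_per_second=2000)
print(f"70B FP8 on 1x MI300X: ${fp8_70b:.2f} per million tokens")
print(f"1y reserved: {discount(4.00, 3.00):.0%} off")
print(f"3y reserved: {discount(4.00, 2.45):.1%} off")
```

The formula assumes sustained utilisation; a replica that idles half the time doubles its effective cost per token, which is why the commitment tier and the fleet's load profile need to be sized together.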

Provider class	SKU	On-demand $/GPU-hr	1y reserved	3y reserved	Notes
Azure	ND MI300X v5	$4.00	$3.00	$2.45	UK West, East US, Sweden Central regions.
Oracle Cloud	BM.GPU.MI300X.8	$3.95	$2.95	$2.40	AMD's launch partner; UK South region available.
Tier-1 neocloud (TensorWave, Crusoe)	MI300X OAM	$3.50	$2.75	$2.25	Frequently cheapest at scale; verify xGMI topology.
Tier-2 neocloud (Hot Aisle, Vultr, etc.)	MI300X OAM	$3.20	$2.55	$2.10	Best raw rate; expect more variance in fabric topology.
Spot/preemptible	MI300X OAM	$2.00-2.80 where available	n/a	n/a	Limited capacity; fine-tunes only.
Yobitel NeoCloud (UK + EU)	MI300X OAM	$3.40-3.70	$2.65-2.90	$2.20-2.40	NCSC OFFICIAL-aligned multi-vendor anchor; FOCUS-conformant billing.
Yobitel Omniscient Compute	MI300X multi-cloud	Market-clearing	Commit-discounted	Commit-discounted	Cross-provider arbitrage on top of NeoCloud + partner capacity.

All cost figures land on the FinOps Foundation FOCUS billing spec when consumed via Yobitel: ServiceName=`AcceleratorCompute`, ChargeCategory=`Usage`, SkuId=`gpu.mi300x.oam`. This is what makes cross-vendor (NVIDIA + AMD) cost attribution tractable.
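
As an illustration of why a shared schema makes cross-vendor attribution tractable, the sketch below aggregates a few FOCUS-style rows into an effective rate per SKU. It models only a handful of FOCUS columns; the sample quantities and costs are made up, and `gpu.h100.sxm5` is a hypothetical SKU id used for contrast (only `gpu.mi300x.oam` appears in this entry).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FocusRow:
    """A small subset of FOCUS columns; real exports carry many more."""
    ServiceName: str
    ChargeCategory: str
    SkuId: str
    ConsumedQuantity: float   # GPU-hours in this sketch
    BilledCost: float         # USD


def rate_by_sku(rows):
    """Return {SkuId: effective USD per GPU-hour}, counting Usage rows only."""
    cost = defaultdict(float)
    hours = defaultdict(float)
    for row in rows:
        if row.ChargeCategory != "Usage":
            continue                      # skip tax, credits, adjustments
        cost[row.SkuId] += row.BilledCost
        hours[row.SkuId] += row.ConsumedQuantity
    return {sku: cost[sku] / hours[sku] for sku in cost if hours[sku] > 0}


# Sample data is invented for illustration.
rows = [
    FocusRow("AcceleratorCompute", "Usage", "gpu.mi300x.oam", 100.0, 350.0),
    FocusRow("AcceleratorCompute", "Usage", "gpu.mi300x.oam", 50.0, 175.0),
    FocusRow("AcceleratorCompute", "Usage", "gpu.h100.sxm5", 100.0, 450.0),
    FocusRow("AcceleratorCompute", "Tax", "gpu.mi300x.oam", 0.0, 12.0),
]
rates = rate_by_sku(rows)
print(rates)
```

Because both vendors' usage lands under the same `ServiceName` and `ChargeCategory`, the per-SKU comparison needs no vendor-specific parsing.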

Migration and alternatives

When MI300X is the right choice and when it isn't. The dominant migrations are H100/H200 -> MI300X for memory-pressured serving on ROCm-compatible workloads, and MI300X -> MI325X / MI355X for AMD generation rolls. Two heuristics: choose MI300X when your serving stack is on vLLM, SGLang or Hugging Face Transformers and you want a multi-vendor supply hedge; stay on H-series when your production path is bound to TensorRT-LLM engines, ModelOpt FP4 calibration, or NVIDIA CC-on FedRAMP attestation.

From / to	When it pays	Migration effort	Key incompatibility
NVIDIA H100/H200 -> MI300X	Memory-pressured serving; multi-vendor strategy	High — CUDA -> ROCm rewrite	TensorRT-LLM, ModelOpt, CUDA kernels not portable
NVIDIA H100/H200 -> MI300X (vLLM-only)	Already on vLLM with no TRT-LLM dependency	Medium — ROCm wheels, RCCL, AMD Quark for FP8	Custom CUDA kernels; some quantisation tooling
MI300X -> MI325X	More HBM (256 GB HBM3e), refreshed binning	Trivial — same gfx942 generation, same ROCm baseline	None — same software
MI300X -> MI355X / MI400	Next-generation FP4, newer fabric	Medium — new gfx target, software lift	New gfx target; rebuild kernels
MI300X -> NVIDIA H-series (back-migrate)	FedRAMP CC-on requirement; supply correction	High — rewrite ROCm paths back to CUDA	RCCL -> NCCL; ROCm extensions
MI300X -> Intel Gaudi 2/3	Multi-vendor strategy beyond NVIDIA + AMD	Very high — third software stack	Habana SynapseAI vs ROCm vs CUDA

Pitfalls and operational notes

Three categories account for the majority of MI300X production incidents: ROCm version pinning, RCCL behavioural differences from NCCL, and xGMI topology assumptions carried over from NVLink Switch System expectations. The MI300X has the silicon to deliver — the operational risk lives in the software baseline.

ROCm version pinning is the single highest-leverage operational discipline on MI300X. ROCm releases move faster than the data-centre community is used to from CUDA and break compatibility silently across minor versions. The pattern is: pin ROCm at 6.3.x as the 2026 baseline, pin matching torch wheels from AMD's wheel index, pin matching vLLM wheels from the same index, and treat the combination as one immutable artefact. Mixing ROCm 6.0 drivers with ROCm 6.3 user-space libraries is the single most common production outage on MI300X fleets. The `HIP error: invalid device function` symptom on kernel launch is the canonical fingerprint: it means the user-space libraries are newer than the kernel-space driver, or the custom kernels were compiled against the wrong `gfx` target. Rebuild custom kernels with `--offload-arch=gfx942` (the older `--amdgpu-target` spelling is deprecated in hipcc) and tighten the version pins.
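
One way to enforce the "one immutable artefact" rule is a startup check that compares what a node reports against the pinned baseline. The sketch below is a minimal, assumption-laden version: the dictionary keys and the `pin_mismatches` helper are hypothetical, and the observed values are filled in by hand here. In a real check the user-space version could come from `torch.version.hip` (a string on ROCm builds of PyTorch, `None` on CUDA builds), with the driver release read from your node inventory.

```python
def normalise(value):
    """Keep major.minor of a dotted version; strip feature suffixes from a gfx name."""
    value = value.split(":")[0]              # 'gfx942:sramecc+:xnack-' -> 'gfx942'
    return ".".join(value.split(".")[:2])    # '6.3.1' -> '6.3'


def pin_mismatches(observed, pinned):
    """List every pinned component whose observed value is missing or different."""
    problems = []
    for key, want in pinned.items():
        got = observed.get(key)
        if got is None:
            problems.append(f"{key}: expected {want}, found nothing")
        elif normalise(got) != normalise(want):
            problems.append(f"{key}: expected {want}, found {got}")
    return problems


# Hypothetical baseline and node report, matching the failure mode described above.
PINNED = {"driver": "6.3", "userspace": "6.3", "gfx": "gfx942"}
observed = {"driver": "6.0.2", "userspace": "6.3.1", "gfx": "gfx942:sramecc+:xnack-"}

print(pin_mismatches(observed, PINNED))
# prints: ['driver: expected 6.3, found 6.0.2']
```

Failing the pod or job at startup on a non-empty result is cheaper than diagnosing `invalid device function` after the first kernel launch.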

RCCL is API-compatible with NCCL but the topology autodetection behaviour differs enough to matter. RCCL AllReduce hangs at job start usually trace to topology autodetection failing on heterogeneous xGMI + Ethernet clusters; setting `RCCL_TOPO_FILE=/etc/rccl/topo.xml` (generate with `rccl-tests`) and verifying with `NCCL_DEBUG=INFO` (RCCL respects the NCCL environment variables) resolves it. xGMI aggregate bandwidth below 600 GB/s usually means a single xGMI link is down or mis-routed, not a baseboard fault; `rocm-smi --showxgmierr` plus a module reseat is the first step. Persistent issues usually trace to a topology assumption: xGMI is a point-to-point mesh with one 128 GB/s link to each of the seven peers, not a switched NVSwitch fabric where any pair of GPUs can use the full aggregate bandwidth.

Three software-stack pitfalls round out the high-frequency list. Flash Attention import failures on ROCm 6.0 happen because production-grade FA2 only landed reliably in ROCm 6.3+; upgrade or fall back to `attn_implementation='sdpa'`. `miopenStatusUnsupportedOp` during model load means the MIOpen autotuning database lacks entries for the model's specific op shapes: run with `MIOPEN_FIND_MODE=NORMAL`, let autotune run once, and persist `~/.config/miopen/`. FP8 quantisation accuracy regressions after AMD Quark trace to a too-small calibration dataset or AWQ block-size mismatch; increase `--calib-samples` to 2048 or higher and consider excluding attention QKV projections from FP8.

Inference throughput at half the expected rate almost always means the wrong wheels are installed, with a mainline (CUDA) build of vLLM or PyTorch shadowing the native ROCm build. Confirm that `python -c "import torch; print(torch.version.hip)"` prints a HIP version rather than `None`, uninstall any mainline pip vLLM wheels, and reinstall from AMD's wheel index. Compute-partition mode flips (SPX to CPX) are destructive: drain workloads, mark the node unschedulable, change mode with `amd-smi set --compute-partition CPX`, then redeploy. And MI300X hosts must be vendor-pure: do not co-install NVIDIA drivers on the same kernel, because kernel panics on driver upgrade trace to that pattern directly.

On isolation and compliance: MI300X exposes SR-IOV with up to 8 virtual functions per GPU and compute-partition modes (SPX exposes the full 304 CUs as one device; CPX exposes 8 partitions of 38 CUs each with isolated HBM allocations) for multi-tenant scheduling. There is no NVIDIA-CC-on-equivalent attested confidential-compute path in 2026 — for FedRAMP-Moderate sovereign workloads, NVIDIA H-series with CC-on remains the recommended SKU. UK NCSC and EU GDPR postures are achievable on the host environment but the GPU itself does not currently provide attested confidential-compute equivalents.

Where this fits in the Yobitel stack

MI300X is Yobitel's primary multi-vendor anchor in 2026. Yobibyte — our AI-native managed platform — places memory-pressured serving workloads on MI300X pools where customer compliance permits and where the model is ROCm-compatible (the overwhelming majority of open-weight models are), falling back to NVIDIA H-series for FedRAMP-Moderate sovereign workloads, TensorRT-LLM-bound paths and ModelOpt FP4 calibration. The vLLM and PyTorch commands in this entry are exactly what Yobibyte reconciles under the hood; the customer specifies model, region, replica count and spend cap, and the platform selects the SKU and the FP precision — including which vendor.

Omniscient Compute — our cross-cloud capacity broker — indexes MI300X capacity across Azure ND MI300X v5, Oracle BM.GPU.MI300X.8, and Tier-1/Tier-2 neocloud partners (TensorWave, Crusoe, Hot Aisle, Vultr), normalises pricing onto the FinOps Foundation FOCUS spec, and arbitrages workloads to the cheapest region that meets the workspace's residency posture. Because MI300X frequently clears 15-25 % cheaper than H100 SXM5 at the same commitment tier, multi-vendor procurement via Omniscient Compute is one of the highest-impact FinOps levers in the stack — often the single largest cost-saving recommendation in customer onboarding reviews.

InferenceBench — our public, reproducible benchmarking harness — publishes MI300X throughput, latency and cost-per-token numbers for every major open-weight model across vLLM and SGLang on ROCm, with side-by-side H100/H200/B200 comparisons. The sizing tables above are anchored on InferenceBench runs. If you are sizing a 2026 multi-vendor footprint, start with InferenceBench to see where MI300X wins on cost-per-token and where it loses, lift the platform configuration into the Yobibyte workspace, and let Omniscient Compute pick the region and the vendor.

References

AMD Instinct MI300X Datasheet · AMD
AMD CDNA 3 Architecture Whitepaper · AMD
ROCm Documentation · AMD
AMD GPU Operator for Kubernetes · AMD
vLLM ROCm support guide · vLLM
AMD Quark Quantisation Toolchain · AMD
RCCL — ROCm Collective Communication Library · AMD
FinOps Foundation FOCUS billing specification · FinOps Foundation
NCSC Cloud Security Principles · UK NCSC

TL;DR

CDNA 3-based data centre GPU announced December 2023 (Advancing AI event), volume shipping Q1 2024. Eight XCD compute dies (304 Compute Units total) plus four IOD memory dies on a single 3D-stacked CoWoS package, with eight HBM3 stacks totalling 192 GB at 5.3 TB/s — the largest single-package HBM pool of its generation.
Native FP8 support via the Matrix Core engine: 1,307 TFLOPS BF16/FP16 (sparse), 2,614 TFLOPS FP8 (sparse), 2,614 TOPS INT8 (sparse). FLOPS sit between H100 SXM5 and H100 PCIe Gen5 in practice; serving wins come from the 192 GB pool, not raw throughput.
OAM form factor at 750 W TDP. The 8-GPU baseboard uses AMD Infinity Fabric (xGMI) at 896 GB/s aggregate per GPU — not equivalent to NVLink Switch System but enough for 8-GPU non-blocking topologies; multi-node fabrics use 400 Gb/s Ethernet or InfiniBand NDR.
Software stack: ROCm 6.x (6.3+ recommended), PyTorch ROCm backend, vLLM with first-class MI300X paths, SGLang MI300X engines added 2025, hipBLASLt + AITER (AMD's TensorRT-LLM equivalent), RCCL as the NCCL replacement. TensorRT-LLM does NOT run on AMD; CUDA kernels are not portable.
Pricing through 2026: roughly $4.00/GPU-hr on-demand, $3.00 one-year reserved, $2.45 three-year reserved. Materially cheaper than H100 SXM5 at the same commitment tier and the multi-vendor signal that anchors enterprise procurement strategy.

Overview#

How it works: CDNA 3 chiplet architecture and the Matrix Core engine#

Eight XCDs (CDNA 3 compute dies) + four IODs (memory/IO dies) on a 3D-stacked CoWoS package.
304 Compute Units total (38 CUs per XCD); 1,216 Matrix Cores (4 per CU); 19,456 stream processors.
Eight HBM3 stacks (24 GB each) totalling 192 GB at 5.3 TB/s.
256 MB AMD Infinity Cache on the IODs; bandwidth amplifier for memory-bound serving.
Native FP8 E4M3 / E5M2, BF16, FP16, TF32, FP64; 2:4 structured sparsity supported.
xGMI (Infinity Fabric over PCIe): 128 GB/s per link, 7 links per GPU on an OAM baseboard = 896 GB/s aggregate.
OAM form factor at 750 W TDP; air-cooled chassis viable up to 8 GPUs per node.

Subsystem	CDNA 3 detail	Practical consequence
Matrix Core (CDNA 3)	FP8 E4M3/E5M2, BF16, FP16, TF32, INT8, INT4	Inference parity with H100 Tensor cores on supported precisions.
Infinity Cache (256 MB on IODs)	Coherent L3-equivalent; KV-cache and weight reuse	Bandwidth amplifier; long-context decode sustains higher effective bandwidth than HBM3 raw.
Chiplet design	8 XCDs + 4 IODs 3D-stacked on CoWoS	Yield headroom; smaller dies; ships more transistors per dollar.
xGMI fabric	896 GB/s aggregate per GPU on 8-GPU OAM baseboard	Comparable to NVLink 4.0 inside one node; NOT equivalent to NVLink Switch System across nodes.
HBM3 stack count	8 stacks x 24 GB = 192 GB at 5.3 TB/s	Largest single-package HBM pool of 2024-2025; 70B-180B serving on 1 card.

Reference: full specification sheet#

Metric	MI300X (OAM)
Architecture	CDNA 3
Process	TSMC 5 nm (XCDs) + 6 nm (IODs)
Transistors	153 billion
Chiplets	8 XCDs + 4 IODs + 8 HBM3 stacks on CoWoS
Compute Units	304 (38 per XCD)
Matrix Cores	1,216 (4 per CU)
Stream processors	19,456
Infinity Cache (L3-equivalent)	256 MB
LDS (per CU)	64 KB
FP64 (Matrix)	163 TFLOPS
FP64 (Vector)	81 TFLOPS
FP32 (Matrix)	163 TFLOPS
FP32 (Vector)	81 TFLOPS
TF32 (Matrix, sparse)	653 TFLOPS
BF16 / FP16 (Matrix, sparse)	1,307 TFLOPS
BF16 / FP16 (Matrix, dense)	653 TFLOPS
FP8 (Matrix, sparse)	2,614 TFLOPS
FP8 (Matrix, dense)	1,307 TFLOPS
INT8 (Matrix, sparse)	2,614 TOPS
Memory	192 GB HBM3 (8 stacks x 24 GB)
Memory bandwidth	5.3 TB/s
TDP	750 W
xGMI (Infinity Fabric)	896 GB/s aggregate (7 links x 128 GB/s)
PCIe	Gen5 x16 (128 GB/s)
Form factor	OAM 1.5
Cooling	Air-cooled OAM chassis up to 8 GPUs / node
Software baseline	ROCm 6.3+ recommended
Compute capability (gfx target)	gfx942
Multi-tenant partitioning	SR-IOV (single root I/O virtualisation), compute-partition modes

Workload pattern A: Llama 3.1 70B BF16 on a single MI300X#

bash

# 70B BF16 on 1x MI300X with vLLM + ROCm 6.3
HIP_VISIBLE_DEVICES=0 vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --max-model-len 131072 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --host 0.0.0.0 --port 8000

# Optional: pre-quantise to FP8 with AMD Quark for ~1.6x throughput
quark-quantize \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --output ./Llama-3.1-70B-Instruct-FP8-AMD \
  --quant-format fp8 --calib-dataset wikitext-2

HIP_VISIBLE_DEVICES=0 vllm serve ./Llama-3.1-70B-Instruct-FP8-AMD \
  --tensor-parallel-size 1 \
  --quantization fp8 --kv-cache-dtype fp8_e5m2 \
  --max-model-len 131072 --max-num-seqs 64 \
  --gpu-memory-utilization 0.92

Workload pattern B: 8x MI300X training cluster#

RCCL is API-compatible with NCCL; PyTorch keeps the backend name 'nccl' for source compatibility.
Flash Attention 2 is fully ported to ROCm; Flash Attention 3 ports landed in ROCm 6.3 (production-acceptable from late 2025).
Megatron-Core supports ROCm with the same TP/PP/DP semantics as on NVIDIA; rebuild from source against ROCm 6.3.
MFU on training: 50-60 % typical on MI300X for 7B-13B BF16, slightly below H100's 60-65 % at iso-precision.

python

# train_13b.py: 13B BF16 training skeleton on 8x MI300X with PyTorch DDP + Hugging Face Transformers
# Deps (ROCm 6.3 wheels):
#   pip install torch --index-url https://download.pytorch.org/whl/rocm6.3
#   pip install transformers datasets flash-attn   # flash-attn must be a ROCm build

import os

import torch
import torch.distributed as dist
from datasets import load_dataset
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-13b-hf"

os.environ.setdefault("NCCL_DEBUG", "INFO")  # RCCL reads the NCCL_* debug variables
# RCCL is API-compatible with NCCL; PyTorch uses 'nccl' as the backend name on ROCm too.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)  # On ROCm, torch.cuda maps to HIP

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # ROCm port of FA2
).to(local_rank)
# DDP keeps a full replica (weights, gradients, optimizer state) on every GPU.
model = DDP(model, device_ids=[local_rank])

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
ds = load_dataset("EleutherAI/pile", split="train", streaming=True)
# ... standard PyTorch training loop with gradient accumulation ...

dist.destroy_process_group()  # clean shutdown once training finishes

# Launch with:
#   torchrun --nproc_per_node=8 --nnodes=1 train_13b.py

Workload pattern C: Mixtral 8x7B serving with expert parallelism#

bash

# Mixtral 8x7B BF16 on 2x MI300X with vLLM expert parallelism
HIP_VISIBLE_DEVICES=0,1 \
NCCL_P2P_LEVEL=NVL \
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --host 0.0.0.0 --port 8000
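
To make the effect of expert parallelism concrete: Mixtral 8x7B has 8 experts per MoE layer and routes each token to 2 of them, so splitting the experts across 2 GPUs puts 4 on each rank. The sketch below assumes a simple contiguous partition and hypothetical helper names. It is not vLLM's actual placement logic, only an illustration of how routing decisions turn into per-rank load.

```python
def expert_to_rank(num_experts, ep_size):
    """Map each expert id to a rank using a contiguous, even partition."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    experts_per_rank = num_experts // ep_size
    return {expert: expert // experts_per_rank for expert in range(num_experts)}


def assignments_per_rank(routing, mapping, ep_size):
    """Count token-to-expert assignments landing on each rank.

    `routing` holds one tuple of selected expert ids per token (top-2 for Mixtral).
    """
    counts = [0] * ep_size
    for selected_experts in routing:
        for expert in selected_experts:
            counts[mapping[expert]] += 1
    return counts


mapping = expert_to_rank(num_experts=8, ep_size=2)
# Three tokens with hypothetical top-2 routing decisions.
routing = [(0, 5), (6, 7), (1, 2)]
print(assignments_per_rank(routing, mapping, ep_size=2))
```

Skewed routing shows up as unequal counts between ranks, which is the load imbalance to watch for when benchmarking expert-parallel serving.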

Sizing and capacity planning#

Training rule of thumb: 1 trillion tokens x 70B parameters at BF16 needs roughly 350-450 MI300X-days on 64-GPU clusters — between H100 SXM5 (250-350) and H100 PCIe Gen5 (450-550).
Memory ceiling for a single MI300X: weights + KV cache + activations + hipBLASLt scratch must stay under roughly 188 GB usable (of 192 GB); beyond that, expect OOMs.
For 500 RPS at 4K-token output, Llama 3.1 70B BF16 needs roughly 4-5 MI300X replicas vs 6-8 H100 SXM5 FP8 replicas — typical fleet compression ratio of 1.5x.
xGMI overhead at TP=8 inside one OAM baseboard: 10-14 % of step time for 70B BF16 — slightly higher than NVLink 4.0's 6-9 % on H100, reflecting the lower aggregate fabric bandwidth.
Spot/preemptible MI300X capacity is limited in 2026 — most providers reserve MI300X for committed customers.
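
The memory ceiling above can be turned into a quick pre-deployment check. The sketch below estimates how many tokens of KV cache fit beside the weights within the 188 GB usable budget. The function names and the `overhead_gb` allowance for activations and scratch are assumptions to tune against real measurements. The layer geometry used in the example (80 layers, 8 KV heads, head dimension 128) is that of Llama 3.1 70B.

```python
GB = 1e9


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """KV-cache bytes per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(params, bytes_per_param, kv_per_token,
                      usable_gb=188.0, overhead_gb=0.0):
    """Tokens of KV cache that fit after weights and a fixed overhead allowance."""
    budget = usable_gb * GB - params * bytes_per_param - overhead_gb * GB
    if budget <= 0:
        return 0
    return int(budget // kv_per_token)


# Llama 3.1 70B in BF16: 2 bytes per weight and per KV element.
kv_bf16 = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_elem=2)
print(kv_bf16)                               # bytes of KV cache per token
print(max_cached_tokens(70e9, 2, kv_bf16))   # total tokens across all sequences
```

The token count is shared across all concurrent sequences, so divide it by the expected context length to estimate how many requests can be resident at once. Switching weights or KV cache to FP8 halves the corresponding `bytes_per_param` or `bytes_per_elem`.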

Model size	Precision	Context	GPUs per replica	TP / EP	Approx output TPS	Approx VRAM headroom
7B (Mistral, Qwen)	BF16	8K	1x MI300X	1 / 1	4,500-5,800	175 GB free
7B (Mistral, Qwen)	FP8 (Quark)	8K	1x MI300X	1 / 1	7,000-8,500	180 GB free
13B	BF16	8K	1x MI300X	1 / 1	3,000-3,800	165 GB free
34B (Yi, Codestral)	BF16	8K	1x MI300X	1 / 1	1,500-1,900	120 GB free
70B (Llama 3.1)	BF16	8K	1x MI300X	1 / 1	1,200-1,500	50 GB free
70B (Llama 3.1)	FP8 (Quark)	32K	1x MI300X	1 / 1	1,800-2,200	100 GB free
70B (Llama 3.1)	FP8 (Quark)	128K	1x MI300X	1 / 1	1,900-2,400	60 GB free
140B MoE (Mixtral 8x22B)	BF16	32K	2x MI300X	2 / 2	1,400-1,800	60 GB free per rank
180B (Falcon, Bloom)	BF16	8K	4x MI300X	4 / 1	650-850	100 GB free per rank
405B (Llama 3.1)	BF16	32K	8x MI300X	8 / 1	300-400	60 GB free per rank

Cost & TCO#

Cost-per-million-output-tokens on Llama 3.1 70B FP8 (Quark) at 32K context, 1x MI300X at $4.00/GPU-hr and 2,000 TPS sustained: roughly $0.55 per million tokens — competitive with H100 SXM5 FP8 ($0.50) and ~25 % cheaper than H200 FP8 ($0.74).
Commitment savings: 1y reserved is ~25 % off on-demand and 3y reserved is ~39 % off, closely tracking H100 discount tiers.
FP8 via AMD Quark yields ~1.5-1.6x throughput vs BF16 at iso-quality on most chat models — same lever as Transformer Engine FP8 on NVIDIA.
Egress and inter-region data movement frequently exceed 10 % of the total MI300X bill at hyperscalers, so co-locate model artefacts with compute.
Multi-vendor procurement value: an MI300X commitment alongside an H100 commitment hedges single-vendor supply risk and gives FinOps leverage at renewal time.
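
The cost-per-token and discount figures above follow from simple arithmetic, sketched here so the same calculation can be rerun with your own rates and measured throughput. The function names are hypothetical. The inputs in the example are the $4.00/GPU-hr on-demand rate, the $3.00 and $2.45 reserved rates, and the 2,000 TPS sustained figure quoted in this section.

```python
def cost_per_million_tokens(gpu_hour_usd, tokens_per_second, gpus_per_replica=1):
    """USD per million output tokens for one replica at sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd * gpus_per_replica / tokens_per_hour * 1_000_000


def reserved_discount(on_demand_usd, reserved_usd):
    """Fractional saving of a reserved rate relative to on-demand."""
    return 1.0 - reserved_usd / on_demand_usd


print(f"${cost_per_million_tokens(4.00, 2000):.3f} per million tokens")
print(f"1y reserved: {reserved_discount(4.00, 3.00):.1%} off")
print(f"3y reserved: {reserved_discount(4.00, 2.45):.1%} off")
```

At $4.00/GPU-hr and 2,000 TPS this gives about $0.556 per million tokens, in line with the rough $0.55 quoted above. Sustained throughput is the sensitive input: halving it doubles the cost per token.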

Provider class	SKU	On-demand $/GPU-hr	1y reserved	3y reserved	Notes
Azure	ND MI300X v5	$4.00	$3.00	$2.45	UK West, East US, Sweden Central regions.
Oracle Cloud	BM.GPU.MI300X.8	$3.95	$2.95	$2.40	AMD's launch partner; UK South region available.
Tier-1 neocloud (TensorWave, Crusoe)	MI300X OAM	$3.50	$2.75	$2.25	Frequently cheapest at scale; verify xGMI topology.
Tier-2 neocloud (Hot Aisle, Vultr, etc.)	MI300X OAM	$3.20	$2.55	$2.10	Best raw rate; expect more variance in fabric topology.
Spot/preemptible	MI300X OAM	$2.00-2.80 where available	n/a	n/a	Limited capacity; fine-tunes only.
Yobitel NeoCloud (UK + EU)	MI300X OAM	$3.40-3.70	$2.65-2.90	$2.20-2.40	NCSC OFFICIAL-aligned multi-vendor anchor; FOCUS-conformant billing.
Yobitel Omniscient Compute	MI300X multi-cloud	Market-clearing	Commit-discounted	Commit-discounted	Cross-provider arbitrage on top of NeoCloud + partner capacity.

Migration and alternatives#

From / to	When it pays	Migration effort	Key incompatibility
NVIDIA H100/H200 -> MI300X	Memory-pressured serving; multi-vendor strategy	High — CUDA -> ROCm rewrite	TensorRT-LLM, ModelOpt, CUDA kernels not portable
NVIDIA H100/H200 -> MI300X (vLLM-only)	Already on vLLM with no TRT-LLM dependency	Medium — ROCm wheels, RCCL, AMD Quark for FP8	Custom CUDA kernels; some quantisation tooling
MI300X -> MI325X	More HBM (256 GB HBM3e), refreshed binning	Trivial — same gfx942 generation, same ROCm baseline	None — same software
MI300X -> MI355X / MI400	Next-generation FP4, newer fabric	Medium — new gfx target, software lift	New gfx target; rebuild kernels
MI300X -> NVIDIA H-series (back-migrate)	FedRAMP CC-on requirement; supply correction	High — rewrite ROCm paths back to CUDA	RCCL -> NCCL; ROCm extensions
MI300X -> Intel Gaudi 2/3	Multi-vendor strategy beyond NVIDIA + AMD	Very high — third software stack	Habana SynapseAI vs ROCm vs CUDA

Pitfalls and operational notes#

Where this fits in the Yobitel stack#

References

AMD Instinct MI300X Datasheet · AMD
AMD CDNA 3 Architecture Whitepaper · AMD
ROCm Documentation · AMD
AMD GPU Operator for Kubernetes · AMD
vLLM ROCm support guide · vLLM
AMD Quark Quantisation Toolchain · AMD
RCCL — ROCm Collective Communication Library · AMD
FinOps Foundation FOCUS billing specification · FinOps Foundation
NCSC Cloud Security Principles · UK NCSC
