FP8 Training

TL;DR

Two FP8 formats together cover the dynamic range of transformer training: E4M3 (4 exponent bits, 3 mantissa bits) for forward activations and weights, and E5M2 (5 exponent bits, 2 mantissa bits) for backward gradients.
Hardware support: native on NVIDIA Hopper (H100, H200), Blackwell (B100, B200, GB200), AMD MI300, and AWS Trainium 2.
Used in production pretraining for DeepSeek-V3 and NVIDIA Nemotron models, among other 2024+ frontier runs, typically delivering 1.4-1.8× training throughput over BF16 at iso-quality. Llama 3 was pretrained in BF16 and uses FP8 at inference time.

Overview

Hopper's FP8 support introduced two 8-bit floating-point formats jointly specified by NVIDIA, Arm, and Intel (Micikevicius et al., 2022, arXiv:2209.05433). E4M3 has more mantissa precision but limited range, making it suitable for the forward pass and weights. E5M2 has FP16-like range but coarser precision, matching the larger dynamic spread of backward gradients. Together they cover the precision-range requirements of transformer training.
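
To make the range versus precision trade-off concrete, the sketch below enumerates every finite non-negative value each format can represent. It assumes the encodings described in Micikevicius et al. (2022): E4M3 with exponent bias 7 and only the all-ones bit pattern reserved for NaN, and E5M2 with bias 15 and the whole top exponent reserved for infinities and NaNs. The helper name `fp8_values` is illustrative, not a library API.

```python
def fp8_values(exp_bits, man_bits, bias, ieee_specials):
    """Return the sorted finite non-negative values of an 8-bit float format."""
    top_exp = (1 << exp_bits) - 1
    top_man = (1 << man_bits) - 1
    values = []
    for e in range(top_exp + 1):
        for m in range(top_man + 1):
            # E5M2 reserves the whole top exponent; E4M3 reserves only all-ones.
            if e == top_exp and (ieee_specials or m == top_man):
                continue
            frac = m / (1 << man_bits)
            if e == 0:
                values.append(frac * 2.0 ** (1 - bias))          # subnormal
            else:
                values.append((1.0 + frac) * 2.0 ** (e - bias))  # normal
    return sorted(values)


e4m3 = fp8_values(4, 3, bias=7, ieee_specials=False)
e5m2 = fp8_values(5, 2, bias=15, ieee_specials=True)

for name, vals in (("E4M3", e4m3), ("E5M2", e5m2)):
    # Largest value, smallest positive value, and the gap just above 1.0.
    gap = vals[vals.index(1.0) + 1] - 1.0
    print(f"{name}: max={vals[-1]}, min_positive={vals[1]}, step_above_1={gap}")
```

Under these assumptions E4M3 tops out at 448 with a step of 0.125 just above 1.0, while E5M2 reaches 57,344 with a coarser step of 0.25.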

Unlike FP16 / BF16, FP8 requires per-tensor scaling to fit each tensor into the format's narrow range. NVIDIA's Transformer Engine tracks running statistics (amax — absolute maximum) of each FP8 tensor and rescales on the fly. This adds bookkeeping but is largely transparent at the framework level.
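
The need for per-tensor scaling can be illustrated with a small pure-Python simulation. This is a sketch under stated assumptions, not Transformer Engine's implementation: `round_to_e4m3`, `quantise`, and `dequantise` are hypothetical helpers, the scale is chosen so that the tensor's amax lands on the E4M3 maximum of 448, and the input values are made up for illustration.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 magnitude


def round_to_e4m3(x):
    """Round a Python float to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    exponent = math.frexp(a)[1] - 1   # floor(log2(a))
    exponent = max(exponent, -6)      # below 2**-6 the spacing stays fixed (subnormals)
    step = 2.0 ** (exponent - 3)      # 3 mantissa bits give 8 steps per binade
    return math.copysign(round(a / step) * step, x)


def quantise(tensor):
    """Per-tensor scaling: map the tensor's amax onto E4M3_MAX, then round."""
    amax = max(abs(v) for v in tensor)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in tensor], scale


def dequantise(q, scale):
    """Multiply the stored values back by the scale."""
    return [v * scale for v in q]


grads = [1e-4, 5e-4, -2e-3, 1e-3]   # illustrative small-magnitude tensor
unscaled = [round_to_e4m3(v) for v in grads]
q, scale = quantise(grads)
restored = dequantise(q, scale)
print("unscaled:", unscaled)
print("restored:", restored)
```

Without scaling, the two smallest entries flush to zero because they fall below the smallest E4M3 subnormal. With scaling, every entry is recovered to within a few percent.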

Mechanism

A typical FP8 transformer block does the matmul in FP8 (with FP32 accumulation), keeps activations in FP8 between layers, and stores weights as FP8 with an FP32 master copy. LayerNorm, softmax, and the optimiser run at higher precision (BF16 or FP32) as before. The pattern is mixed precision extended one rung lower.
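
The reason the matmul accumulates in higher precision can be seen in a toy dot product. The sketch below is a simulation only: `round_to_e4m3` is a hypothetical helper that rounds to the nearest E4M3 value, and a Python float stands in for the wide (FP32) accumulator.

```python
import math

E4M3_MAX = 448.0


def round_to_e4m3(x):
    """Round a Python float to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    exponent = max(math.frexp(a)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)
    return math.copysign(round(a / step) * step, x)


def dot(a, b, narrow_accumulator):
    """Dot product with FP8 operands and either a wide or an FP8 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += round_to_e4m3(x) * round_to_e4m3(y)  # FP8 operands, wide product
        if narrow_accumulator:
            acc = round_to_e4m3(acc)                # running sum squeezed into FP8
    return acc


a = [0.5] * 64
b = [0.5] * 64
wide = dot(a, b, narrow_accumulator=False)
narrow = dot(a, b, narrow_accumulator=True)
print("wide accumulator:", wide)
print("FP8 accumulator: ", narrow)
```

With the wide accumulator the result is the exact 16.0. With the FP8 accumulator the sum stalls at 4.0, because once the running total reaches 4.0 the E4M3 spacing (0.5) is too coarse to register another 0.25.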

Each FP8 tensor carries a scale factor, typically a single FP32 value or a per-block scale. Before quantising to FP8, the tensor is divided by the scale; on dequantisation, it is multiplied back. The Transformer Engine library uses a delayed-scaling scheme: it records the amax of each tensor as it is produced and derives the scale for the next iteration from that history, which avoids an extra synchronous reduction over the tensor before it can be quantised.
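
A minimal sketch of delayed scaling follows. It is not Transformer Engine's code: the `DelayedScaler` class, its history length, the initial amax of 1.0, and the sample tensors are all assumptions chosen for illustration. The scale for each step is derived only from amax values recorded on earlier steps.

```python
from collections import deque

E4M3_MAX = 448.0


class DelayedScaler:
    """Pick each step's scale from amax values recorded on earlier steps."""

    def __init__(self, history_len=4, initial_amax=1.0):
        self.amax_history = deque([initial_amax], maxlen=history_len)

    def amax_estimate(self):
        # Only past statistics are used, so the new tensor is not scanned first.
        return max(self.amax_history)

    def current_scale(self):
        return self.amax_estimate() / E4M3_MAX

    def record(self, tensor):
        self.amax_history.append(max(abs(v) for v in tensor))


scaler = DelayedScaler()
steps = [[0.5, -1.0], [0.8, 1.2], [6.0, -0.3], [5.0, 2.0]]  # illustrative tensors
saturated_per_step = []
for i, tensor in enumerate(steps):
    limit = scaler.amax_estimate()
    # Values above the estimated amax would exceed E4M3_MAX after scaling.
    saturated = sum(abs(v) > limit for v in tensor)
    saturated_per_step.append(saturated)
    print(f"step {i}: scale={scaler.current_scale():.5f}, saturated={saturated}")
    scaler.record(tensor)
```

In this toy run a sudden jump in magnitude saturates for one step before the history catches up, which is one plausible contributor to the early-training spikes noted under Pitfalls.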

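FP8 is an optimisation for the forward and backward matrix multiply-accumulate (MMA) operations. In the standard recipe, optimiser state and master weights still live in higher precision, so there is no 'FP8 Adam' in common production use. Memory savings come from activations (and, at inference time, the KV cache), not from the optimiser.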

Performance Characteristics

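Compute: 2× tensor-core throughput vs BF16 on Hopper (3,958 TFLOPS FP8 vs 1,979 TFLOPS BF16 on H100 SXM5, both with structured sparsity; the dense figures that apply to ordinary training matmuls are half of each, so the 2× ratio is unchanged).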
Memory: FP8 activations take half the bytes of BF16 activations; the per-token KV cache likewise halves.
Quality: typically within 0.1-0.3 percentage points of BF16 on standard benchmarks; published production runs (for example DeepSeek-V3 and Nemotron) report quality on par with BF16.
End-to-end training speedup: 1.4-1.8× over BF16 in real pretraining workloads, accounting for non-FP8 overhead.
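
The gap between the 2× compute figure and the 1.4-1.8× end-to-end figure can be read through an Amdahl-style estimate. This framing is an assumption added here, not a measurement from the sources above: it supposes that only the FP8-eligible matmul share of the BF16 step time speeds up, by exactly 2×, and everything else is unchanged.

```python
def end_to_end_speedup(matmul_fraction, matmul_speedup=2.0):
    """Amdahl-style estimate: only the FP8-eligible share of step time gets faster."""
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / matmul_speedup)


def required_matmul_fraction(target_speedup, matmul_speedup=2.0):
    """Invert the estimate: share of BF16 step time that must be FP8-eligible."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / matmul_speedup)


for target in (1.4, 1.8):
    frac = required_matmul_fraction(target)
    print(f"{target}x end-to-end needs about {frac:.0%} of step time in FP8 matmuls")
```

Under this simple model, a 1.4× speedup corresponds to roughly 57% of BF16 step time being FP8-eligible matmul, and 1.8× to roughly 89%.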

When to Use

Use FP8 for pretraining and large fine-tunes on Hopper or Blackwell hardware. The Transformer Engine integration in Megatron-LM, NeMo, and DeepSpeed makes adoption mostly a configuration flip. For inference on Hopper, FP8 KV cache and FP8 matmul are now standard (vLLM, TensorRT-LLM, SGLang all support it).

Stay on BF16 when running on Ampere or older hardware (no FP8 tensor cores), on workloads where the model has unusual numerical sensitivity (some scientific-ML and small-model training), or in environments where the extra scaling-state complexity is not worth the throughput gain.
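
The hardware part of this decision can be sketched as a check on the GPU's CUDA compute capability. The function names below are hypothetical, and the threshold of 8.9 rests on the assumption that NVIDIA FP8 tensor cores first appear at compute capability 8.9 (Ada) and 9.0 (Hopper), with Ampere at 8.0 to 8.7. In PyTorch the tuple could come from `torch.cuda.get_device_capability()`; here it is passed in directly so the sketch runs without a GPU.

```python
def has_fp8_tensor_cores(compute_capability):
    """True if the (major, minor) capability is assumed to include FP8 tensor cores."""
    return tuple(compute_capability) >= (8, 9)


def choose_precision(compute_capability, numerically_sensitive=False):
    """Pick a training precision following the guidance in the text above."""
    if not has_fp8_tensor_cores(compute_capability):
        return "bf16"   # Ampere or older: no FP8 tensor cores
    if numerically_sensitive:
        return "bf16"   # unusual numerical sensitivity: stay on BF16
    return "fp8"


print(choose_precision((8, 0)))                              # A100-class (Ampere)
print(choose_precision((9, 0)))                              # H100-class (Hopper)
print(choose_precision((9, 0), numerically_sensitive=True))
```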

Pitfalls

Delayed scaling occasionally produces loss spikes early in training; warm up in BF16 for the first 1-2% of steps, then switch to FP8.
Custom CUDA kernels need explicit FP8 support — autocast does not magically make them FP8.
Reduction operations (softmax denominators, layer-norm statistics) must accumulate in FP32 — never in FP8.
Convergence parity is well-established for LLM pretraining but less so for diffusion or scientific ML; always validate against a BF16 baseline run.
FP8 inference adoption is broader than FP8 training adoption — many model checkpoints are BF16, requiring a calibration step before FP8 serving.
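
The BF16 warm-up from the first pitfall can be expressed as a per-step precision switch. This is a sketch only: `precision_for_step` is a hypothetical helper, the default of 2% is taken from the upper end of the range quoted above, and how the returned label maps onto a framework's FP8 toggle is left to the training loop.

```python
def precision_for_step(step, total_steps, bf16_warmup_fraction=0.02):
    """Return 'bf16' during the warm-up window and 'fp8' afterwards."""
    warmup_steps = round(total_steps * bf16_warmup_fraction)
    return "bf16" if step < warmup_steps else "fp8"


total_steps = 100_000
for step in (0, 1_999, 2_000, 50_000):
    print(step, precision_for_step(step, total_steps))
```

A hard switch is the simplest choice. Whichever schedule is used, the run should still be compared against a BF16 baseline, as the convergence pitfall above recommends.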

Software

NVIDIA Transformer Engine (github.com/NVIDIA/TransformerEngine) — reference implementation, exposes FP8 layers as PyTorch modules.
Megatron-LM and NeMo both integrate Transformer Engine for FP8 pretraining.
DeepSpeed FP8 support via Transformer Engine plugin.
PyTorch native FP8 (the torch.float8_e4m3fn and torch.float8_e5m2 dtypes) is maturing; usable in research, not yet a Transformer Engine replacement.
Inference: vLLM, TensorRT-LLM, SGLang all support FP8 weight + KV cache for serving.

References

FP8 Formats for Deep Learning · arXiv (Micikevicius et al., 2022)
Transformer Engine documentation · NVIDIA
Using FP8 with Transformer Engine · GitHub (NVIDIA)
DeepSeek-V3 Technical Report (FP8 in production) · arXiv (DeepSeek-AI, 2024)
