FP4 Training

TL;DR

FP4 (E2M1: 1 sign bit, 2 exponent bits and 1 mantissa bit, plus a shared microscaling factor) is the 4-bit tensor-core format introduced with NVIDIA Blackwell in 2024.
Uses block-level microscaling (MX formats, OCP-standard) — each block of 32 elements shares an E8M0 scale, recovering effective dynamic range despite the narrow per-element format.
Training in pure FP4 is still experimental in 2026; current production usage is mostly FP8 + FP4 hybrid, or FP4-aware quantisation-aware training.

Overview

Blackwell's tensor cores added native FP4 support: 4-bit floating point with a per-block shared exponent. The OCP Microscaling (MX) formats standardised this approach: MXFP4 groups 32 elements into a block with one E8M0 scale per block, which costs 4.25 bits per element (about half the storage of FP8) and gives a far wider effective dynamic range than bare E2M1, although each element still carries only one mantissa bit.

Marketing FLOPS figures for Blackwell are stated in FP4 (e.g. 20 PFLOPS sparse for B200) — that doubles the FP8 figure. Whether real training workloads achieve that ratio is an open question; the public record in 2026 is mixed for full pretraining and positive for inference and post-training quantisation.

Mechanism

Microscaled FP4 stores 32 elements as 32 FP4 values plus a shared E8M0 scale. E8M0 has 256 encodings: 255 power-of-two scales from 2^-127 to 2^127, plus one NaN encoding. The shared scale captures the gross dynamic range of the block; the per-element FP4 captures the fine-grained variation. NVIDIA's Transformer Engine handles the bookkeeping: selecting block scales, casting between FP4 and higher precision, and fusing the dequantisation with the next operation.
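The block scheme above can be sketched in NumPy. This is an illustrative reference under stated assumptions, not the Transformer Engine implementation: the function names `quantize_mxfp4` and `dequantize_mxfp4` are hypothetical, the shared exponent is taken as floor(log2(block max)) minus the largest E2M1 exponent, and ties are rounded toward the smaller magnitude rather than with the round-to-nearest-even that hardware typically uses.

```python
import numpy as np

# Magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_EMAX = 2    # exponent of the largest E2M1 binade (4.0 = 2**2)
BLOCK = 32       # elements sharing one E8M0 scale


def quantize_mxfp4(x):
    """Return (q, exp): per-element E2M1 values and one shared exponent per block.

    Hypothetical helper for illustration; x.size must be a multiple of BLOCK.
    """
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # frexp gives amax = m * 2**e with m in [0.5, 1), so floor(log2(amax)) = e - 1.
    # All-zero blocks fall back to amax = 1 to avoid log2(0).
    _, e = np.frexp(np.where(amax > 0, amax, 1.0))
    exp = np.clip(e - 1 - E2M1_EMAX, -127, 127)       # E8M0 exponent range
    scaled = blocks / np.exp2(exp)
    mag = np.minimum(np.abs(scaled), E2M1_GRID[-1])   # saturate at 6.0
    # Nearest grid point; argmin breaks ties toward the smaller magnitude
    # (an assumption of this sketch, not round-to-nearest-even).
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, exp


def dequantize_mxfp4(q, exp, shape):
    """Multiply each block by its power-of-two scale and restore the shape."""
    return (q * np.exp2(exp)).reshape(shape)


x = np.random.default_rng(0).standard_normal(64)
q, exp = quantize_mxfp4(x)
x_hat = dequantize_mxfp4(q, exp, x.shape)
print("shared exponents:", exp.ravel())
print("max abs error:", np.abs(x - x_hat).max())
```

Values that already lie on the E2M1 grid (times a power of two) round-trip exactly; everything else is rounded to the nearest of the eight magnitudes after scaling, which is where the quantisation error comes from.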

In training, FP4 is typically used for the matmul on weights and activations while gradients and master state stay in higher precision. As with FP8, the optimiser does not see FP4 — it sees FP32 master weights updated from FP32 gradients.
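The split between low-precision matmul inputs and a high-precision optimiser can be illustrated with a toy NumPy loop. It is a sketch under assumptions, not a production recipe: `fake_quant_mxfp4`, `train_step` and `run_demo` are hypothetical names, quantisation is simulated by a quantise-dequantise round trip in software, the backward pass treats quantisation as the identity (a straight-through estimator), and the model is a single linear layer trained with plain SGD.

```python
import numpy as np

GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes


def fake_quant_mxfp4(x, block=32):
    """Quantise-dequantise x in blocks of `block` along the last axis (simulation only)."""
    b = np.asarray(x, dtype=np.float64).reshape(-1, block)
    amax = np.abs(b).max(axis=1, keepdims=True)
    _, e = np.frexp(np.where(amax > 0, amax, 1.0))
    scale = np.exp2(np.clip(e - 1 - 2, -127, 127))    # shared power-of-two scale
    mag = np.minimum(np.abs(b) / scale, GRID[-1])     # saturate at 6.0
    q = GRID[np.abs(mag[..., None] - GRID).argmin(axis=-1)]
    return (np.sign(b) * q * scale).reshape(np.shape(x)).astype(np.float32)


def train_step(w_master, x, y, lr=0.2):
    """One SGD step: FP4-simulated matmul inputs, FP32 gradient and master weights."""
    # Blocks run along the contraction axis, so the weights are quantised transposed.
    w_q = fake_quant_mxfp4(w_master.T).T      # low-precision working copy, shape (K, N)
    x_q = fake_quant_mxfp4(x)                 # low-precision activations, shape (B, K)
    err = x_q @ w_q - y
    loss = float(np.mean(err ** 2))
    # Straight-through estimator: the gradient w.r.t. w_q is applied to w_master.
    grad = (2.0 / err.size) * (x_q.T @ err)
    w_master = (w_master - lr * grad).astype(np.float32)   # optimiser only sees FP32
    return w_master, loss


def run_demo(steps=300, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((128, 64)).astype(np.float32)
    w_true = (0.1 * rng.standard_normal((64, 4))).astype(np.float32)
    y = x @ w_true
    w = np.zeros((64, 4), dtype=np.float32)
    losses = []
    for _ in range(steps):
        w, loss = train_step(w, x, y)
        losses.append(loss)
    return w, losses


w, losses = run_demo()
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

The master copy accumulates updates that are smaller than one FP4 step, which is why the quantised working copy can still make progress over many iterations.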

Performance Characteristics

Compute: 2× peak tensor-core throughput vs FP8 on Blackwell.
Memory: activations and weight working copies shrink to roughly a quarter of their BF16 size (4.25 bits per element once the shared scale is counted, against 16).
Quality: convergence parity with BF16 demonstrated for some workloads, gap remains for others — actively researched.
Hardware: NVIDIA Blackwell (B100, B200, GB200); AMD added FP4 support with its CDNA 4 generation (Instinct MI350 series) and has announced it for the MI400 generation.
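The memory figure above can be checked with a little arithmetic. The sketch below computes the storage cost of an MX block (32 four-bit elements plus one 8-bit scale) and compares it with BF16; the 7-billion-parameter tensor count is a hypothetical example, and the helper names are illustrative.

```python
def mx_bits_per_element(elem_bits=4, scale_bits=8, block=32):
    """Average bits per element once the shared block scale is amortised."""
    return elem_bits + scale_bits / block


def tensor_bytes(n_elements, bits_per_element):
    """Storage in bytes for a tensor at the given average bit width."""
    return n_elements * bits_per_element / 8


mxfp4_bits = mx_bits_per_element()          # 4 + 8/32 = 4.25
n_params = 7_000_000_000                    # hypothetical model size
bf16_gb = tensor_bytes(n_params, 16) / 1e9
mxfp4_gb = tensor_bytes(n_params, mxfp4_bits) / 1e9

print(f"MXFP4 bits per element: {mxfp4_bits}")
print(f"BF16: {bf16_gb:.2f} GB, MXFP4: {mxfp4_gb:.2f} GB")
print(f"reduction vs BF16: {16 / mxfp4_bits:.2f}x")
```

The reduction comes out slightly under 4× because of the per-block scale. It applies only to the working copies: FP32 master weights and optimiser state are unchanged.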

When to Use

As of mid-2026, FP4 is production-ready for inference and post-training quantisation; FP4 pretraining is still being validated across architectures. Conservative recipe: pretrain in FP8 with selective FP4 layers, or perform an FP4 quantisation-aware fine-tune at the end. Aggressive recipe: pure FP4 pretraining on Blackwell, validating against a smaller FP8 reference run. Expect the conservative side to dominate production runs through 2026.

Treat the marketing FP4 FLOPS as a peak number, not a delivered number. Real-world training throughput in FP4 depends heavily on how much of the network actually runs in FP4: typically only the matmul-dominated layers do, which leaves 20-40% of the runtime untouched.
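The gap between peak and delivered throughput follows from Amdahl's law. The sketch below assumes that the runtime fraction is measured on an FP8 baseline and that the FP4 portion speeds up by the full 2× peak ratio, which is optimistic; the function name is illustrative.

```python
def end_to_end_speedup(fp4_fraction, kernel_speedup=2.0):
    """Amdahl's law: overall speedup when only `fp4_fraction` of runtime accelerates."""
    return 1.0 / ((1.0 - fp4_fraction) + fp4_fraction / kernel_speedup)


# 20-40% of the runtime untouched means 60-80% of it runs in FP4.
for fraction in (0.6, 0.8):
    print(f"{fraction:.0%} in FP4 -> {end_to_end_speedup(fraction):.2f}x")
# 60% in FP4 -> 1.43x
# 80% in FP4 -> 1.67x
```

Even under these favourable assumptions the end-to-end gain over FP8 stays well below the 2× suggested by the peak figures.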

Pitfalls

Microscaling block size and granularity interact with model structure — not every layer survives FP4 cleanly.
Custom kernels are often FP8-only and need re-tuning for FP4.
Public reference recipes are still scarce in 2026 — expect more configuration tuning than FP8 required.
FP4 inference does not imply FP4 training — validate the pretraining path separately before committing.