TL;DR
- Mixed precision keeps an FP32 'master' copy of weights for stable optimiser updates while running forward/backward matmuls in a narrower dtype (BF16, FP16, FP8, or FP4) for ~2x-8x compute throughput and ~2x memory savings on activations.
- BF16 (8 exp + 7 mantissa, FP32 exponent range) is the default for transformer training from 2021 onward; FP16 (5 exp + 10 mantissa, IEEE half) requires dynamic loss scaling; FP8 (E4M3/E5M2 via Transformer Engine) is the Hopper+ standard; FP4 (MXFP4) is the Blackwell standard.
- Hardware support: FP16 since NVIDIA Volta (2017); BF16 since Ampere (2020); FP8 since Hopper (2022); FP4/MXFP4 since Blackwell (2024). PyTorch autocast, Apex, NeMo, Megatron, FSDP, and DeepSpeed all expose mixed precision as a one-flag policy.
- Yobibyte training defaults BF16 on H100/H200/A100 and offers opt-in FP8 on Hopper and Blackwell; the Yobitel NeoCloud reference recipes pin LayerNorm, softmax, and gradient reduction at FP32 regardless of working dtype.
Overview
Single-precision (FP32) training works but wastes silicon: most neural-net math tolerates a narrower precision in the forward and backward passes, provided the optimiser update is performed at higher precision. Mixed precision exploits this contract: store a master copy of the weights in FP32, cast them to BF16/FP16 (or further down to FP8/FP4) for the matmul, and cast gradients back to FP32 for the optimiser step. Compute throughput goes up because, at peak, tensor cores deliver roughly 2x at FP16/BF16, 4x at FP8, and 8x at FP4 relative to their FP32-class (TF32) rate; activation memory shrinks roughly in proportion to the element width.
The technique was formalised by Micikevicius et al. (2017, arXiv:1710.03740) for FP16, then extended to BF16 once Google demonstrated that TPU training in BF16 worked without loss scaling. The 2022 "FP8 Formats for Deep Learning" paper (Micikevicius et al.) defined the E4M3 and E5M2 formats, and NVIDIA's Transformer Engine brought per-tensor scaled FP8 to Hopper; the 2024 Blackwell architecture added MXFP4 (microscaled FP4), which trains some workloads end-to-end at 4-bit precision with periodic FP32 amax updates. By 2026 BF16 mixed precision is the dominant transformer training regime, with FP8 augmenting it on Hopper / Blackwell and FP4 increasingly used for the largest pretraining runs where the wall-clock saving compounds across months.
On Yobitel NeoCloud, the H100 SXM5 (Hopper) reference fleet supports BF16 and FP8; the H200 SXM5 adds HBM3e capacity without changing the precision matrix; the B200 (Blackwell) reference fleet adds FP4/MXFP4. Yobibyte's managed training and fine-tune services default BF16 on every supported SKU (H100/H200/A100) and offer opt-in FP8 on Hopper and Blackwell — customers do not need to wire up Transformer Engine or amax handling themselves. NeoCloud customers self-operating Megatron-LM, NeMo, or FSDP2 keep full access to the FP8/FP4 surface with NVIDIA's pre-validated container images.
This entry helps you decide which precision (BF16 / FP16 / FP8 / FP4) fits your hardware and workload, how to compose precision with the optimiser, where the FP32 footholds are mandatory, and how to debug the precision-related failure modes (loss scale collapse, FP8 amax saturation, LayerNorm underflow) that show up only in mixed-precision regimes.
How it works
The standard mixed-precision recipe maintains three copies of state. (1) FP32 master weights, which live in the optimiser's parameter group and never get touched by forward/backward. (2) Working weights in the chosen narrow dtype (BF16, FP16, FP8, or FP4), cast from master at the start of each step or held as the canonical storage with FP32 reconstructed only inside the optimiser update. (3) FP32 optimiser state — Adam's first and second moments, which are too sensitive to tolerate narrow-precision storage. The forward and backward passes use the working weights and produce working-precision activations and gradients; the optimiser step uses FP32 throughout.
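The role of the FP32 master copy can be illustrated with a small simulation. This is a minimal sketch using only the Python standard library: `to_bf16` and `to_fp32` are hypothetical helpers that emulate rounding to each format, the "optimiser" is plain SGD with a constant update, and Adam's FP32 moments (the third copy of state) are left out for brevity.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Emulate BF16: keep the top 16 bits of the FP32 pattern (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)          # rounding bias
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def train(steps: int, lr_times_grad: float):
    master = to_fp32(1.0)       # FP32 master weight, owned by the optimiser
    working = to_bf16(master)   # BF16 working copy used by forward/backward
    naive = to_bf16(1.0)        # BF16-only weight with no master copy, for contrast
    for _ in range(steps):
        master = to_fp32(master - lr_times_grad)   # update applied in FP32
        working = to_bf16(master)                  # re-cast for the next step
        naive = to_bf16(naive - lr_times_grad)     # update applied directly in BF16
    return master, working, naive

master, working, naive = train(steps=100, lr_times_grad=1e-4)
print(f"master={master:.6f} working={working:.8f} naive={naive:.8f}")
```

In this toy setting each update is smaller than half the BF16 spacing near 1.0, so the BF16-only weight never moves, while the FP32 master accumulates the updates and the working copy follows it once the change is large enough to be representable.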
FP16 additionally requires loss scaling. The IEEE FP16 normal range is roughly 6e-5 to 6.5e4, with subnormals extending down to about 6e-8 at reduced precision. Many gradients in a transformer training step land below that range: they first lose precision as subnormals and then silently underflow to zero. Multiplying the loss by a constant (typically 2^10 to 2^15) before backward shifts those gradients into the representable range; the gradients are scaled back by the same constant before the optimiser step. Dynamic loss scaling adjusts the constant up when no overflow is detected for N steps and down when overflow is detected. PyTorch's GradScaler implements this; DeepSpeed's `fp16.initial_scale_power` configures the starting exponent.
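The loss-scaling loop described above can be sketched without a GPU. The example below is an illustration only, not PyTorch's `GradScaler`: `to_fp16`, `DynamicLossScaler`, and `step` are hypothetical names, FP16 rounding is emulated with the standard library's half-precision `struct` format, and the growth and backoff factors (2x up, 2x down) are assumptions.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round to IEEE FP16; values beyond the FP16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

class DynamicLossScaler:
    def __init__(self, init_scale: float = 2.0 ** 15, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, found_overflow: bool) -> None:
        if found_overflow:
            self.scale /= 2.0            # back off on overflow
            self.clean_steps = 0
        else:
            self.clean_steps += 1
            if self.clean_steps == self.growth_interval:
                self.scale *= 2.0        # grow after N overflow-free steps
                self.clean_steps = 0

def step(true_grads, scaler):
    """Emulate one scaled backward pass; return unscaled grads, or None if skipped."""
    scale = scaler.scale
    scaled = [to_fp16(g * scale) for g in true_grads]   # what FP16 backward would hold
    overflow = any(math.isinf(g) for g in scaled)
    scaler.update(overflow)
    if overflow:
        return None                                      # skip the optimiser step
    return [g / scale for g in scaled]                   # unscale in higher precision

scaler = DynamicLossScaler(init_scale=2.0 ** 15, growth_interval=3)
print(to_fp16(1e-8))            # unscaled: the gradient is lost in FP16
print(step([1e-8], scaler))     # scaled: the gradient survives the FP16 round trip
```

A skipped step leaves the weights untouched, which is why a scale that keeps backing off shows up as a flat loss curve rather than an error.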
BF16, with its FP32-equivalent 8-bit exponent and only 7 mantissa bits, does not need loss scaling: any gradient magnitude that FP32 can represent stays within BF16's range, so underflow is not a practical concern. The 7 mantissa bits give a relative precision of about 2^-8 (roughly 4e-3, or two to three decimal digits) per value, which is sufficient for matmul accumulation when the accumulator is upcast to FP32. The trade-off is that BF16 is noisier per individual matmul output than FP16 with loss scaling, but the noise averages out across the millions of summands in any meaningful matmul. This is the reason BF16 became the default once Ampere shipped it natively.
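The range-versus-precision trade between the two 16-bit formats can be checked directly. This is a small sketch with hypothetical helpers (`to_fp16`, `to_bf16`) that emulate rounding using the standard library; it does not touch any GPU code path.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round to IEEE FP16 (5 exponent bits, 10 mantissa bits); overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

def to_bf16(x: float) -> float:
    """Emulate BF16 (8 exponent bits, 7 mantissa bits) by rounding the FP32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

for value in (1e-8, 1e5, 1.001):
    print(f"{value:>8g}  fp16={to_fp16(value)!r:<22} bf16={to_bf16(value)!r}")
```

The tiny and large values survive in BF16 but not in FP16 (range), while the value close to 1.0 keeps its fractional part in FP16 but not in BF16 (precision).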
FP8 uses two formats. E4M3 (4 exp + 3 mantissa, max 448) is used for the forward path, where dynamic range is tighter. E5M2 (5 exp + 2 mantissa, max 57344) is used for the backward path, where gradients span a wider range. Transformer Engine (NVIDIA's library) maintains a per-tensor amax history, the maximum absolute value of each tensor recorded over a sliding window of recent steps, and computes a per-tensor scaling factor that maps the tensor into the representable FP8 range. The scaling factor is recomputed from that history as training proceeds; `--fp8-amax-history-len 1024` sets the window length to 1024 steps (it is not an update interval), so the FP8 representation tracks the workload's actual dynamic range.
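The amax-history mechanism can be sketched in a few lines. This is a simplified illustration and not Transformer Engine's implementation: `quantize_e4m3` and `DelayedScaler` are hypothetical names, E4M3 rounding is emulated in pure Python, and the scale is taken as `448 / max(history)` with no margin.

```python
import math
from collections import deque

E4M3_MAX = 448.0

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 value (3 mantissa bits, min normal 2**-6), saturating at +/-448."""
    if x == 0.0:
        return 0.0
    exp = max(math.frexp(abs(x))[1] - 1, -6)   # floor(log2|x|), clamped at the subnormal exponent
    spacing = 2.0 ** (exp - 3)                 # distance between neighbouring values in this binade
    q = round(x / spacing) * spacing
    return max(-E4M3_MAX, min(E4M3_MAX, q))

class DelayedScaler:
    """Per-tensor scale derived from the amax values seen in earlier steps."""

    def __init__(self, history_len: int = 1024):
        self.amax_history = deque(maxlen=history_len)   # sliding window of recent amax values

    def scale(self) -> float:
        amax = max(self.amax_history, default=0.0)
        return E4M3_MAX / amax if amax > 0.0 else 1.0

    def quantize(self, tensor):
        s = self.scale()                                      # uses history only, not this step
        q = [quantize_e4m3(v * s) for v in tensor]
        self.amax_history.append(max(abs(v) for v in tensor))  # record amax for later steps
        return q, s

scaler = DelayedScaler(history_len=4)
activations = [0.0005, 0.02, 0.3]
for step_idx in range(2):
    q, s = scaler.quantize(activations)
    print(step_idx, s, [v / s for v in q])    # dequantised values
```

On the first step there is no history, so the scale is 1 and the smallest activation flushes to zero; on the second step the scale maps the tensor's amax to 448 and the small value survives.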
FP4 (MXFP4 specifically, the microscaling format standardised in 2023) groups 32 consecutive FP4 (E2M1) values and assigns them a shared 8-bit scale, which in the OCP specification is a power-of-two exponent (E8M0) rather than a general FP8 number. The format trades single-value precision for compactness; the working assumption is that nearby values in a tensor have similar magnitudes, so a shared scale is good enough. Blackwell's tensor cores execute MXFP4 natively. FP4 training is still at the research frontier: production usage in 2026 is concentrated on the very largest pretraining runs, where the wall-clock saving justifies the additional tuning cost.
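The block-scaling idea can be sketched as follows. This assumes the OCP Microscaling convention of a shared power-of-two (E8M0) scale chosen as `floor(log2(amax)) - emax`, with E2M1 elements; the function names are hypothetical, ties are broken toward the smaller magnitude for simplicity, and out-of-range values saturate.

```python
import math

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)   # non-negative E2M1 magnitudes
E2M1_EMAX = 2                                          # largest E2M1 binade starts at 2**2
BLOCK = 32

def quantize_mxfp4_block(block):
    """Return (shared_exp, fp4_values) for one block of 32 floats."""
    assert len(block) == BLOCK
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * BLOCK
    shared_exp = math.frexp(amax)[1] - 1 - E2M1_EMAX   # floor(log2(amax)) - emax
    scale = 2.0 ** shared_exp
    fp4 = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])                # saturate at 6.0
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))    # snap to the E2M1 grid
        fp4.append(math.copysign(nearest, v))
    return shared_exp, fp4

def dequantize(shared_exp, fp4):
    return [v * 2.0 ** shared_exp for v in fp4]

block = [0.01 * (i + 1) for i in range(BLOCK)]   # magnitudes from 0.01 to 0.32
shared_exp, fp4 = quantize_mxfp4_block(block)
print(shared_exp, fp4[:4], dequantize(shared_exp, fp4)[-1])
```

Because the scale is set by the largest value in the block, the smallest entries here snap to zero, which is the precision cost the shared scale accepts.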
Across all four formats, the FP32 footholds are non-negotiable: master weights, optimiser state, loss accumulation, LayerNorm and RMSNorm computations, softmax over attention scores, gradient reduction (the cross-rank ReduceScatter or AllReduce), and gradient clipping. Most frameworks handle these automatically: PyTorch autocast applies a per-op cast policy that runs ops such as softmax and layer norm in FP32, Transformer Engine inserts FP32 amax computation alongside the FP8 matmul, and FSDP's `MixedPrecisionPolicy(reduce_dtype=torch.float32)` keeps the gradient reduction stable.
- Three storage tiers: FP32 master weights, narrow working weights (BF16/FP16/FP8/FP4), FP32 optimiser state.
- FP16 requires dynamic loss scaling because the 5-bit exponent cannot represent small gradients without underflow.
- BF16 has FP32 exponent range, no loss scaling needed; the 7 mantissa bits suffice when the matmul accumulator is FP32.
- FP8 (E4M3 forward, E5M2 backward) uses per-tensor amax-based scaling via Transformer Engine.
- FP4 (MXFP4) groups 32 E2M1 values with a shared 8-bit power-of-two (E8M0) scale; Blackwell-native.
- FP32 footholds mandatory: master weights, optimiser state, LayerNorm, softmax, gradient reduction, gradient clipping.
- Hardware capability: FP16 on Volta+; BF16 on Ampere+; FP8 on Hopper+; FP4 on Blackwell+.
Accumulating in FP16 (rather than upcasting partial sums to FP32 inside the matmul kernel) silently degrades training. NVIDIA Tensor Cores upcast by default on Volta/Ampere/Hopper, but custom CUDA kernels and some quantised inference paths skip the upcast. Verify your matmul kernel's accumulator dtype before deciding the loss curve is correct.
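The effect of the accumulator dtype can be reproduced with a toy dot product. This is a sketch under stated assumptions: `to_fp16`, `to_fp32`, and `dot` are hypothetical helpers, rounding is emulated with the standard library, and the sum is sequential, which is harsher than the tree-style reductions real kernels often use.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round to IEEE FP16; overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

def to_fp32(x: float) -> float:
    """Round to IEEE FP32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot(a, b, acc_cast):
    """Dot product whose running sum is rounded by acc_cast after every add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = acc_cast(acc + x * y)
    return acc

n = 10_000
a = [to_fp16(1.0)] * n
b = [to_fp16(0.001)] * n          # both inputs are already FP16 values

fp16_acc = dot(a, b, to_fp16)     # partial sums kept in FP16
fp32_acc = dot(a, b, to_fp32)     # partial sums upcast to FP32
exact = math.fsum(x * y for x, y in zip(a, b))
print(fp16_acc, fp32_acc, exact)
```

With an FP16 accumulator the running sum stalls once each new product is smaller than half the spacing between neighbouring FP16 values, so the result lands far below the true sum; the FP32 accumulator tracks it closely.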
Variants and format trade-offs
The table below summarises the precision formats relevant for training in 2026, with the bit allocation, dynamic range, scaling requirement, and the hardware that supports it.
| Format | Exp / Mantissa | Dynamic range | Scaling | Hardware (NVIDIA) |
|---|---|---|---|---|
| FP32 | 8 / 23 | ~1.2e-38 to 3.4e38 | Not needed | Universal |
| TF32 (tensor cores) | 8 / 10 | ~1.2e-38 to 3.4e38 | Not needed | Ampere+ tensor cores |
| FP16 (IEEE half) | 5 / 10 | ~6e-5 to 6.5e4 | Dynamic loss scaling required | Volta (2017) + |
| BF16 (Brain Float) | 8 / 7 | ~1.2e-38 to 3.4e38 | Not needed | Ampere (2020) + |
| FP8 E4M3 | 4 / 3 | ~1.6e-2 to 448 (subnormals to ~2e-3) | Per-tensor amax scaling | Hopper (2022) + |
| FP8 E5M2 | 5 / 2 | ~6e-5 to 5.7e4 | Per-tensor amax scaling | Hopper (2022) + |
| FP4 (MXFP4) | 2 / 1 (E2M1) | 0.5 to 6 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) | Per-group (32 values) shared E8M0 power-of-two scale | Blackwell (2024) + |
| INT8 (weight-only) | 8-bit integer | ±127 with FP scale | Per-channel or per-group | Universal (inference) |
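The range column follows from the bit layout of each format. The helper below is a minimal sketch (the function name and the `reserved` argument are hypothetical) that derives the smallest subnormal, smallest normal, and largest finite value from the exponent and mantissa widths, assuming the usual bias of `2**(exp_bits - 1) - 1`.

```python
def fmt_range(exp_bits: int, man_bits: int, reserved: str = "ieee"):
    """Return (min_subnormal, min_normal, max_finite) for a binary float format.

    reserved="ieee":     the all-ones exponent encodes inf/NaN (FP32, FP16, BF16, E5M2)
    reserved="nan_only": only the all-ones mantissa at the top exponent is NaN (E4M3)
    reserved="none":     no inf/NaN encodings at all (E2M1)
    """
    bias = 2 ** (exp_bits - 1) - 1
    min_normal = 2.0 ** (1 - bias)
    min_subnormal = min_normal * 2.0 ** -man_bits
    top = 2 ** exp_bits - 1                       # all-ones exponent field
    if reserved == "ieee":
        max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** (top - 1 - bias)
    elif reserved == "nan_only":
        max_finite = (2 - 2 * 2.0 ** -man_bits) * 2.0 ** (top - bias)
    else:
        max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** (top - bias)
    return min_subnormal, min_normal, max_finite

formats = {
    "FP32": (8, 23, "ieee"),
    "FP16": (5, 10, "ieee"),
    "BF16": (8, 7, "ieee"),
    "FP8 E4M3": (4, 3, "nan_only"),
    "FP8 E5M2": (5, 2, "ieee"),
    "FP4 E2M1": (2, 1, "none"),
}
for name, spec in formats.items():
    sub, normal, largest = fmt_range(*spec)
    print(f"{name:9s} min_subnormal={sub:.3g} min_normal={normal:.3g} max={largest:.6g}")
```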
When to use vs alternatives
Use BF16 mixed precision by default for transformer training on Ampere, Hopper, or Blackwell. There is no modern training scenario where pure FP32 is the right choice; BF16 matches FP32 quality on almost every transformer recipe with 2x throughput and 2x activation-memory savings. Yobibyte's managed fine-tune service defaults BF16 on every supported SKU (H100/H200/A100), and the Yobitel NeoCloud pre-validated containers ship BF16 recipes for Llama, Qwen, Mistral, and DeepSeek-V3 derivatives.
Use FP16 only when forced by hardware (Volta or Turing without BF16 support) or by a specific framework constraint. The dynamic-loss-scaling complexity is no longer worth it on Ampere or newer. Computer-vision workloads with small dynamic range sometimes tolerate FP16 better than transformers, but BF16 is still the safer default.
Stack FP8 on top of BF16 on Hopper or Blackwell for an additional 1.5-1.8x throughput uplift on dense transformer training. FP8 is opt-in on Yobibyte (the customer toggles a flag in the recipe) because some workloads — small models below 7B, specific MoE configurations, training runs with very long sequences — see quality regressions that FP8's amax-based scaling does not fully absorb. The Yobitel NeoCloud reference recipes pin FP8 at `hybrid` mode (E4M3 forward, E5M2 backward) with `--fp8-amax-history-len 1024` and have run reproducibly to 70B scale.
Use FP4/MXFP4 on Blackwell for the very largest pretraining runs where the additional ~1.5x throughput uplift compounds across months. FP4 production usage in 2026 is still concentrated on frontier-scale runs; it requires careful loss-curve validation and is not a default for ordinary fine-tuning. Yobibyte does not yet expose FP4 in the customer-facing managed recipe surface as of 2026; NeoCloud self-operating customers running Megatron-LM with Transformer Engine can opt in.
The right ladder for a new training project in 2026 is: BF16 default, FP8 after the loss curve is validated end-to-end at BF16, FP4 only if the cluster economics require it. Skipping straight to FP8 or FP4 makes debugging precision-related divergence much harder because you can't tell whether the issue is the recipe or the precision.
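The ladder can be written down as a small decision helper. This is only a sketch of the guidance in this section: the function name, the architecture strings, and the two boolean inputs are hypothetical, and the result is a suggestion for what to try next rather than a guarantee of quality.

```python
def next_precision_to_try(arch: str,
                          bf16_baseline_validated: bool = False,
                          frontier_scale_run: bool = False) -> str:
    """Suggest a training precision following the BF16 -> FP8 -> FP4 ladder."""
    arch = arch.lower()
    if arch in ("volta", "turing"):
        return "fp16+loss_scaling"          # no BF16 support: FP16 is forced by hardware
    if arch == "ampere":
        return "bf16"                       # BF16 available, FP8 is not
    if arch in ("hopper", "blackwell"):
        if not bf16_baseline_validated:
            return "bf16"                   # validate the loss curve at BF16 first
        if arch == "blackwell" and frontier_scale_run:
            return "fp4"                    # only when cluster economics require it
        return "fp8"
    raise ValueError(f"unknown architecture: {arch!r}")

print(next_precision_to_try("hopper"))
print(next_precision_to_try("hopper", bf16_baseline_validated=True))
print(next_precision_to_try("blackwell", bf16_baseline_validated=True, frontier_scale_run=True))
```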
Trade-offs and known limitations
- FP16 loss-scale tuning is error-prone: under persistent overflow the scaler keeps halving the scale and skipping steps, so the scale collapses toward zero and learning silently halts. BF16 sidesteps this entirely.
- FP8 amax saturation: persistent amax saturation indicates the per-tensor scale is clipping; quality drops slowly without an obvious symptom. Monitor `fp8_amax_history_max` against the format's representable max.
- FP4 quality is workload-dependent — some architectures (deep narrow networks) and some loss surfaces (RLHF reward modelling) show outsized degradation. Validate against a BF16 baseline before committing.
- LayerNorm and softmax must run in FP32. Most frameworks handle this automatically (PyTorch autocast, Transformer Engine), but custom kernels need explicit casts. RMSNorm has similar requirements. FSDP's `MixedPrecisionPolicy.reduce_dtype` covers gradient reduction only, not these ops.
- Accumulating in narrow precision (rather than upcasting partial sums to FP32) silently degrades training; check the matmul kernel's accumulator dtype.
- Gradient clipping and weight decay should reference FP32 master weights, not the narrow working copy.
- Quantisation-aware training (QAT) for FP8/FP4 sometimes diverges if the amax history is too short; the `--fp8-amax-history-len` value of 1024 steps used in this entry's recipes is a starting point, not a guarantee.
- The mixed-precision regime interacts with the FSDP / DeepSpeed reduce dtype: setting `reduce_dtype=torch.bfloat16` inside an already-noisy BF16 recipe can compound numerical error; FP32 reduction is the safe default.
Practical implementation notes
PyTorch exposes mixed precision through `torch.amp.autocast("cuda", dtype=torch.bfloat16)` (the device type is a required first argument) and `torch.amp.GradScaler("cuda")` for FP16 loss scaling; the older `torch.cuda.amp.GradScaler` spelling is deprecated in recent releases. For FSDP2, `MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)` is the production default. For DeepSpeed, the `bf16.enabled: true` block in the JSON config handles BF16 throughout, and `fp16.enabled: true` with `initial_scale_power: 16` is the FP16 fallback. Megatron-LM exposes `--bf16` and `--fp16` flags; for FP8, add `--fp8-format hybrid --fp8-amax-history-len 1024 --transformer-impl transformer_engine`. NeMo wraps the same surface with Hydra configs.
On Yobitel NeoCloud, the pre-validated training containers (FSDP2, DeepSpeed, Megatron-LM, NeMo) ship with BF16 as the recipe default and FP8 as a one-flag toggle on Hopper / Blackwell. Yobibyte's managed fine-tune service defaults BF16 on H100/H200/A100 and exposes FP8 as a per-recipe option behind a customer toggle; FP4 is not yet exposed in the managed surface. The platform handles Transformer Engine and amax history automatically for managed customers, and pins the FP32 footholds (LayerNorm, softmax, reduction) without customer configuration.
- PyTorch FSDP2: `MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)`.
- DeepSpeed: `{"bf16": {"enabled": true}}` in the config JSON.
- Megatron-LM: `--bf16` for BF16; `--bf16 --fp8-format hybrid --transformer-impl transformer_engine` for FP8 on Hopper.
- Transformer Engine: `te.fp8_autocast(enabled=True, fp8_recipe=DelayedScaling(margin=0, amax_history_len=1024))`.
- Hugging Face Trainer: `TrainingArguments(bf16=True)` for training; add `bf16_full_eval=True` to also run evaluation in BF16.
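- JAX: `jnp.bfloat16` as the compute dtype, e.g. `flax.linen.Dense(features, dtype=jnp.bfloat16)` with `param_dtype` left at its `jnp.float32` default so the parameters stay in FP32.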
- Observability: track loss-scale value (FP16), grad-norm, fp8_amax_history_max, and per-step iteration time — all four detect precision-related issues before quality drops show up.
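As one concrete instance of the settings above, the DeepSpeed precision block can be generated from a single capability flag. This is a hedged sketch: the helper name and the `supports_bf16` argument are hypothetical, and `"loss_scale": 0` is used on the assumption that it selects DeepSpeed's dynamic loss scaling.

```python
import json

def deepspeed_precision_block(supports_bf16: bool) -> dict:
    """Return the precision section of a DeepSpeed config for the given hardware."""
    if supports_bf16:
        # Ampere or newer: BF16 throughout, no loss scaling needed.
        return {"bf16": {"enabled": True}}
    # Older GPUs: FP16 fallback with dynamic loss scaling starting at 2**16.
    return {"fp16": {"enabled": True, "loss_scale": 0, "initial_scale_power": 16}}

config = {"train_micro_batch_size_per_gpu": 1}
config.update(deepspeed_precision_block(supports_bf16=True))
print(json.dumps(config, indent=2))
```

The two branches are kept mutually exclusive because a config should enable only one of the `bf16` and `fp16` blocks.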
Where mixed precision sits in the Yobitel stack
Mixed precision is the precision contract on every Yobitel NeoCloud training pod and every Yobibyte managed training and fine-tune deployment. The reference recipes default BF16 on H100/H200/A100 with FP32 master weights, FP32 optimiser state, FP32 LayerNorm/softmax, and FP32 gradient reduction — the precision footholds that keep training numerically stable. FP8 is opt-in on Hopper and Blackwell (one flag in Yobibyte's recipe surface, three flags in Megatron-LM); FP4 stays NeoCloud-self-operating only as of 2026.
Yobibyte's managed fine-tune service handles Transformer Engine wiring, amax history, and the FP32 footholds for customers — the precision toggle is one boolean in the workspace API. NeoCloud customers who self-operate (Megatron-LM, NeMo, FSDP2, DeepSpeed) keep full access to the precision matrix and the NVIDIA-validated container images with Transformer Engine pre-installed. Across both paths, the precision choice never crosses sovereignty boundaries: BF16 weights stay in the customer's sovereign region throughout the training lifecycle, and the FP8/FP4 amax statistics are treated as part of the same data class as the weights themselves.
References
- Mixed Precision Training · arXiv (Micikevicius et al., 2017)
- FP8 Formats for Deep Learning · arXiv (Micikevicius et al., 2022)
- PyTorch automatic mixed precision · PyTorch
- NVIDIA Transformer Engine documentation · NVIDIA
- NVIDIA Mixed Precision Training Best Practices · NVIDIA
- OCP Microscaling Formats (MXFP4 / MXFP8) · Open Compute Project (2023)