TL;DR
- Open-source training framework from NVIDIA Applied Deep Learning Research, first released alongside the 2019 Megatron-LM paper (Shoeybi et al., arXiv:1909.08053). Apache 2.0 with NVIDIA-specific clauses; hosted at github.com/NVIDIA/Megatron-LM.
- Origin of tensor parallelism, sequence parallelism, selective activation recomputation, and interleaved 1F1B pipeline parallelism as practised today; refactored into Megatron Core, the library form embedded in NeMo Framework, NeMo-Aligner, NVIDIA Nemotron training, and most NVIDIA-partner pretraining stacks.
- Composes five parallelism dimensions — data, tensor, pipeline, sequence, context — plus Distributed Adam (ZeRO-1-style) and Transformer Engine FP8 / FP4 kernels. Sustains 40-60 percent of theoretical peak FLOPs on H100 clusters and has been demonstrated at 16,384 H100 scale.
- The codebase used (directly or via NeMo) for training GPT-3-class to GPT-4-class open models — Megatron-Turing NLG 530B, Llama-3 derivatives, Nemotron-4 340B, Falcon, BLOOM derivatives — and the empirical reference for the parallelism-strategy choices made by every other large-scale training framework.
Overview
Megatron-LM is both a paper series (Shoeybi 2019, Narayanan 2021, Korthikanti 2022) and an open-source repository at github.com/NVIDIA/Megatron-LM. The first paper introduced tensor parallelism and trained an 8.3B-parameter model with 8-way tensor parallelism on 512 V100 GPUs. The second introduced interleaved 1F1B pipeline parallelism and demonstrated near-linear weak scaling to 3,072 A100s for a 1T-parameter target. The third added sequence parallelism and selective activation recomputation, removing the last big activation-memory hotspot: activation memory fell by about 5x and the execution-time overhead of recomputation by more than 90 percent.
Functionally, Megatron-LM is a CUDA-Python codebase wrapping PyTorch with custom autograd-aware collectives (NCCL AllReduce / AllGather / ReduceScatter / P2P), a Transformer Engine integration for FP8 and FP4 math on Hopper and Blackwell, a Distributed Adam optimiser that shards moment state across the DP group (ZeRO-1 in spirit), and a corpus of launch scripts under `examples/` that show how the pieces compose for GPT, BERT, T5, RETRO, Llama, Mistral, Mixtral and DeepSeek-style architectures.
By 2026 Megatron Core (the refactored library form, `megatron.core`) is the embedded engine inside NVIDIA NeMo Framework, NeMo-Aligner for SFT/DPO/RLHF, the NVIDIA Nemotron training pipeline, Colossal-AI's high-performance path, and many internal lab forks. If your training run is north of ~30B parameters and on NVIDIA hardware, you are almost certainly running Megatron primitives, either directly or one wrapper away. Yobitel NeoCloud customers training 70B+ models commonly use Megatron-LM (or NeMo, which wraps Megatron Core) as the default pretraining engine on multi-node H100, H200, and B200 training pods in the UK and EU regions.
This entry documents the production surface: the CLI and flag set, the four parallelism axes plus optimiser sharding, the data-pipeline contract, sizing tables at the common scales (70B / 175B / 405B), the recommended Hopper and Blackwell recipes, and the migration paths from FSDP, DeepSpeed and NeMo. This entry helps you choose and operate Megatron-LM for training pods on Yobitel NeoCloud or your own multi-GPU cluster.
Quick start
The example below pretrains a GPT-3-style 1.3B model on 8x H100 SXM5 using tensor parallelism within the node, BF16 weights, FlashAttention-3, sequence parallelism on the TP group, and the Megatron Distributed Adam optimiser. The first block clones Megatron-LM, installs apex and Transformer Engine, and preprocesses a sample corpus into the indexed binary format Megatron expects. The second block launches the pretraining run with `torchrun`. The third block converts the resulting checkpoint to HuggingFace format for downstream serving.
```bash
# 1. Clone Megatron-LM and prepare a sample indexed dataset
git clone https://github.com/NVIDIA/Megatron-LM && cd Megatron-LM
pip install -r requirements.txt
pip install "transformer-engine[pytorch]"   # CUDA 12.4+
# NVIDIA Apex is not the PyPI package named "apex"; build it from source
# (github.com/NVIDIA/apex) against the same CUDA toolkit.
python tools/preprocess_data.py \
    --input ./data/oscar-sample.jsonl \
    --output-prefix ./data/oscar-sample \
    --vocab-file ./vocab/gpt2-vocab.json \
    --merge-file ./vocab/gpt2-merges.txt \
    --tokenizer-type GPT2BPETokenizer \
    --workers 32 --append-eod
# Produces oscar-sample_text_document.{bin,idx}
```

```bash
# 2. Pretrain a 1.3B GPT on 8x H100 with TP=2, DP=4, BF16, FA3, SP
GPUS_PER_NODE=8
torchrun --nproc_per_node=$GPUS_PER_NODE \
    pretrain_gpt.py \
    --num-layers 24 --hidden-size 2048 --num-attention-heads 16 \
    --seq-length 4096 --max-position-embeddings 4096 \
    --micro-batch-size 4 --global-batch-size 256 \
    --train-iters 50000 --lr 2.0e-4 --min-lr 2.0e-5 \
    --lr-decay-style cosine --lr-warmup-iters 2000 \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 1 \
    --sequence-parallel \
    --use-flash-attn \
    --bf16 \
    --use-distributed-optimizer \
    --recompute-granularity selective \
    --data-path ./data/oscar-sample_text_document \
    --vocab-file ./vocab/gpt2-vocab.json \
    --merge-file ./vocab/gpt2-merges.txt \
    --save ./checkpoints/gpt-1b3 --load ./checkpoints/gpt-1b3 \
    --save-interval 5000 --log-interval 10 \
    --tensorboard-dir ./tb
```

```bash
# 3. Convert the Megatron checkpoint to HuggingFace format for serving
python tools/checkpoint/convert.py \
    --model-type GPT --loader megatron --saver llama_mistral \
    --load-dir ./checkpoints/gpt-1b3 \
    --save-dir ./hf/gpt-1b3-hf \
    --tokenizer-model ./vocab/gpt2-vocab.json
```

Use `--use-distributed-optimizer` on day one. It shards Adam moment state across the DP group and is essentially free; without it, the two FP32 Adam moments alone cost roughly 8 bytes per parameter on every rank, and about 12 bytes once the FP32 master weights are counted.
How it works
Megatron organises a training run around four orthogonal parallelism axes that compose multiplicatively. World size = DP x TP x PP x CP (sequence parallel is a sub-mode of TP, not a separate axis). Each axis trades a different cost: TP buys per-layer compute scaling at the cost of high-bandwidth intra-block AllReduce; PP buys layer-count scaling at the cost of pipeline-bubble idle time; DP buys batch-size scaling at the cost of one gradient AllReduce per step; CP buys sequence-length scaling at the cost of attention-time P2P traffic.
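The multiplicative relationship can be checked with a few lines of arithmetic. The sketch below is illustrative only: `ParallelLayout` and its method names are hypothetical helpers, not part of Megatron-LM. It derives the data-parallel size and the number of micro-batches per optimiser step from the world size and the model-parallel sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper: world size = DP x TP x PP x CP."""
    world_size: int
    tp: int = 1
    pp: int = 1
    cp: int = 1

    @property
    def dp(self) -> int:
        """Data-parallel size implied by the other three axes."""
        model_parallel = self.tp * self.pp * self.cp
        if self.world_size % model_parallel != 0:
            raise ValueError(
                f"world size {self.world_size} is not divisible by "
                f"TP*PP*CP = {model_parallel}"
            )
        return self.world_size // model_parallel

    def grad_accum_steps(self, global_batch: int, micro_batch: int) -> int:
        """Micro-batches each DP replica processes per optimiser step."""
        per_pass = micro_batch * self.dp
        if global_batch % per_pass != 0:
            raise ValueError(
                f"global batch {global_batch} is not divisible by "
                f"micro_batch*DP = {per_pass}"
            )
        return global_batch // per_pass


# 32 nodes x 8 GPUs with TP=8 and PP=4 leaves DP=8.
layout = ParallelLayout(world_size=256, tp=8, pp=4)
print(layout.dp, layout.grad_accum_steps(global_batch=1024, micro_batch=1))
```

Raising an error on a non-divisible shape is a deliberate choice here: a layout that does not factor cleanly is a configuration mistake rather than something to round away.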
Inside the forward pass, the standard transformer block becomes: (1) column-parallel QKV projection, where every TP rank computes a slice; (2) FA3 attention with the attention heads sharded across the TP group; (3) row-parallel output projection followed by an AllReduce (replaced by ReduceScatter + AllGather when sequence parallel is on); (4) column-parallel up-projection; (5) GELU/SwiGLU; (6) row-parallel down-projection with the same collective. Sequence parallelism shards the LayerNorm and dropout activations along the sequence dimension and replaces the per-block AllReduce with AllGather + ReduceScatter: the total bytes moved are the same, but each rank holds only 1/TP of the activations that were previously replicated across the TP group.
Across nodes, pipeline parallelism partitions the layer stack into K stages. Plain 1F1B leaves a bubble fraction of (K-1)/(M+K-1), where M is the number of micro-batches per step; the interleaved 1F1B schedule chops each stage into v virtual sub-stages and divides the bubble time by v, giving (K-1)/(vM+K-1). A 405B model on 1,024 H100s with TP=8, PP=16 and M=128 sits at about 10 percent with plain 1F1B and drops below 5 percent once v is 3 or more. P2P sends/receives between adjacent stages are NCCL-backed and overlap with the next forward / backward, which is why PP tolerates 400 Gb/s InfiniBand bandwidth where TP demands NVLink.
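The bubble arithmetic is easy to get wrong by hand, so a small sketch may help. It assumes the standard 1F1B expression (K-1)/(M+K-1) and the interleaved-schedule result from Narayanan et al. (2021), in which v virtual stages per rank divide the bubble time by v. The function name `bubble_fraction` is hypothetical and not a Megatron-LM API.

```python
def bubble_fraction(pp_stages: int, micro_batches: int, virtual_stages: int = 1) -> float:
    """Fraction of a training step a pipeline rank spends idle.

    Plain 1F1B:            (K - 1) / (M + K - 1)
    Interleaved, v chunks: (K - 1) / (v * M + K - 1)
    """
    if min(pp_stages, micro_batches, virtual_stages) < 1:
        raise ValueError("all arguments must be >= 1")
    k, m, v = pp_stages, micro_batches, virtual_stages
    return (k - 1) / (v * m + k - 1)


# PP=16 with 128 micro-batches per step, for increasing virtual-stage counts.
for v in (1, 2, 3, 4):
    print(v, f"{bubble_fraction(16, 128, v):.3f}")
```

The interleaved schedule is not free: it multiplies the number of P2P exchanges between stages by v, so the right value depends on how well those sends overlap with compute on the fabric in use.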
The Distributed Adam optimiser (ZeRO-1 in DeepSpeed nomenclature) shards the FP32 master weights and the FP32 first-moment and second-moment buffers across the DP group. Each rank computes the update for its 1/DP slice and AllGathers the updated parameters. This drops optimiser-state memory from ~12 bytes/param per rank to ~12/DP. For a 70B model with DP=32, that is about 26 GB instead of 840 GB, counted across one model replica before TP and PP sharding divide it further.
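As a rough check on the optimiser-state numbers, the sketch below assumes 12 bytes per parameter (FP32 master weights plus two FP32 Adam moments) and deliberately ignores TP and PP sharding, which reduce the per-rank figure further. `optimizer_state_gb` is a hypothetical helper, not a Megatron-LM function.

```python
def optimizer_state_gb(params: float, dp: int = 1, sharded: bool = True,
                       bytes_per_param: int = 12) -> float:
    """Optimiser-state size in GB (1 GB = 1e9 bytes) for one model replica.

    sharded=True  -> state is split across the DP group (distributed optimiser)
    sharded=False -> every DP rank keeps the full state
    """
    if dp < 1:
        raise ValueError("dp must be >= 1")
    total_bytes = params * bytes_per_param
    if sharded:
        total_bytes /= dp
    return total_bytes / 1e9


print(optimizer_state_gb(70e9, dp=32, sharded=False))  # replicated state
print(optimizer_state_gb(70e9, dp=32, sharded=True))   # distributed optimiser
```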
- Tensor parallel group: shards weights of one transformer block, AllReduce per block. TP=2/4/8 inside one NVLink island.
- Pipeline parallel group: layers split into K stages, P2P sends across stages. Interleaved 1F1B for minimal bubble.
- Sequence parallel: TP-group sub-mode; shards LayerNorm/dropout activations along sequence dim. Free at long context.
- Context parallel: shards the sequence itself across CP ranks; required at L > ~64K.
- Data parallel: replicates everything else; AllReduce gradients each step. Distributed Adam shards the optimiser inside the DP group.
- Transformer Engine: BF16/FP16 default, FP8 (E4M3/E5M2) on Hopper and Blackwell, FP4 (MXFP4) on Blackwell.
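- Selective recomputation: only the attention-internal activations, which are large but cheap to recompute, are dropped; a few percent FLOPs overhead (versus roughly 30-40 percent for full recomputation) and, combined with sequence parallelism, about 5x activation-memory savings.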
The empirical rule from the 2021 Narayanan paper still holds: pick TP to fill the NVLink island (TP=8 on a DGX H100), then PP to fit the model in aggregate memory, then DP for the rest. Never use TP across InfiniBand; never use PP without enough micro-batches to keep the bubble small.
Reference and specifications
Megatron-LM exposes its surface as a flag set on the `pretrain_gpt.py` / `pretrain_bert.py` / `pretrain_t5.py` entry-point scripts. The table below is the canonical reference for the flags that govern parallelism, precision, memory, and the optimiser as of Megatron-LM 0.8 / Megatron Core 0.11 (June 2026). Flags marked with an asterisk are also available as `TransformerConfig` fields in the Megatron Core library API.
| Flag | Type | Default | Description |
|---|---|---|---|
| --num-layers * | int | (required) | Number of transformer blocks in the model. |
| --hidden-size * | int | (required) | Hidden dimension d_model. |
| --num-attention-heads * | int | (required) | Total attention heads (must divide hidden-size and be divisible by the TP size). |
| --num-query-groups * | int | = heads | Number of KV heads for GQA / MQA. 1 = MQA, < heads = GQA. |
| --ffn-hidden-size * | int | 4 * hidden | Intermediate MLP dimension. |
| --seq-length * | int | (required) | Training sequence length in tokens. |
| --max-position-embeddings * | int | = seq-length | Position-embedding table size; relevant for absolute PE. |
| --position-embedding-type * | string | learned_absolute | One of learned_absolute, rope, alibi, none. |
| --rotary-percent * | float | 1.0 | Fraction of head_dim that gets RoPE rotation. |
| --swiglu * | bool | false | Use SwiGLU MLP (Llama / Mistral); raises FFN compute ~1.5x. |
| --normalization * | string | LayerNorm | One of LayerNorm, RMSNorm. |
| --tensor-model-parallel-size * | int | 1 | TP group size; shards each block's weights across N GPUs via NCCL. |
| --pipeline-model-parallel-size * | int | 1 | PP stage count; partitions layers across stages. |
| --virtual-pipeline-model-parallel-size | int | (off) | Interleaved 1F1B virtual stages; reduces bubble at high PP. |
| --context-parallel-size * | int | 1 | Shards the sequence dimension across CP ranks for very long context. |
| --sequence-parallel * | bool | false | Sharded LayerNorm/dropout activations within the TP group. |
| --expert-model-parallel-size * | int | 1 | MoE expert parallelism; shards experts across the EP group. |
| --num-experts * | int | (off) | Total MoE experts; presence enables Mixtral-style routing. |
| --moe-router-topk * | int | 2 | Top-K routing for MoE. |
| --micro-batch-size * | int | (required) | Per-rank batch within one forward; small (1-8) for big models. |
| --global-batch-size * | int | (required) | Global batch across DP; sets gradient-accumulation count. |
| --train-iters | int | (required) | Number of optimiser steps to train for. |
| --lr / --min-lr | float | (required) | Peak and floor learning rate for the schedule. |
| --lr-decay-style | string | linear | One of linear, cosine, inverse-square-root, constant. |
| --lr-warmup-iters | int | 0 | Linear warmup steps before the main schedule kicks in. |
| --clip-grad | float | 1.0 | Global gradient-norm clipping threshold. |
| --weight-decay | float | 0.01 | AdamW weight decay. |
| --adam-beta1 / --adam-beta2 | float | 0.9 / 0.999 | Adam exponential decay rates. |
| --fp16 * | bool | false | Mixed-precision training with FP16 + loss scaling. |
| --bf16 * | bool | false | Mixed-precision training with BF16 (preferred on Ampere+). |
| --fp8-format * | string | (off) | One of hybrid, e4m3. Requires Transformer Engine + Hopper/Blackwell. |
| --fp8-amax-history-len | int | 1024 | Steps of amax history for FP8 scaling factors. |
| --use-flash-attn * | bool | false | Enable FlashAttention (FA2 on Ampere, FA3 on Hopper). |
| --use-distributed-optimizer * | bool | false | Distributed Adam (ZeRO-1 equivalent); shards optimiser state across DP. |
| --overlap-grad-reduce | bool | false | Overlap gradient ReduceScatter with backward compute. |
| --overlap-param-gather | bool | false | Overlap parameter AllGather with the next forward. |
| --recompute-granularity | string | (off) | One of selective, full. Selective recomputes only the largest activations. |
| --recompute-method | string | uniform | One of uniform, block. Block recomputes whole transformer blocks at once. |
| --recompute-num-layers | int | 1 | Number of layers per recompute group when --recompute-method=block. |
| --data-path * | string | (required) | Prefix to the preprocessed indexed dataset .bin/.idx. |
| --tokenizer-type | string | GPT2BPETokenizer | One of GPT2BPETokenizer, SentencePieceTokenizer, HuggingFaceTokenizer, Llama3Tokenizer. |
| --save / --load | path | (required) | Checkpoint save and resume directories. |
| --save-interval | int | (required) | Steps between checkpoint writes. |
| --ckpt-format | string | torch | One of torch, torch_dist. Use torch_dist for sharded async writes. |
| --tensorboard-dir | path | (off) | Enables TensorBoard logging. |
| --wandb-project | string | (off) | Enables Weights & Biases logging. |
`--use-flash-attn` is a hard requirement at scale. With it off, Megatron falls back to a non-flash attention path that materialises the full N^2 attention matrix in HBM. For L=8192 in BF16 that is about 134 MB per head per sequence (8192^2 x 2 bytes), or more than 1 GB per layer on a rank holding 8 heads, and it will OOM before any other config matters.
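The size of a materialised attention-score matrix follows directly from the sequence length. The sketch below is a back-of-envelope estimate that counts only the score tensor for one layer (not softmax outputs or dropout masks kept for backward); `attention_scores_bytes` is a hypothetical helper name.

```python
def attention_scores_bytes(seq_len: int, heads: int = 1, micro_batch: int = 1,
                           bytes_per_element: int = 2) -> int:
    """Bytes needed to hold the full seq_len x seq_len score matrix for one layer."""
    return micro_batch * heads * seq_len * seq_len * bytes_per_element


per_head = attention_scores_bytes(8192)            # one head, one sequence, BF16
per_rank = attention_scores_bytes(8192, heads=8)   # e.g. 64 heads split over TP=8
print(f"{per_head / 1e6:.0f} MB per head, {per_rank / 1e9:.2f} GB per layer per rank")
```

Because the cost grows with the square of the sequence length, doubling the context quadruples this figure, which is why the unfused path stops being viable well before other memory limits are reached.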
Workload patterns
Three workload shapes cover the bulk of Megatron-LM production usage: dense LLM pretraining from random weights at the 7B / 70B / 405B scales, continued pretraining of an open model on domain data, and large-scale SFT on instruction corpora. Each has its own parallelism and precision profile, and each maps cleanly to a Yobitel NeoCloud training-pod size — Pattern A on a 32-node (256-GPU) H100 pod with InfiniBand NDR, Pattern B on a 16-node (128-GPU) pod, Pattern C on a single 8x H100 node.
- Pattern A: dense pretraining of a Llama-style 70B from random weights on a 256-GPU H100 cluster (the canonical Yobitel NeoCloud 32-node training pod).
- Pattern B: continued pretraining of Llama-3.1 70B on 100B tokens of domain corpus, starting from a HuggingFace checkpoint converted to Megatron format, on a 16-node NeoCloud pod.
- Pattern C: large-scale supervised fine-tuning of Llama-3.1 8B on a 5M-example instruction set using packing, on a single 8x H100 NeoCloud node.
```bash
# A: Llama-style 70B pretrain on 32 nodes x 8 H100 (256 GPUs)
# TP=8 (intra-node NVLink) x PP=4 (cross-node IB) x DP=8
torchrun --nproc_per_node=8 --nnodes=32 \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:29500 \
    pretrain_gpt.py \
    --num-layers 80 --hidden-size 8192 --num-attention-heads 64 \
    --num-query-groups 8 --ffn-hidden-size 28672 \
    --seq-length 8192 --max-position-embeddings 8192 \
    --position-embedding-type rope --swiglu --normalization RMSNorm \
    --micro-batch-size 1 --global-batch-size 1024 \
    --train-iters 480000 \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 4 \
    --virtual-pipeline-model-parallel-size 5 \
    --sequence-parallel --use-flash-attn \
    --bf16 --fp8-format hybrid \
    --use-distributed-optimizer \
    --overlap-grad-reduce --overlap-param-gather \
    --recompute-granularity selective \
    --lr 1.5e-4 --min-lr 1.5e-5 --lr-decay-style cosine --lr-warmup-iters 4000 \
    --data-path ${DATA_PREFIX} \
    --tokenizer-type Llama3Tokenizer \
    --save ${CKPT_DIR} --load ${CKPT_DIR} --save-interval 2000 \
    --ckpt-format torch_dist
```

```bash
# B: Continued pretrain of Llama-3.1 70B on domain corpus, 16 nodes
# Start from a HF -> Megatron-converted checkpoint
python tools/checkpoint/convert.py \
    --model-type GPT --loader llama_mistral --saver megatron \
    --load-dir ${HF_DIR} --save-dir ${MEGATRON_INIT} \
    --target-tensor-parallel-size 8 --target-pipeline-parallel-size 2

# Architecture, data and tokenizer flags are omitted here; they must match
# the converted checkpoint.
torchrun --nproc_per_node=8 --nnodes=16 ... \
    pretrain_gpt.py \
    --load ${MEGATRON_INIT} --no-load-optim --no-load-rng \
    --finetune \
    --global-batch-size 512 --micro-batch-size 1 \
    --train-iters 100000 --lr 5e-5 --min-lr 5e-6 \
    --tensor-model-parallel-size 8 --pipeline-model-parallel-size 2 \
    --sequence-parallel --use-flash-attn --bf16 \
    --use-distributed-optimizer
```

```bash
# C: SFT of Llama-3.1 8B on instruction corpus, single 8x H100 node
# Use NeMo or NeMo-Aligner for the loss/packing surface
torchrun --nproc_per_node=8 \
    examples/nlp/language_modeling/megatron_gpt_finetune.py \
    --config-path conf --config-name megatron_gpt_sft \
    model.restore_from_path=${LLAMA_8B_CKPT} \
    model.tensor_model_parallel_size=2 \
    model.pipeline_model_parallel_size=1 \
    model.data.train_ds.packed_sequence=True \
    model.data.train_ds.max_seq_length=8192 \
    model.optim.lr=2e-6 \
    trainer.max_steps=20000
```

Sizing and capacity planning
The two questions that drive Megatron sizing are: which parallelism shape fits the model, and how many GPUs do I need to finish in a fixed wall-clock? The table below gives reference parallelism configs and observed throughput on H100 SXM5 clusters with 400Gb InfiniBand NDR, BF16+FP8 mixed precision, FA3 attention, sequence parallel on, distributed optimiser on, and selective recomputation. Throughput is per-GPU sustained tokens-per-second from internal training-run telemetry and the published Nemotron and NVIDIA H100 MLPerf records; treat as planning anchors.
- Weak-scaling efficiency on H100 with the above recipe holds at 85-92 percent from 256 to 4,096 GPUs; above 4,096 it drops to 75-85 percent as DP AllReduce starts to dominate.
- Move to Blackwell (B200) and MXFP4 for a roughly 2.7x throughput uplift on dense 70B-class training versus the same shape on H100 FP8.
- MoE models add expert parallelism (EP), which Megatron Core carves out of the data-parallel ranks rather than the TP group; EP=8 is the typical choice for 8-expert top-2 routing. Expert AllToAll dominates at EP>8 unless on NVLink Switch.
- Context parallelism becomes necessary above ~64K sequence length; below that, sequence parallel inside TP is the simpler choice.
| Model size | Cluster | TP x PP x DP | Global batch | Per-GPU tok/s | Days for 1T tokens |
|---|---|---|---|---|---|
| 8B (Llama-style) | 8x H100 | 1 x 1 x 8 | 1024 | 12,500 | 115.7 |
| 8B | 64x H100 | 1 x 1 x 64 | 1024 | 11,800 | 15.3 |
| 70B | 256x H100 | 8 x 4 x 8 | 1024 | 4,200 | 10.8 |
| 70B | 512x H100 | 8 x 4 x 16 | 1024 | 4,000 | 5.7 |
| 175B (GPT-3 class) | 1024x H100 | 8 x 8 x 16 | 1536 | 2,400 | 4.7 |
| 340B (Nemotron class) | 2048x H100 | 8 x 16 x 16 | 2304 | 1,500 | 3.8 |
| 405B (Llama-3.1 class) | 4096x H100 | 8 x 16 x 32 | 2304 | 1,250 | 2.3 |
| 405B on Blackwell | 1024x B200 | 8 x 8 x 16 | 2304 | 3,400 | 3.3 |
| 1T (frontier) | 8192x H100 | 8 x 32 x 32 | 4096 | 750 | 1.9 |
| 8x22B MoE (Mixtral class) | 512x H100 | 8 x 4 x 16 EP=8 | 1024 | 2,800 | 8.1 |
Run the official `examples/llama` recipes as your baseline before adapting. They encode flag values (--overlap-grad-reduce, --overlap-param-gather, --virtual-pipeline-model-parallel-size, --recompute-granularity selective) that are easy to forget and cost 10-20 percent of throughput when missed.
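The wall-clock column is plain arithmetic on per-GPU throughput, and the same arithmetic answers the inverse question of how many GPUs a deadline requires. The sketch below uses hypothetical helper names and assumes per-GPU throughput stays constant as the cluster grows, which weak-scaling losses will erode in practice.

```python
import math

SECONDS_PER_DAY = 86_400


def days_to_train(tokens: float, per_gpu_tok_s: float, gpus: int) -> float:
    """Wall-clock days to process `tokens` at a sustained per-GPU throughput."""
    return tokens / (per_gpu_tok_s * gpus) / SECONDS_PER_DAY


def gpus_for_deadline(tokens: float, per_gpu_tok_s: float, days: float) -> int:
    """Smallest GPU count that finishes within `days` (constant per-GPU throughput)."""
    return math.ceil(tokens / (per_gpu_tok_s * days * SECONDS_PER_DAY))


print(f"{days_to_train(1e12, 4_200, 256):.1f} days")
print(gpus_for_deadline(1e12, 4_200, days=10.8), "GPUs")
```

In practice the GPU count would then be rounded up to a multiple of TP x PP so that the layout still factors cleanly.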
Limits and quotas
Megatron itself has few hard limits; what bounds a run are GPU memory, NCCL group counts, and the data-loader contract. The table below summarises the constraints worth knowing before designing a parallelism shape.
| Constraint | Default / ceiling | How to manage |
|---|---|---|
| hidden_size divisible by TP | Required | Choose TP from {1,2,4,8} that divides hidden. |
| num_attention_heads divisible by TP | Required | Constrains TP for thin-head architectures. |
| num_query_groups divisible by TP | Required for GQA | If GQA=8, TP cannot exceed 8. |
| num_layers divisible by PP * VPP | Required | Choose PP so layers split evenly. |
| seq_length divisible by TP (with SP) | Required | Pad short batches; choose TP that divides seq. |
| seq_length divisible by CP | Required for CP | Choose CP from {1,2,4,8} that divides seq. |
| Micro-batch size | 1 typical at 70B+ | Larger MB inflates activations; PP needs >= PP micro-batches. |
| NCCL communicator count | World-size dependent | Set NCCL_COMM_ID, NCCL_NET_GDR_LEVEL; use NCCL >= 2.20. |
| Indexed dataset size | Per-file ~1TB | Split into multiple .bin/.idx files; Megatron concatenates. |
| Checkpoint size | Model + 12 bytes/param (Adam) | Use --ckpt-format torch_dist for sharded async writes. |
| FP8 amax history | 1024 steps | Raise for very long training; storage cost negligible. |
| Single-step wallclock | 30-90s typical at 70B | Iteration time should hold within +-3 percent in steady state. |
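The divisibility rows in the table lend themselves to a pre-flight check before a job is queued. The sketch below encodes only the constraints listed above; `shape_violations` is a hypothetical helper, and Megatron-LM performs its own, more extensive argument validation at start-up.

```python
from typing import List, Optional


def shape_violations(
    hidden_size: int,
    num_attention_heads: int,
    num_layers: int,
    seq_length: int,
    tp: int = 1,
    pp: int = 1,
    vpp: int = 1,
    cp: int = 1,
    num_query_groups: Optional[int] = None,
    sequence_parallel: bool = False,
) -> List[str]:
    """Return the divisibility rules a parallelism shape breaks (empty list = OK)."""
    checks = [
        (hidden_size % tp == 0, "hidden_size must be divisible by TP"),
        (num_attention_heads % tp == 0, "num_attention_heads must be divisible by TP"),
        (num_layers % (pp * vpp) == 0, "num_layers must be divisible by PP * VPP"),
        (seq_length % cp == 0, "seq_length must be divisible by CP"),
    ]
    if num_query_groups is not None:
        checks.append((num_query_groups % tp == 0,
                       "num_query_groups must be divisible by TP"))
    if sequence_parallel:
        checks.append((seq_length % tp == 0,
                       "seq_length must be divisible by TP when SP is on"))
    return [message for ok, message in checks if not ok]


# Llama-style 70B shape: 80 layers, TP=8, PP=4, VPP=5, GQA with 8 KV groups.
print(shape_violations(8192, 64, 80, 8192, tp=8, pp=4, vpp=5,
                       num_query_groups=8, sequence_parallel=True))
# Same model with PP=16 and VPP=2: 80 layers do not split into 32 chunks.
print(shape_violations(8192, 64, 80, 8192, tp=8, pp=16, vpp=2))
```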
Observability
Megatron-LM emits TensorBoard scalars and optional Weights & Biases logs covering per-step loss, learning rate, gradient norm, samples-per-second, TFLOPs-per-GPU, FP8 amax, optimiser-state norms, and (under --log-memory-to-tensorboard) GPU memory peaks per phase. The metrics that matter operationally at training-cluster scale are throughput stability, gradient-norm sanity, and FP8 scaling-factor health.
On Hopper / Blackwell, pair Megatron logs with NVIDIA DCGM exporter (`DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_MEM_COPY_UTIL`, `DCGM_FI_PROF_SM_ACTIVE`, `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`, `DCGM_FI_PROF_NVLINK_TX_BYTES`) and NCCL profiling (`NCCL_DEBUG=INFO`, `NCCL_DEBUG_SUBSYS=COLL`). The three signals that detect 90 percent of training-cluster problems are: iteration time variance, NCCL AllReduce p99 latency, and per-rank step-loss divergence.
- iteration-time / samples-per-second: holds within +-3 percent in steady state; >10 percent dip means stragglers, IB packet loss, or thermal throttling.
- TFLOPs-per-GPU: should hit 40-60 percent of the device peak; drops correlate with attention shape or recompute config.
- grad-norm: spikes above 10x baseline indicate divergence; correlate with LR schedule.
- loss curve: per-rank loss should match within numerical noise; divergence means a DP rank dropped or got bad data.
- FP8 amax_history_max: persistent saturation indicates clip-and-rescale should fire; never-saturating means FP8 is wasted.
- DCGM PIPE_TENSOR_ACTIVE: the most honest tensor-core utilisation signal; pair with SM_ACTIVE.
- NCCL_DEBUG=WARN in steady state; flip to INFO when investigating slow steps.
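To sanity-check a reported tokens-per-second figure against the 40-60 percent target, a common approximation is that a dense transformer costs about 6 FLOPs per parameter per token for forward plus backward. The sketch below uses that approximation, which ignores attention FLOPs and recomputation, and assumes a dense BF16 peak of 989 TFLOPs for H100 SXM; the function name and the example inputs are illustrative, not measurements.

```python
def model_flops_utilisation(params: float, per_gpu_tok_s: float,
                            peak_tflops: float) -> float:
    """Approximate MFU using the 6 * params FLOPs-per-token rule of thumb."""
    achieved_tflops = 6 * params * per_gpu_tok_s / 1e12
    return achieved_tflops / peak_tflops


H100_BF16_PEAK_TFLOPS = 989  # assumption: dense BF16 peak for H100 SXM

# Illustrative inputs: an 8B model sustaining 10,000 tokens/s per GPU.
mfu = model_flops_utilisation(8e9, 10_000, H100_BF16_PEAK_TFLOPS)
print(f"{mfu:.1%}")
```

Megatron's own TFLOPs-per-GPU log line uses a fuller formula that includes attention, so this estimate will read slightly low at long sequence lengths.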
```yaml
# Prometheus rules for a Megatron-LM pretraining job
# Metric names assume an exporter that republishes Megatron's logged scalars.
groups:
  - name: megatron-training
    interval: 60s
    rules:
      - alert: MegatronIterationTimeRegression
        expr: |
          (avg_over_time(megatron:iteration_time_seconds[5m])
            / avg_over_time(megatron:iteration_time_seconds[1h] offset 30m)) > 1.10
        for: 10m
        labels: { severity: warning, team: training }
        annotations:
          summary: "Iteration time +10% vs 1h baseline on {{ $labels.job_name }}"
      - alert: MegatronGradNormSpike
        expr: megatron:grad_norm > 10 * avg_over_time(megatron:grad_norm[1h] offset 30m)
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Grad-norm spike: investigate LR or data corruption"
      - alert: MegatronTFLOPsCollapse
        expr: |
          avg_over_time(megatron:tflops_per_gpu[10m]) < 0.5 *
            avg_over_time(megatron:tflops_per_gpu[1h] offset 30m)
        for: 15m
        labels: { severity: critical }
        annotations:
          summary: "Per-GPU TFLOPs halved: check NCCL, FA3 kernel, recompute config"
      - alert: MegatronNCCLAllReduceP99
        expr: |
          histogram_quantile(0.99,
            rate(nccl:allreduce_seconds_bucket[5m])) > 0.2
        for: 10m
        annotations:
          summary: "NCCL AllReduce p99 > 200ms: fabric or topology regression"
      - alert: MegatronFP8AmaxSaturation
        expr: megatron:fp8_amax_history_max / megatron:fp8_amax_history_threshold > 0.95
        for: 30m
        annotations:
          summary: "FP8 amax persistently saturated: recompute scale or fall back to BF16"
```

Cost and FinOps
Megatron-LM cost is almost entirely a function of two numbers: GPU-hour rate and per-GPU sustained throughput. The training cost per trillion tokens for a given model size is `cluster_GPUs x hours x hourly_rate`, where `hours = 1T / (per_GPU_tok_s x cluster_GPUs x 3600)`; cluster size cancels, so the cost depends only on per-GPU throughput and the hourly rate. The table below uses Yobitel UK list pricing (June 2026) for on-demand H100 SXM5 ($3.10/GPU-hour), reserved H100 ($1.95), and B200 ($5.50 on-demand, $3.45 reserved) and the throughput anchors from the Sizing table.
- Reserved capacity (1-3 year terms) drops the rate ~35-40 percent for predictable multi-month pretraining runs.
- Blackwell B200 doubles to triples per-GPU tok/s on dense models; the rate premium pays back within the first month of training.
- FP8 weights + FP8 activations cut wall-clock 1.5-1.8x on H100 versus pure BF16; FP4 on Blackwell another 1.5x.
- Checkpoint storage is a non-trivial line item — a 405B checkpoint with Adam state is ~5 TB; sharded async writes (`--ckpt-format torch_dist`) and pruning intermediate checkpoints matter.
- Failed/restarted iterations are the silent cost. A 10-percent restart rate doubles the schedule slip. Spend on storage, network, and DCGM monitoring before spending on more GPUs.
| Model | Cluster | Per-GPU tok/s | $/GPU-hour | USD per 1T tokens |
|---|---|---|---|---|
| 70B on H100 on-demand | 256x H100 | 4,200 | $3.10 | $205,000 |
| 70B on H100 reserved | 256x H100 | 4,200 | $1.95 | $129,000 |
| 175B on H100 reserved | 1024x H100 | 2,400 | $1.95 | $226,000 |
| 340B on H100 reserved | 2048x H100 | 1,500 | $1.95 | $361,000 |
| 405B on H100 reserved | 4096x H100 | 1,250 | $1.95 | $433,000 |
| 405B on B200 reserved | 1024x B200 | 3,400 | $3.45 | $282,000 |
| 1T on H100 reserved | 8192x H100 | 750 | $1.95 | $722,000 |
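The cost formula stated at the top of this section can be applied directly. The sketch below implements it literally, keeping the cluster size as an argument to show that it cancels out; `cost_per_tokens` is a hypothetical helper, and the figures it produces cover GPU rental only, not storage, networking, or restarts.

```python
def cost_per_tokens(per_gpu_tok_s: float, hourly_rate: float,
                    cluster_gpus: int, tokens: float = 1e12) -> float:
    """USD to process `tokens`: cluster_GPUs x hours x hourly_rate."""
    hours = tokens / (per_gpu_tok_s * cluster_gpus * 3600)
    return cluster_gpus * hours * hourly_rate


# Same per-GPU throughput and rate on two cluster sizes: the cost is identical,
# only the wall-clock time changes.
small = cost_per_tokens(4_200, 3.10, cluster_gpus=256)
large = cost_per_tokens(4_200, 3.10, cluster_gpus=512)
print(f"${small:,.0f} vs ${large:,.0f}")
```

The cancellation holds only while per-GPU throughput is unchanged; once scaling efficiency drops on a larger cluster, the lower throughput feeds straight into a higher cost per token.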
Security and compliance
Megatron-LM has no built-in auth surface — it is a training script, not a serving system. Security controls apply at the cluster boundary: Slurm or Kubernetes job submission auth, private container registry for the NVIDIA / NeMo base images, encrypted weights-at-rest on the checkpoint filesystem (Lustre with kernel crypto or NetApp with NVE), and network isolation of the training fabric from public ingress. Training-time data does not leave the cluster; the only egress is checkpoint writes and metrics shipping.
For UK and EU sovereign training programmes — government foundation models, regulated-industry domain pretrains — Megatron-LM runs inside the same sovereign tenancies that satisfy NCSC Cloud Security Principles and G-Cloud 14 lot definitions. The framework itself is Apache 2.0 (with NVIDIA-specific clauses around brand and trademark) and has no telemetry call-home; the only external dependency at runtime is the W&B / TensorBoard sink if you configure one. For air-gapped environments, mirror the NGC and PyPI dependencies into the internal registry and pin Transformer Engine and Apex versions.
Reproducibility for compliance audits requires pinning the entire stack: Megatron-LM commit hash, Transformer Engine version, Apex version, CUDA, cuDNN, NCCL, driver, base image digest. NVIDIA NeMo's NGC containers ship the full pin set in `/etc/nvidia/container-versions.txt`.
Migration and alternatives
Three migration paths dominate. From FSDP/HSDP: the trigger is usually scale — FSDP at FULL_SHARD is competitive through ~30B, manageable to ~70B with HSDP, but loses ground above that as the per-layer AllGather amortises poorly. The move to Megatron buys you true tensor parallelism (no per-layer parameter materialisation) and interleaved 1F1B pipeline. From DeepSpeed: similar capability surface but DeepSpeed leans on ZeRO-3 for memory while Megatron leans on TP+PP for compute scaling. From NeMo: NeMo wraps Megatron Core, so 'migrating to Megatron' from NeMo is really 'dropping a layer of recipe ergonomics' — useful when you need to change something NeMo's Hydra configs do not expose.
| From | Trigger to migrate | Effort | What you gain / lose |
|---|---|---|---|
| FSDP / HSDP (PyTorch) | Model > 70B, throughput plateau on Hopper | Medium — config + data pipeline rewrite | Gain TP+PP, FP8 via TE; lose PyTorch-native ergonomics. |
| DeepSpeed ZeRO-3 | Model > 100B or interconnect-bound at ZeRO-3 | Low-medium — similar mental model | Gain explicit 3D parallelism; lose ZeRO-Infinity offload. |
| NeMo Framework | Need flag not exposed by NeMo Hydra config | Low — same engine underneath | Gain config flexibility; lose recipe catalogue + SFT/RLHF wrappers. |
| JAX + Pax / MaxText | Switching from TPU to GPU at scale | High — different framework family | Gain CUDA-optimised path; lose JAX functional purity. |
| Colossal-AI | Specific optimisation that Colossal exposes (e.g. heterogeneous training) | Low — Colossal vendors Megatron primitives | Gain Megatron's stability; lose Colossal's research features. |
Do not migrate to Megatron-LM for fine-tuning or LoRA work. The framework is heavyweight by design (binary indexed datasets, Megatron checkpoint format, configuration sprawl). For SFT, DPO, LoRA, use NeMo (which wraps Megatron with recipes), torchtune, axolotl, or the HuggingFace trl stack. Megatron is the right tool when you are pretraining, or continuing pretraining, at 32+ GPUs.
Troubleshooting
The error patterns below account for most production Megatron incidents observed on Yobitel-operated training fleets and the public NVIDIA NeMo issue tracker. Each row maps the observable symptom to the underlying mechanism and the minimum-viable fix.
| Symptom | Cause | Fix |
|---|---|---|
| NCCL hang on init | Mismatched NCCL versions across nodes or /dev/shm too small. | Pin NCCL >= 2.20 cluster-wide; mount /dev/shm >= 16GB; export NCCL_DEBUG=INFO. |
| torch.cuda.OOM during forward at step 0 | TP/PP shape allocates more activation than expected at peak. | Lower --micro-batch-size to 1; enable --recompute-granularity selective. |
| loss spikes to NaN after a few hundred steps | FP16 loss scaling collapsed (use --bf16) or gradient clipping off. | Switch to --bf16; set --clip-grad 1.0; check data for outliers. |
| TFLOPs drop suddenly mid-training | Background daemon stealing SMs, or thermal throttling on a rack. | Pin nvidia-smi clocks; check dmesg for ECC errors; correlate to DCGM thermals. |
| FP8 amax history all zeros | FP8 not actually enabled (TE not linked) or scale path bypassed. | Verify `--fp8-format hybrid` accepted; check TE version compatibility; confirm FP8 GEMM kernels appear in an Nsight Systems trace (nvprof does not support Hopper). |
| Pipeline bubble dominates step time | Too few micro-batches per pipeline fill (M < 4*K). | Raise --global-batch-size or lower --pipeline-model-parallel-size. |
| Per-rank loss diverges across DP group | DP rank received corrupted data shard or gradient AllReduce failed silently. | Enable --consistency-check; rerun preprocess; verify identical seeds across ranks. |
| Distributed Adam errors on resume | --use-distributed-optimizer flag flipped between save and load. | Always preserve optimiser flags across resumes; never mix sharded and replicated state. |
| Sequence parallel produces wrong gradients | Custom CUDA kernel assumes contiguous activation; SP shards it. | Rewrite kernel to take stride args, or disable SP for the affected block. |
| HuggingFace conversion produces gibberish | Tokenizer or weight-permutation mismatch (GQA, RoPE base, etc.). | Use the version-matched `tools/checkpoint/convert.py`; verify with a 10-token sanity prompt. |
| torchrun rendezvous fails on multi-node | MASTER_ADDR unreachable or firewall on port 29500. | Use c10d backend with explicit --rdzv_endpoint; open ports; verify hostname resolution. |
| DataLoader stalls at iteration boundary | Indexed dataset file on slow NFS; preprocess sharded for parallel reads. | Use Lustre or local NVMe; verify --num-workers; cache .idx files on each node. |
Where this fits in the Yobitel stack
Megatron-LM (via NeMo Framework) is the standard training engine on Yobitel sovereign GPU tenancies for any pretraining or continued-pretraining workload above 32 GPUs. Yobitel ships NGC-derived NeMo + Megatron Core containers, pre-validated NCCL and InfiniBand configurations for the H100, H200, and B200 fleets, and reference Slurm + Pyxis launch scripts that encode the parallelism-shape recommendations from this entry's Sizing table.
For customers training UK or EU sovereign foundation models — government, defence, regulated-industry — Megatron runs inside London-1 and Frankfurt-1 tenancies satisfying NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat. Weights never leave the customer tenancy; the Yobitel control plane handles only metrics, scheduling, and health telemetry. Where SFT / DPO / RLHF is needed downstream, the same checkpoints flow into NeMo-Aligner on the same cluster without re-sharding.
Yobibyte, Yobitel's managed AI-native platform, exposes Megatron-trained models for inference via its vLLM and TensorRT-LLM endpoints; InferenceBench v3 then scores throughput and latency on the deployed checkpoints across the H100, H200, B200 and MI300X SKUs the customer rents. The training-to-serving handoff is one HF-format conversion away.
References
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism · arXiv (Shoeybi et al., 2019)
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM · arXiv (Narayanan et al., 2021)
- Reducing Activation Recomputation in Large Transformer Models · arXiv (Korthikanti et al., 2022)
- Megatron-LM on GitHub · GitHub (NVIDIA)
- Megatron Core documentation · NVIDIA
- NVIDIA Transformer Engine · GitHub (NVIDIA)
- Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B · arXiv (Smith et al., 2022)