Megatron-LM

TL;DR

Open-source training framework from NVIDIA Applied Deep Learning Research, first released alongside the 2019 Megatron-LM paper (Shoeybi et al., arXiv:1909.08053). Apache 2.0 with NVIDIA-specific clauses; hosted at github.com/NVIDIA/Megatron-LM.
Origin of tensor parallelism, sequence parallelism, selective activation recomputation, and interleaved 1F1B pipeline parallelism as practised today; refactored into Megatron Core, the library form embedded in NeMo Framework, NeMo-Aligner, NVIDIA Nemotron training, and most NVIDIA-partner pretraining stacks.
Composes five parallelism dimensions — data, tensor, pipeline, sequence, context — plus Distributed Adam (ZeRO-1-style) and Transformer Engine FP8 / FP4 kernels. Sustains 40-60 percent of theoretical peak FLOPs on H100 clusters and has been demonstrated at 16,384 H100 scale.
The codebase used (directly or via NeMo) to train GPT-3-class to GPT-4-class models, including Megatron-Turing NLG 530B, Llama-3 derivatives, Nemotron-4 340B, Falcon, and BLOOM derivatives, and a widely used empirical reference for the parallelism-strategy choices made by other large-scale training frameworks.

Overview

Megatron-LM is both a paper series (Shoeybi 2019, Narayanan 2021, Korthikanti 2022) and an open-source repository at github.com/NVIDIA/Megatron-LM. The first paper introduced tensor parallelism and trained an 8.3B-parameter model on 512 V100 GPUs using 8-way tensor parallelism combined with 64-way data parallelism. The second introduced interleaved 1F1B pipeline parallelism and demonstrated near-linear weak scaling to 3,072 A100s for a 1T-parameter target. The third added sequence parallelism and selective activation recomputation, cutting activation memory by roughly 5x and removing over 90 percent of the execution-time overhead of recomputation.

Functionally, Megatron-LM is a CUDA-Python codebase wrapping PyTorch with custom autograd-aware collectives (NCCL AllReduce / AllGather / ReduceScatter / P2P), a Transformer Engine integration for FP8 and FP4 math on Hopper and Blackwell, a Distributed Adam optimiser that shards moment state across the DP group (ZeRO-1 in spirit), and a corpus of launch scripts under `examples/` that show how the pieces compose for GPT, BERT, T5, RETRO, Llama, Mistral, Mixtral and DeepSeek-style architectures.

By 2026 Megatron Core (the refactored library form, `megatron.core`) is the embedded engine inside NVIDIA NeMo Framework, NeMo-Aligner for SFT/DPO/RLHF, the NVIDIA Nemotron training pipeline, Colossal-AI's high-performance path, and many internal lab forks. If your training run is north of ~30B parameters and on NVIDIA hardware, you are almost certainly running Megatron primitives, either directly or one wrapper away. Yobitel NeoCloud customers training 70B+ models commonly use Megatron-LM (or NeMo, which wraps Megatron Core) as the default pretraining engine on multi-node H100, H200, and B200 training pods in the UK and EU regions.

This entry documents the production surface: the CLI and flag set, the four parallelism axes plus optimiser sharding, the data-pipeline contract, sizing tables at the common scales (70B / 175B / 405B), the recommended Hopper and Blackwell recipes, and the migration paths from FSDP, DeepSpeed and NeMo. This entry helps you choose and operate Megatron-LM for training pods on Yobitel NeoCloud or your own multi-GPU cluster.

Quick start

The example below pretrains a GPT-3-style 1.3B model on 8x H100 SXM5 using tensor parallelism within the node, BF16 weights, FlashAttention-3, sequence parallelism on the TP group, and the Megatron Distributed Adam optimiser. The first block clones Megatron-LM, installs apex and Transformer Engine, and preprocesses a sample corpus into the indexed binary format Megatron expects. The second block launches the pretraining run with `torchrun`. The third block converts the resulting checkpoint to HuggingFace format for downstream serving.

bash

# 1. Clone Megatron-LM and prepare a sample indexed dataset
git clone https://github.com/NVIDIA/Megatron-LM && cd Megatron-LM
pip install -r requirements.txt
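pip install "transformer-engine[pytorch]"  # CUDA 12.4+
# NVIDIA Apex is not the PyPI package named "apex"; install it from source.
# See the Apex README for the flags that build the CUDA/C++ extensions.
pip install -v --no-build-isolation git+https://github.com/NVIDIA/apex.git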

python tools/preprocess_data.py \
    --input ./data/oscar-sample.jsonl \
    --output-prefix ./data/oscar-sample \
    --vocab-file ./vocab/gpt2-vocab.json \
    --merge-file ./vocab/gpt2-merges.txt \
    --tokenizer-type GPT2BPETokenizer \
    --workers 32 --append-eod
# Produces oscar-sample_text_document.{bin,idx}

# 2. Pretrain a 1.3B GPT on 8x H100 with TP=2, DP=4, BF16, FA3, SP
GPUS_PER_NODE=8
torchrun --nproc_per_node=$GPUS_PER_NODE \
    pretrain_gpt.py \
    --num-layers 24 --hidden-size 2048 --num-attention-heads 16 \
    --seq-length 4096 --max-position-embeddings 4096 \
    --micro-batch-size 4 --global-batch-size 256 \
    --train-iters 50000 --lr 2.0e-4 --min-lr 2.0e-5 \
    --lr-decay-style cosine --lr-warmup-iters 2000 \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 1 \
    --sequence-parallel \
    --use-flash-attn \
    --bf16 \
    --use-distributed-optimizer \
    --recompute-granularity selective \
    --data-path ./data/oscar-sample_text_document \
    --vocab-file ./vocab/gpt2-vocab.json \
    --merge-file ./vocab/gpt2-merges.txt \
    --save ./checkpoints/gpt-1b3 --load ./checkpoints/gpt-1b3 \
    --save-interval 5000 --log-interval 10 \
    --tensorboard-dir ./tb

# 3. Convert the Megatron checkpoint to HuggingFace format for serving
#    (loader/saver plugin names differ between releases; check tools/checkpoint/ in your checkout)
python tools/checkpoint/convert.py \
    --model-type GPT --loader megatron --saver llama_mistral \
    --load-dir ./checkpoints/gpt-1b3 \
    --save-dir ./hf/gpt-1b3-hf \
    --tokenizer-model ./vocab/gpt2-vocab.json

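Use `--use-distributed-optimizer` from day one. It shards the FP32 master weights and Adam moment state across the DP group without adding communication volume (the gradient AllReduce becomes a ReduceScatter plus a parameter AllGather); without it, every rank holds roughly 12 bytes of optimiser state per parameter it owns (4 for the FP32 master copy and 8 for the two Adam moments).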

How it works

Megatron organises a training run around four orthogonal parallelism axes that compose multiplicatively: world size = DP x TP x PP x CP. Sequence parallelism is a sub-mode of TP rather than a separate axis, which is why it does not appear in the product. Each axis trades a different cost: TP buys per-layer compute scaling at the cost of high-bandwidth intra-block AllReduce; PP buys layer-count scaling at the cost of pipeline-bubble idle time; DP buys batch-size scaling at the cost of one gradient AllReduce per step; CP buys sequence-length scaling at the cost of attention-time P2P traffic.
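The multiplicative relationship can be checked mechanically before a launch. The sketch below is a hypothetical helper (the names `derive_dp` and `micro_batches_per_step` are illustrative and not part of Megatron-LM): it derives DP from the world size and the model-parallel sizes, then the number of micro-batches each replica accumulates per optimiser step.

```python
def derive_dp(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Return the data-parallel size implied by world = DP x TP x PP x CP."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


def micro_batches_per_step(global_batch: int, micro_batch: int, dp: int) -> int:
    """Micro-batches each DP replica processes per optimiser step."""
    per_pass = micro_batch * dp
    if global_batch % per_pass != 0:
        raise ValueError("global batch must be a multiple of micro_batch * DP")
    return global_batch // per_pass


if __name__ == "__main__":
    # Quick-start shape from this entry: 8 GPUs, TP=2, PP=1.
    dp = derive_dp(world_size=8, tp=2, pp=1)
    print(dp, micro_batches_per_step(global_batch=256, micro_batch=4, dp=dp))
```

For the quick-start shape this gives DP=4 and 16 accumulated micro-batches per step; a shape that does not divide evenly raises before any GPU time is spent.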

Inside the forward pass, the standard transformer block becomes: (1) column-parallel QKV projection, where every TP rank computes a slice; (2) FA3 attention with the attention heads sharded across the TP group; (3) row-parallel output projection followed by an AllReduce (replaced by ReduceScatter + AllGather when sequence parallel is on); (4) column-parallel up-projection; (5) GELU/SwiGLU; (6) row-parallel down-projection with the same collective. Sequence parallelism shards the LayerNorm and dropout activations along the sequence dimension and replaces the per-block AllReduce with AllGather + ReduceScatter: the total bytes moved are the same, but those activations are no longer replicated, so their memory footprint drops by a factor of N=TP.
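To see why a column-parallel projection followed by a row-parallel projection needs only one collective, the toy below simulates the MLP half of the block in pure Python. It is a sketch under stated assumptions: ReLU stands in for GELU/SwiGLU, a plain element-wise sum stands in for the NCCL AllReduce, and all function names are illustrative rather than Megatron APIs.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def tensor_parallel_mlp(x, w_up, w_down, tp):
    """Simulate TP ranks: column-parallel up, row-parallel down, then sum."""
    up_shards = split_columns(w_up, tp)      # each rank holds a column slice
    down_shards = split_rows(w_down, tp)     # and the matching row slice
    partials = [
        matmul(relu(matmul(x, up)), down)    # no communication until here
        for up, down in zip(up_shards, down_shards)
    ]
    out = partials[0]
    for p in partials[1:]:
        out = add(out, p)                    # stands in for the AllReduce
    return out


x = [[1.0, -2.0], [0.5, 3.0]]                                # 2 tokens, hidden=2
w_up = [[1.0, -1.0, 2.0, 0.0], [0.0, 1.0, -1.0, 2.0]]        # hidden x ffn
w_down = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 1.0]]  # ffn x hidden

reference = matmul(relu(matmul(x, w_up)), w_down)
sharded = tensor_parallel_mlp(x, w_up, w_down, tp=2)
assert sharded == reference
```

The element-wise non-linearity commutes with the column split, so each simulated rank runs both matmuls locally and the only cross-rank step is the final sum.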

Across nodes, pipeline parallelism partitions the layer stack into K stages. With M micro-batches per step, the plain 1F1B schedule idles for a bubble fraction of (K-1)/(M+K-1) of step time. The interleaved 1F1B schedule splits each stage into V virtual sub-stages, which divides the bubble time by V and brings the fraction down to (K-1)/(V*M+K-1). A 405B model on 1,024 H100s with TP=8, PP=16 and M=128 has a bubble of about 10 percent without interleaving and holds it below 5 percent with V >= 3. P2P sends/receives between adjacent stages are NCCL-backed and overlap with the next forward / backward, which is why PP tolerates 400Gb InfiniBand bandwidth where TP demands NVLink.
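The bubble arithmetic is easy to evaluate for a candidate shape. The sketch below follows the analysis in Narayanan et al. (2021): a 1F1B pipeline idles for K-1 micro-batch slots, and interleaving with V virtual stages per device divides that idle time by V. The function name and argument names are illustrative.

```python
def bubble_fraction(stages: int, micro_batches: int, virtual_stages: int = 1) -> float:
    """Idle share of step time for a 1F1B pipeline.

    Bubble time is (stages - 1) / virtual_stages micro-batch slots, on top of
    micro_batches slots of useful work.
    """
    if stages < 1 or micro_batches < 1 or virtual_stages < 1:
        raise ValueError("all arguments must be positive")
    bubble = (stages - 1) / virtual_stages
    return bubble / (micro_batches + bubble)


if __name__ == "__main__":
    # PP=16 with 128 micro-batches, at increasing interleaving depth.
    for v in (1, 2, 3, 4):
        print(v, round(bubble_fraction(stages=16, micro_batches=128, virtual_stages=v), 4))
```

For PP=16 and M=128 the fraction is about 10.5 percent with no interleaving and drops below 5 percent from V=3 onward, which is the practical argument for setting a virtual pipeline size at high PP.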

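The Distributed Adam optimiser (ZeRO-1 in DeepSpeed nomenclature) shards the FP32 master copy and the FP32 first-moment and second-moment buffers across the DP group. Each rank computes the update for its 1/DP slice and AllGathers the updated parameters. This drops optimiser-state memory from ~12 bytes per locally held parameter to ~12/DP. For a 70B model the full optimiser state is 840 GB (12 bytes x 70B); TP and PP already divide that across the model-parallel ranks, and the distributed optimiser divides each rank's share by DP again. In the limiting case of no model parallelism, DP=32 takes the per-rank figure from 840 GB to about 26 GB.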
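The per-rank saving can be estimated with a few lines of arithmetic. This is a rough sketch, assuming 12 bytes of optimiser state per parameter (FP32 master copy plus two FP32 Adam moments) and an even split of parameters across TP and PP ranks; the function name is illustrative and embeddings, padding and buffers are ignored.

```python
BYTES_MASTER_FP32 = 4   # FP32 master copy of each parameter
BYTES_ADAM_M = 4        # FP32 first moment
BYTES_ADAM_V = 4        # FP32 second moment


def optimizer_state_gb(params: float, tp: int = 1, pp: int = 1,
                       dp: int = 1, distributed: bool = False) -> float:
    """Approximate optimiser-state memory per rank in GB (10^9 bytes)."""
    per_param = BYTES_MASTER_FP32 + BYTES_ADAM_M + BYTES_ADAM_V  # 12 bytes
    params_per_rank = params / (tp * pp)       # model-parallel sharding
    if distributed:
        params_per_rank /= dp                  # distributed-optimiser sharding
    return params_per_rank * per_param / 1e9


if __name__ == "__main__":
    print(optimizer_state_gb(70e9))                                   # unsharded total
    print(optimizer_state_gb(70e9, dp=32, distributed=True))          # DP sharding only
    print(optimizer_state_gb(70e9, tp=8, pp=4))                       # model parallel only
    print(optimizer_state_gb(70e9, tp=8, pp=4, dp=8, distributed=True))
```

The last two calls show the effect on a TP=8, PP=4, DP=8 shape: model parallelism alone leaves about 26 GB of optimiser state per rank, and turning on the distributed optimiser brings that to roughly 3.3 GB.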

Tensor parallel group: shards weights of one transformer block, AllReduce per block. TP=2/4/8 inside one NVLink island.
Pipeline parallel group: layers split into K stages, P2P sends across stages. Interleaved 1F1B for minimal bubble.
Sequence parallel: TP-group sub-mode; shards LayerNorm/dropout activations along sequence dim. Free at long context.
Context parallel: shards the sequence itself across CP ranks; required at L > ~64K.
Data parallel: replicates everything else; AllReduce gradients each step. Distributed Adam shards the optimiser inside the DP group.
Transformer Engine: BF16/FP16 default, FP8 (E4M3/E5M2) on Hopper and Blackwell, FP4 (MXFP4) on Blackwell.
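Selective recomputation: only the attention-score activations, which are large but cheap to recompute, are dropped and recomputed; the FLOPs overhead is a few percent (versus 30-40 percent for full recomputation) and, combined with sequence parallelism, activation memory falls by roughly 5x.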
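The effect of sequence parallelism and selective recomputation on activation memory can be estimated from the per-layer formulas in Korthikanti et al. (2022). The sketch below encodes those formulas; they assume a standard attention implementation with dropout, a GELU MLP and 16-bit activations, so treat the result as an order-of-magnitude guide rather than a measurement of a FlashAttention or SwiGLU model. The function name is illustrative.

```python
def activation_bytes_per_layer(s: int, b: int, h: int, a: int, t: int = 1,
                               sequence_parallel: bool = False,
                               selective_recompute: bool = False) -> float:
    """Per-layer activation bytes following Korthikanti et al. (2022).

    s: sequence length, b: micro-batch size, h: hidden size,
    a: attention heads, t: tensor-parallel size.
    """
    sbh = s * b * h
    # Attention-score activations scale as 5*a*s/h per element of sbh;
    # selective recomputation discards exactly these.
    attention = 0.0 if selective_recompute else 5 * a * s / (h * t)
    if sequence_parallel:
        linear = 34 / t            # every remaining activation is sharded t ways
    else:
        linear = 10 + 24 / t       # LayerNorm/dropout inputs stay replicated
    return sbh * (linear + attention)


if __name__ == "__main__":
    shape = dict(s=8192, b=1, h=8192, a=64, t=8)
    baseline = activation_bytes_per_layer(**shape)
    optimised = activation_bytes_per_layer(
        **shape, sequence_parallel=True, selective_recompute=True
    )
    print(baseline / 2**30, optimised / 2**30)   # GiB per layer
```

For the 70B-style shape used here the per-layer footprint falls from about 3.3 GiB with tensor parallelism alone to about 0.27 GiB with sequence parallelism and selective recomputation both enabled.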

The empirical rule from the 2021 Narayanan paper still holds: pick TP to fill the NVLink island (TP=8 on a DGX H100), then PP to fit the model in aggregate memory, then DP for the rest. Never use TP across InfiniBand; never use PP without enough micro-batches to keep the bubble small.

Reference and specifications

Megatron-LM exposes its surface as a flag set on the `pretrain_gpt.py` / `pretrain_bert.py` / `pretrain_t5.py` entry-point scripts. The table below is the canonical reference for the flags that govern parallelism, precision, memory, and the optimiser as of Megatron-LM 0.8 / Megatron Core 0.11 (June 2026). Flags marked with an asterisk are also available as `TransformerConfig` fields in the Megatron Core library API.

Flag	Type	Default	Description
--num-layers *	int	(required)	Number of transformer blocks in the model.
--hidden-size *	int	(required)	Hidden dimension d_model.
--num-attention-heads *	int	(required)	Total attention heads (must divide hidden-size and be divisible by the TP size).
--num-query-groups *	int	= heads	Number of KV heads for GQA / MQA. 1 = MQA, < heads = GQA.
--ffn-hidden-size *	int	4 * hidden	Intermediate MLP dimension.
--seq-length *	int	(required)	Training sequence length in tokens.
--max-position-embeddings *	int	= seq-length	Position-embedding table size; relevant for absolute PE.
--position-embedding-type *	string	learned_absolute	learned_absolute | rope | alibi | none.
--rotary-percent *	float	1.0	Fraction of head_dim that gets RoPE rotation.
--swiglu *	bool	false	Use SwiGLU MLP (Llama / Mistral); raises FFN compute ~1.5x.
--normalization *	string	LayerNorm	LayerNorm | RMSNorm.
--tensor-model-parallel-size *	int	1	TP group size; shards each block's weights across N GPUs via NCCL.
--pipeline-model-parallel-size *	int	1	PP stage count; partitions layers across stages.
--virtual-pipeline-model-parallel-size	int	(off)	Interleaved 1F1B virtual stages; reduces bubble at high PP.
--context-parallel-size *	int	1	Shards the sequence dimension across CP ranks for very long context.
--sequence-parallel *	bool	false	Sharded LayerNorm/dropout activations within the TP group.
--expert-model-parallel-size *	int	1	MoE expert parallelism; shards experts across the EP group.
--num-experts *	int	(off)	Total MoE experts; presence enables Mixtral-style routing.
--moe-router-topk *	int	2	Top-K routing for MoE.
--micro-batch-size *	int	(required)	Per-rank batch within one forward; small (1-8) for big models.
--global-batch-size *	int	(required)	Global batch across DP; sets gradient-accumulation count.
--train-iters	int	(required)	Number of optimiser steps to train for.
--lr / --min-lr	float	(required)	Peak and floor learning rate for the schedule.
--lr-decay-style	string	linear	linear | cosine | inverse-square-root | constant.
--lr-warmup-iters	int	0	Linear warmup steps before the main schedule kicks in.
--clip-grad	float	1.0	Global gradient-norm clipping threshold.
--weight-decay	float	0.01	AdamW weight decay.
--adam-beta1 / --adam-beta2	float	0.9 / 0.999	Adam exponential decay rates.
--fp16 *	bool	false	Mixed-precision training with FP16 + loss scaling.
--bf16 *	bool	false	Mixed-precision training with BF16 (preferred on Ampere+).
--fp8-format *	string	(off)	hybrid (E4M3 forward, E5M2 backward) | e4m3. Requires Transformer Engine + Hopper/Blackwell; stored as the `fp8` config field.
--fp8-amax-history-len	int	1024	Steps of amax history for FP8 scaling factors.
--use-flash-attn *	bool	false	Enable FlashAttention (FA2 on Ampere, FA3 on Hopper).
--use-distributed-optimizer *	bool	false	Distributed Adam (ZeRO-1 equivalent); shards optimiser state across DP.
--overlap-grad-reduce	bool	false	Overlap gradient ReduceScatter with backward compute.
--overlap-param-gather	bool	false	Overlap parameter AllGather with the next forward.
--recompute-granularity	string	(off)	selective | full. Selective recomputes only the largest activations.
--recompute-method	string	uniform	uniform | block. Applies with full granularity: uniform recomputes every chunk of --recompute-num-layers layers; block recomputes only a set number of layers per pipeline stage.
--recompute-num-layers	int	1	Layers per recompute chunk (uniform) or number of layers recomputed per pipeline stage (block).
--data-path *	string	(required)	Prefix to the preprocessed indexed dataset .bin/.idx.
--tokenizer-type	string	GPT2BPETokenizer	GPT2BPETokenizer | SentencePieceTokenizer | HuggingFaceTokenizer | Llama3Tokenizer.
--save / --load	path	(required)	Checkpoint save and resume directories.
--save-interval	int	(required)	Steps between checkpoint writes.
--ckpt-format	string	torch	torch | torch_dist. Use torch_dist for sharded async writes.
--tensorboard-dir	path	(off)	Enables TensorBoard logging.
--wandb-project	string	(off)	Enables Weights & Biases logging.

`--use-flash-attn` is a hard requirement at scale. With it off, Megatron falls back to a fused PyTorch attention path that materialises the full L x L attention matrix in HBM. For L=8192 in BF16 that is 128 MiB per head per sequence, or 8 GiB per layer for a 64-head model at micro-batch size 1 before TP sharding, and it will OOM before any other config matters.
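The size of a materialised attention-score matrix follows directly from the sequence length, so it is worth computing before choosing a sequence length. A minimal sketch, assuming 2 bytes per element (BF16) and an illustrative function name:

```python
def attention_scores_bytes(seq_len: int, heads: int = 1, micro_batch: int = 1,
                           bytes_per_element: int = 2) -> int:
    """Bytes needed to materialise the L x L attention-score matrix."""
    return seq_len * seq_len * bytes_per_element * heads * micro_batch


MIB = 1024 ** 2
GIB = 1024 ** 3

per_head = attention_scores_bytes(8192)               # one head, one sequence, BF16
all_heads = attention_scores_bytes(8192, heads=64)    # 64-head layer before TP sharding
print(per_head // MIB, "MiB per head;", all_heads // GIB, "GiB for 64 heads")
```

The quadratic term means doubling the sequence length quadruples this footprint, which is the cost FlashAttention avoids by never writing the full matrix to HBM.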

Workload patterns

Three workload shapes cover the bulk of Megatron-LM production usage: dense LLM pretraining from random weights at the 7B / 70B / 405B scales, continued pretraining of an open model on domain data, and large-scale SFT on instruction corpora. Each has its own parallelism and precision profile, and each maps cleanly to a Yobitel NeoCloud training-pod size — Pattern A on a 32-node (256-GPU) H100 pod with InfiniBand NDR, Pattern B on a 16-node (128-GPU) pod, Pattern C on a single 8x H100 node.

Pattern A: dense pretraining of a Llama-style 70B from random weights on a 256-GPU H100 cluster (the canonical Yobitel NeoCloud 32-node training pod).
Pattern B: continued pretraining of Llama-3.1 70B on 100B tokens of domain corpus, starting from a HuggingFace checkpoint converted to Megatron format, on a 16-node NeoCloud pod.
Pattern C: large-scale supervised fine-tuning of Llama-3.1 8B on a 5M-example instruction set using packing, on a single 8x H100 NeoCloud node.

bash

# A — Llama-style 70B pretrain on 32 nodes x 8 H100 (256 GPUs)
#     TP=8 (intra-node NVLink) x PP=4 (cross-node IB) x DP=8
torchrun --nproc_per_node=8 --nnodes=32 \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:29500 \
    pretrain_gpt.py \
    --num-layers 80 --hidden-size 8192 --num-attention-heads 64 \
    --num-query-groups 8 --ffn-hidden-size 28672 \
    --seq-length 8192 --max-position-embeddings 8192 \
    --position-embedding-type rope --swiglu --normalization RMSNorm \
    --micro-batch-size 1 --global-batch-size 1024 \
    --train-iters 480000 \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 4 \
    --virtual-pipeline-model-parallel-size 5 \
    --sequence-parallel --use-flash-attn \
    --bf16 --fp8-format hybrid \
    --use-distributed-optimizer \
    --overlap-grad-reduce --overlap-param-gather \
    --recompute-granularity selective \
    --lr 1.5e-4 --min-lr 1.5e-5 --lr-decay-style cosine --lr-warmup-iters 4000 \
    --data-path ${DATA_PREFIX} \
    --tokenizer-type Llama3Tokenizer \
    --save ${CKPT_DIR} --load ${CKPT_DIR} --save-interval 2000 \
    --ckpt-format torch_dist

# B — Continued pretrain of Llama-3.1 70B on domain corpus, 16 nodes
#     Start from a HF -> Megatron-converted checkpoint
python tools/checkpoint/convert.py \
    --model-type GPT --loader llama_mistral --saver megatron \
    --load-dir ${HF_DIR} --save-dir ${MEGATRON_INIT} \
    --target-tensor-parallel-size 8 --target-pipeline-parallel-size 2

torchrun --nproc_per_node=8 --nnodes=16 ... \
    pretrain_gpt.py \
    --load ${MEGATRON_INIT} --no-load-optim --no-load-rng \
    --finetune \
    --global-batch-size 512 --micro-batch-size 1 \
    --train-iters 100000 --lr 5e-5 --min-lr 5e-6 \
    --tensor-model-parallel-size 8 --pipeline-model-parallel-size 2 \
    --sequence-parallel --use-flash-attn --bf16 \
    --use-distributed-optimizer

# C — SFT of Llama-3.1 8B on instruction corpus, single 8x H100 node
#     Use NeMo or NeMo-Aligner for the loss/packing surface
#     (run from a NeMo checkout; script and config names vary by NeMo release)
torchrun --nproc_per_node=8 \
    examples/nlp/language_modeling/megatron_gpt_finetune.py \
    --config-path conf --config-name megatron_gpt_sft \
    model.restore_from_path=${LLAMA_8B_CKPT} \
    model.tensor_model_parallel_size=2 \
    model.pipeline_model_parallel_size=1 \
    model.data.train_ds.packed_sequence=True \
    model.data.train_ds.max_seq_length=8192 \
    model.optim.lr=2e-6 \
    trainer.max_steps=20000

Sizing and capacity planning

The two questions that drive Megatron sizing are: which parallelism shape fits the model, and how many GPUs do I need to finish in a fixed wall-clock? The table below gives reference parallelism configs and observed throughput on H100 SXM5 clusters with 400Gb InfiniBand NDR, BF16+FP8 mixed precision, FA3 attention, sequence parallel on, distributed optimiser on, and selective recomputation. Throughput is per-GPU sustained tokens-per-second from internal training-run telemetry and the published Nemotron and NVIDIA H100 MLPerf records; treat the figures as planning anchors and cross-check any of them against the ~6 x parameters x tokens-per-second estimate of sustained model FLOPs, which cannot exceed the device peak at the chosen precision.

Weak-scaling efficiency on H100 with the above recipe holds at 85-92 percent from 256 to 4,096 GPUs; above 4,096 it drops to 75-85 percent as DP AllReduce starts to dominate.
Move to Blackwell (B200) and MXFP4 for a roughly 2.7x throughput uplift on dense 70B-class training versus the same shape on H100 FP8.
MoE models add expert parallelism (EP), which is carved out of the data-parallel ranks rather than multiplying the world size; EP=8 is the typical choice for 8-expert top-2 routing. Expert AllToAll dominates at EP>8 unless on NVLink Switch.
Context parallelism becomes necessary above ~64K sequence length; below that, sequence parallel inside TP is the simpler choice.

Model size	Cluster	TP x PP x DP	Global batch	Per-GPU tok/s	Days for 1T tokens
8B (Llama-style)	8x H100	1 x 1 x 8	1024	12,500	115.7
8B	64x H100	1 x 1 x 64	1024	11,800	15.3
70B	256x H100	8 x 4 x 8	1024	4,200	10.8
70B	512x H100	8 x 4 x 16	1024	4,000	5.6
175B (GPT-3 class)	1024x H100	8 x 8 x 16	1536	2,400	4.7
340B (Nemotron class)	2048x H100	8 x 16 x 16	2304	1,500	3.8
405B (Llama-3.1 class)	4096x H100	8 x 16 x 32	2304	1,250	2.3
405B on Blackwell	1024x B200	8 x 8 x 16	2304	3,400	3.3
1T (frontier)	8192x H100	8 x 32 x 32	4096	750	1.9
8x22B MoE (Mixtral class)	512x H100	8 x 4 x 16 EP=8	1024	2,800	8.0
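The last column of the table is pure arithmetic on the two columns before it, so it can be recomputed for any cluster size or token budget. A minimal sketch with an illustrative function name:

```python
SECONDS_PER_DAY = 86_400


def days_for_tokens(tokens: float, per_gpu_tok_s: float, gpus: int) -> float:
    """Wall-clock days to process `tokens` at a sustained per-GPU throughput."""
    cluster_tok_s = per_gpu_tok_s * gpus
    return tokens / (cluster_tok_s * SECONDS_PER_DAY)


if __name__ == "__main__":
    # 70B row: 256 GPUs at 4,200 tokens/s each, 1T-token budget.
    print(round(days_for_tokens(1e12, 4_200, 256), 1))
```

This reproduces the 10.8-day figure for the 70B / 256-GPU row; it assumes perfectly sustained throughput, so restarts, checkpoint stalls and evaluation passes come on top.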

Run the official `examples/llama` recipes as your baseline before adapting. They encode flag values (--overlap-grad-reduce, --overlap-param-gather, --virtual-pipeline-model-parallel-size, --recompute-granularity selective) that are easy to forget and cost 10-20 percent of throughput when missed.

Limits and quotas

Megatron itself has few hard limits; what bounds a run are GPU memory, NCCL group counts, and the data-loader contract. The table below summarises the constraints worth knowing before designing a parallelism shape.

Constraint	Default / ceiling	How to manage
hidden_size divisible by TP	Required	Choose TP from {1,2,4,8} that divides hidden.
num_attention_heads divisible by TP	Required	Constrains TP for thin-head architectures.
num_query_groups divisible by TP	Required for GQA	If GQA=8, TP cannot exceed 8.
num_layers divisible by PP * VPP	Required	Choose PP so layers split evenly.
seq_length divisible by TP (with SP)	Required	Pad short batches; choose TP that divides seq.
seq_length divisible by CP	Required for CP	Choose CP from {1,2,4,8} that divides seq.
Micro-batch size	1 typical at 70B+	Larger MB inflates activations; PP needs >= PP micro-batches.
NCCL communicator count	World-size dependent	Set NCCL_COMM_ID, NCCL_NET_GDR_LEVEL; use NCCL >= 2.20.
Indexed dataset size	Per-file ~1TB	Split into multiple .bin/.idx files; Megatron concatenates.
Checkpoint size	Model + 12 bytes/param (Adam)	Use --ckpt-format torch_dist for sharded async writes.
FP8 amax history	1024 steps	Raise for very long training; storage cost negligible.
Single-step wallclock	30-90s typical at 70B	Iteration time should hold within +-3 percent in steady state.
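The divisibility rows in the table lend themselves to a pre-flight check. The sketch below is a hypothetical validator (not a Megatron-LM utility) that reports which of those constraints a proposed shape violates; it covers only the divisibility rules listed above, not memory or NCCL limits.

```python
def check_shape(hidden: int, heads: int, kv_groups: int, layers: int, seq: int,
                tp: int = 1, pp: int = 1, vpp: int = 1, cp: int = 1,
                sequence_parallel: bool = False) -> list[str]:
    """Return the divisibility constraints from the table that a shape violates."""
    problems = []
    if hidden % tp:
        problems.append("hidden_size not divisible by TP")
    if heads % tp:
        problems.append("num_attention_heads not divisible by TP")
    if kv_groups % tp:
        problems.append("num_query_groups not divisible by TP")
    if layers % (pp * vpp):
        problems.append("num_layers not divisible by PP * VPP")
    if sequence_parallel and seq % tp:
        problems.append("seq_length not divisible by TP (sequence parallel)")
    if seq % cp:
        problems.append("seq_length not divisible by CP")
    return problems


if __name__ == "__main__":
    # Pattern A from this entry: 70B Llama-style, TP=8, PP=4, VPP=5.
    print(check_shape(hidden=8192, heads=64, kv_groups=8, layers=80, seq=8192,
                      tp=8, pp=4, vpp=5, sequence_parallel=True))
```

An empty list means the shape passes every divisibility rule; running the same check with VPP=6 or TP=16 flags the layer-count and query-group constraints respectively.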

Observability

Megatron-LM emits TensorBoard scalars and optional Weights & Biases logs covering per-step loss, learning rate, gradient norm, samples-per-second, TFLOPs-per-GPU, FP8 amax, optimiser-state norms, and (under --log-memory-to-tensorboard) GPU memory peaks per phase. The metrics that matter operationally at training-cluster scale are throughput stability, gradient-norm sanity, and FP8 scaling-factor health.

On Hopper / Blackwell, pair Megatron logs with NVIDIA DCGM exporter (`DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_MEM_COPY_UTIL`, `DCGM_FI_PROF_SM_ACTIVE`, `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`, `DCGM_FI_PROF_NVLINK_TX_BYTES`) and NCCL profiling (`NCCL_DEBUG=INFO`, `NCCL_DEBUG_SUBSYS=COLL`). The three signals that detect 90 percent of training-cluster problems are: iteration time variance, NCCL AllReduce p99 latency, and per-rank step-loss divergence.

iteration-time / samples-per-second: holds within +-3 percent in steady state; >10 percent dip means stragglers, IB packet loss, or thermal throttling.
TFLOPs-per-GPU: should hit 40-60 percent of the device peak; drops correlate with attention shape or recompute config.
grad-norm: spikes above 10x baseline indicate divergence; correlate with LR schedule.
loss curve: per-rank loss should match within numerical noise; divergence means a DP rank dropped or got bad data.
FP8 amax_history_max: persistent saturation indicates clip-and-rescale should fire; never-saturating means FP8 is wasted.
DCGM PIPE_TENSOR_ACTIVE: the most honest tensor-core utilisation signal; pair with SM_ACTIVE.
NCCL_DEBUG=WARN in steady state; flip to INFO when investigating slow steps.
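The TFLOPs-per-GPU signal can be cross-checked from throughput alone. The sketch below uses the common approximation of about 6 FLOPs per parameter per token for a dense transformer (forward plus backward), ignoring the attention-score term and any recomputation. The peak figure is an assumed datasheet value for H100 dense BF16 and the run itself is hypothetical; substitute the numbers for your own SKU and precision.

```python
def model_flops_utilisation(params: float, per_gpu_tok_s: float,
                            peak_tflops: float) -> float:
    """Fraction of peak reached, using the ~6 * params FLOPs-per-token estimate."""
    achieved_tflops = 6 * params * per_gpu_tok_s / 1e12
    return achieved_tflops / peak_tflops


H100_BF16_DENSE_TFLOPS = 989.0  # assumed datasheet peak; substitute your SKU

# Hypothetical run: a 70B dense model logging 1,000 tokens/s per GPU.
mfu = model_flops_utilisation(70e9, 1_000, H100_BF16_DENSE_TFLOPS)
print(f"{mfu:.1%}")
```

A result above 1.0 means the throughput figure and the peak figure cannot both be right, which usually points to a miscounted GPU total or a peak quoted for a different precision.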

yaml

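# Prometheus rules for a Megatron-LM pretraining job
# The megatron:* and nccl:* series are assumed to come from your own exporter
# or recording rules built on the TensorBoard / W&B scalars described above.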
groups:
  - name: megatron-training
    interval: 60s
    rules:
      - alert: MegatronIterationTimeRegression
        expr: |
          (avg_over_time(megatron:iteration_time_seconds[5m])
           / avg_over_time(megatron:iteration_time_seconds[1h] offset 30m)) > 1.10
        for: 10m
        labels: { severity: warning, team: training }
        annotations:
          summary: "Iteration time +10% vs 1h baseline on {{ $labels.job_name }}"

      - alert: MegatronGradNormSpike
        expr: megatron:grad_norm > 10 * avg_over_time(megatron:grad_norm[1h] offset 30m)
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Grad-norm spike — investigate LR or data corruption"

      - alert: MegatronTFLOPsCollapse
        expr: |
          avg_over_time(megatron:tflops_per_gpu[10m])
            < 0.5 * avg_over_time(megatron:tflops_per_gpu[1h] offset 30m)
        for: 15m
        labels: { severity: critical }
        annotations:
          summary: "Per-GPU TFLOPs halved — check NCCL, FA3 kernel, recompute config"

      - alert: MegatronNCCLAllReduceP99
        expr: |
          histogram_quantile(0.99,
            rate(nccl:allreduce_seconds_bucket[5m])) > 0.2
        for: 10m
        annotations:
          summary: "NCCL AllReduce p99 > 200ms — fabric or topology regression"

      - alert: MegatronFP8AmaxSaturation
        expr: megatron:fp8_amax_history_max / megatron:fp8_amax_history_threshold > 0.95
        for: 30m
        annotations:
          summary: "FP8 amax persistently saturated — recompute scale or fall back to BF16"

Cost and FinOps

Megatron-LM cost is almost entirely a function of two numbers: GPU-hour rate and per-GPU sustained throughput. The training cost per trillion tokens for a given model size is `cluster_GPUs x hours x hourly_rate`, where hours = 1T / (per_GPU_tok_s x cluster_GPUs x 3600); the cluster size cancels, so the cost depends only on throughput and rate. The table below uses Yobitel UK list pricing (June 2026) for on-demand H100 SXM5 ($3.10/GPU-hour), reserved H100 ($1.95), on-demand B200 ($5.50) and reserved B200 ($3.45), with the throughput anchors from the Sizing table.

Reserved capacity (1-3 year terms) drops the rate ~35-40 percent for predictable multi-month pretraining runs.
Blackwell B200 delivers two to three times the per-GPU tok/s of H100 on dense models, so the higher hourly rate is recovered through a lower cost per token.
FP8 weights + FP8 activations cut wall-clock 1.5-1.8x on H100 versus pure BF16; FP4 on Blackwell another 1.5x.
Checkpoint storage is a non-trivial line item: a 405B checkpoint with Adam state is ~5.7 TB (2 bytes of BF16 weights plus 12 bytes of FP32 master weights and Adam moments per parameter); sharded async writes (`--ckpt-format torch_dist`) and pruning intermediate checkpoints matter.
Failed/restarted iterations are the silent cost. A 10-percent restart rate doubles the schedule slip. Spend on storage, network, and DCGM monitoring before spending on more GPUs.

Model	Cluster	Per-GPU tok/s	$/GPU-hour	USD per 1T tokens
70B on H100 on-demand	256x H100	4,200	$3.10	$205,000
70B on H100 reserved	256x H100	4,200	$1.95	$129,000
175B on H100 reserved	1024x H100	2,400	$1.95	$226,000
340B on H100 reserved	2048x H100	1,500	$1.95	$361,000
405B on H100 reserved	4096x H100	1,250	$1.95	$433,000
405B on B200 reserved	1024x B200	3,400	$3.45	$282,000
1T on H100 reserved	8192x H100	750	$1.95	$722,000
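The cost formula stated at the top of this section can be applied to any throughput and rate. A minimal sketch with an illustrative function name; it prices sustained compute only and leaves out storage, networking and restart overhead.

```python
def cost_per_token_budget(tokens: float, per_gpu_tok_s: float, gpus: int,
                          hourly_rate: float) -> float:
    """USD to process `tokens`: cluster GPUs x wall-clock hours x hourly rate."""
    hours = tokens / (per_gpu_tok_s * gpus * 3600)
    return gpus * hours * hourly_rate


if __name__ == "__main__":
    # Same throughput and rate on two cluster sizes: wall-clock halves, cost does not move.
    small = cost_per_token_budget(1e12, 4_200, 256, 3.10)
    large = cost_per_token_budget(1e12, 4_200, 512, 3.10)
    print(round(small), round(large))
```

Because the GPU count appears in both the numerator and the denominator, a larger cluster at the same per-GPU throughput buys a shorter schedule rather than a cheaper run; cost only improves when per-GPU throughput rises or the hourly rate falls.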

Security and compliance

Megatron-LM has no built-in auth surface — it is a training script, not a serving system. Security controls apply at the cluster boundary: Slurm or Kubernetes job submission auth, private container registry for the NVIDIA / NeMo base images, encrypted weights-at-rest on the checkpoint filesystem (Lustre with kernel crypto or NetApp with NVE), and network isolation of the training fabric from public ingress. Training-time data does not leave the cluster; the only egress is checkpoint writes and metrics shipping.

For UK and EU sovereign training programmes — government foundation models, regulated-industry domain pretrains — Megatron-LM runs inside the same sovereign tenancies that satisfy NCSC Cloud Security Principles and G-Cloud 14 lot definitions. The framework itself is Apache 2.0 (with NVIDIA-specific clauses around brand and trademark) and has no telemetry call-home; the only external dependency at runtime is the W&B / TensorBoard sink if you configure one. For air-gapped environments, mirror the NGC and PyPI dependencies into the internal registry and pin Transformer Engine and Apex versions.

Reproducibility for compliance audits requires pinning the entire stack: Megatron-LM commit hash, Transformer Engine version, Apex version, CUDA, cuDNN, NCCL, driver, base image digest. NVIDIA NeMo's NGC containers ship the full pin set in `/etc/nvidia/container-versions.txt`.

Migration and alternatives

Three migration paths dominate. From FSDP/HSDP: the trigger is usually scale — FSDP at FULL_SHARD is competitive through ~30B, manageable to ~70B with HSDP, but loses ground above that as the per-layer AllGather amortises poorly. The move to Megatron buys you true tensor parallelism (no per-layer parameter materialisation) and interleaved 1F1B pipeline. From DeepSpeed: similar capability surface but DeepSpeed leans on ZeRO-3 for memory while Megatron leans on TP+PP for compute scaling. From NeMo: NeMo wraps Megatron Core, so 'migrating to Megatron' from NeMo is really 'dropping a layer of recipe ergonomics' — useful when you need to change something NeMo's Hydra configs do not expose.

From	Trigger to migrate	Effort	What you gain / lose
FSDP / HSDP (PyTorch)	Model > 70B, throughput plateau on Hopper	Medium — config + data pipeline rewrite	Gain TP+PP, FP8 via TE; lose PyTorch-native ergonomics.
DeepSpeed ZeRO-3	Model > 100B or interconnect-bound at ZeRO-3	Low-medium — similar mental model	Gain explicit 3D parallelism; lose ZeRO-Infinity offload.
NeMo Framework	Need flag not exposed by NeMo Hydra config	Low — same engine underneath	Gain config flexibility; lose recipe catalogue + SFT/RLHF wrappers.
JAX + Pax / MaxText	Switching from TPU to GPU at scale	High — different framework family	Gain CUDA-optimised path; lose JAX functional purity.
Colossal-AI	Specific optimisation that Colossal exposes (e.g. heterogeneous training)	Low — Colossal vendors Megatron primitives	Gain Megatron's stability; lose Colossal's research features.

Do not migrate to Megatron-LM for fine-tuning or LoRA work. The framework is heavyweight by design (binary indexed datasets, Megatron checkpoint format, configuration sprawl). For SFT, DPO, LoRA, use NeMo (which wraps Megatron with recipes), torchtune, axolotl, or the HuggingFace trl stack. Megatron is the right tool when you are starting from random weights at 32+ GPUs.

Troubleshooting

The error patterns below account for most production Megatron incidents observed on Yobitel-operated training fleets and the public NVIDIA NeMo issue tracker. Each row maps the observable symptom to the underlying mechanism and the minimum-viable fix.

Symptom	Cause	Fix
NCCL hang on init	Mismatched NCCL versions across nodes or /dev/shm too small.	Pin NCCL >= 2.20 cluster-wide; mount /dev/shm >= 16GB; export NCCL_DEBUG=INFO.
torch.cuda.OOM during forward at step 0	TP/PP shape allocates more activation than expected at peak.	Lower --micro-batch-size to 1; enable --recompute-granularity selective.
loss spikes to NaN after a few hundred steps	FP16 loss scaling collapsed (use --bf16) or gradient clipping off.	Switch to --bf16; set --clip-grad 1.0; check data for outliers.
TFLOPs drop suddenly mid-training	Background daemon stealing SMs, or thermal throttling on a rack.	Pin nvidia-smi clocks; check dmesg for ECC errors; correlate to DCGM thermals.
FP8 amax history all zeros	FP8 not actually enabled (TE not linked) or scale path bypassed.	Verify `--fp8-format hybrid` was accepted; check TE version compatibility; confirm FP8 GEMM kernels appear in an Nsight Systems trace (nvprof does not support Hopper).
Pipeline bubble dominates step time	Too few micro-batches per pipeline fill (M < 4*K).	Raise --global-batch-size or lower --pipeline-model-parallel-size.
Per-rank loss diverges across DP group	DP rank received corrupted data shard or gradient AllReduce failed silently.	Compare weight hashes across DP replicas (`--check-weight-hash-across-dp-replicas-interval` in recent releases); rerun preprocess; verify identical seeds across ranks.
Distributed Adam errors on resume	--use-distributed-optimizer flag flipped between save and load.	Always preserve optimiser flags across resumes; never mix sharded and replicated state.
Sequence parallel produces wrong gradients	Custom CUDA kernel assumes contiguous activation; SP shards it.	Rewrite kernel to take stride args, or disable SP for the affected block.
HuggingFace conversion produces gibberish	Tokenizer or weight-permutation mismatch (GQA, RoPE base, etc.).	Use the version-matched `tools/checkpoint/convert.py`; verify with a 10-token sanity prompt.
torchrun rendezvous fails on multi-node	MASTER_ADDR unreachable or firewall on port 29500.	Use c10d backend with explicit --rdzv_endpoint; open ports; verify hostname resolution.
DataLoader stalls at iteration boundary	Indexed dataset files on slow NFS, or a dataset that is not sharded for parallel reads.	Use Lustre or local NVMe; shard the preprocessed dataset; verify --num-workers; cache .idx files on each node.

Where this fits in the Yobitel stack

Megatron-LM (via NeMo Framework) is the standard training engine on Yobitel sovereign GPU tenancies for any pretraining or continued-pretraining workload above 32 GPUs. Yobitel ships NGC-derived NeMo + Megatron Core containers, pre-validated NCCL and InfiniBand configurations for the H100, H200, and B200 fleets, and reference Slurm + Pyxis launch scripts that encode the parallelism-shape recommendations from this entry's Sizing table.

For customers training UK or EU sovereign foundation models — government, defence, regulated-industry — Megatron runs inside London-1 and Frankfurt-1 tenancies satisfying NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat. Weights never leave the customer tenancy; the Yobitel control plane handles only metrics, scheduling, and health telemetry. Where SFT / DPO / RLHF is needed downstream, the same checkpoints flow into NeMo-Aligner on the same cluster without re-sharding.

Yobibyte, Yobitel's managed AI-native platform, exposes Megatron-trained models for inference via its vLLM and TensorRT-LLM endpoints; InferenceBench v3 then scores throughput and latency on the deployed checkpoints across the H100, H200, B200 and MI300X SKUs the customer rents. The training-to-serving handoff is one HF-format conversion away.

References

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism · arXiv (Shoeybi et al., 2019)
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM · arXiv (Narayanan et al., 2021)
Reducing Activation Recomputation in Large Transformer Models · arXiv (Korthikanti et al., 2022)
Megatron-LM on GitHub · GitHub (NVIDIA)
Megatron Core documentation · NVIDIA
NVIDIA Transformer Engine · GitHub (NVIDIA)
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B · arXiv (Smith et al., 2022)

Quick start#

bash

# 1. Clone Megatron-LM and prepare a sample indexed dataset
git clone https://github.com/NVIDIA/Megatron-LM && cd Megatron-LM
pip install -r requirements.txt
pip install "transformer-engine[pytorch]"  # CUDA 12.4+
# NVIDIA Apex is not the "apex" package on PyPI; if you need it, build it from github.com/NVIDIA/apex

python tools/preprocess_data.py \
    --input ./data/oscar-sample.jsonl \
    --output-prefix ./data/oscar-sample \
    --vocab-file ./vocab/gpt2-vocab.json \
    --merge-file ./vocab/gpt2-merges.txt \
    --tokenizer-type GPT2BPETokenizer \
    --workers 32 --append-eod
# Produces oscar-sample_text_document.{bin,idx}

# 2. Pretrain a 1.3B GPT on 8x H100 with TP=2, DP=4, BF16, FA3, SP
GPUS_PER_NODE=8
torchrun --nproc_per_node=$GPUS_PER_NODE \
    pretrain_gpt.py \
    --num-layers 24 --hidden-size 2048 --num-attention-heads 16 \
    --seq-length 4096 --max-position-embeddings 4096 \
    --micro-batch-size 4 --global-batch-size 256 \
    --train-iters 50000 --lr 2.0e-4 --min-lr 2.0e-5 \
    --lr-decay-style cosine --lr-warmup-iters 2000 \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 1 \
    --sequence-parallel \
    --use-flash-attn \
    --bf16 \
    --use-distributed-optimizer \
    --recompute-granularity selective \
    --data-path ./data/oscar-sample_text_document \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file ./vocab/gpt2-vocab.json \
    --merge-file ./vocab/gpt2-merges.txt \
    --save ./checkpoints/gpt-1b3 --load ./checkpoints/gpt-1b3 \
    --save-interval 5000 --log-interval 10 \
    --tensorboard-dir ./tb

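# 3. Convert the Megatron checkpoint to HuggingFace format for serving
#    convert.py imports tools/checkpoint/loader_<name>.py and saver_<name>.py as plugins;
#    confirm that both named plugins exist in your checkout before running.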
python tools/checkpoint/convert.py \
    --model-type GPT --loader megatron --saver llama_mistral \
    --load-dir ./checkpoints/gpt-1b3 \
    --save-dir ./hf/gpt-1b3-hf \
    --tokenizer-model ./vocab/gpt2-vocab.json
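
The batch arithmetic behind step 2 can be checked before launching. The following is a minimal Python sketch, not part of Megatron-LM: `layout` is a hypothetical helper name, and it assumes the usual relation world size = TP x PP x CP x DP, with the global batch split evenly into micro-batches across DP ranks.

```python
def layout(world_size, tp, pp, micro_batch, global_batch, cp=1):
    """Return (dp, grad_accum_steps) for a TP x PP x CP x DP layout."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("world size must be a multiple of TP * PP * CP")
    dp = world_size // model_parallel
    samples_per_pass = micro_batch * dp
    if global_batch % samples_per_pass != 0:
        raise ValueError("global batch must be a multiple of micro-batch * DP")
    return dp, global_batch // samples_per_pass


# Quick-start shape: 8 GPUs, TP=2, PP=1, micro-batch 4, global batch 256
dp, accum = layout(world_size=8, tp=2, pp=1, micro_batch=4, global_batch=256)
print(f"DP={dp}, gradient-accumulation steps={accum}")
```

With the quick-start flags this gives DP=4 and 16 accumulation steps per optimiser step.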

How it works#

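Tensor parallel group: shards the weights of each transformer block; two AllReduces per block in the forward pass and two in the backward pass (AllGather / ReduceScatter pairs when sequence parallel is on). TP=2/4/8 inside one NVLink island.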
Pipeline parallel group: layers split into K stages, P2P sends across stages. Interleaved 1F1B for minimal bubble.
Sequence parallel: TP-group sub-mode; shards LayerNorm/dropout activations along sequence dim. Free at long context.
Context parallel: shards the sequence itself across CP ranks; required at L > ~64K.
Data parallel: replicates everything else; AllReduce gradients each step. Distributed Adam shards the optimiser inside the DP group.
Transformer Engine: BF16/FP16 default, FP8 (E4M3/E5M2) on Hopper and Blackwell, FP4 (MXFP4) on Blackwell.
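Selective recomputation: only the attention-core activations (large, but cheap to recompute) are dropped and recomputed; a few percent FLOPs overhead, against 30-40 percent for full recomputation, and about 5x activation-memory savings when combined with sequence parallelism (Korthikanti et al., 2022).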
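
To make the "minimal bubble" claim for the pipeline group concrete, here is a small Python sketch of the bubble-fraction estimate from Narayanan et al. (2021): idle time relative to ideal compute time is (p - 1) / m for plain 1F1B and (p - 1) / (v * m) with v interleaved virtual stages. The function name is hypothetical, and the estimate ignores communication cost.

```python
def bubble_fraction(pp, num_microbatches, virtual_stages=1):
    """Pipeline bubble as a fraction of ideal compute time.

    pp               -- number of pipeline stages (p)
    num_microbatches -- micro-batches per optimiser step per pipeline (m)
    virtual_stages   -- interleaved virtual stages per rank (v); 1 = plain 1F1B
    """
    if pp < 1 or num_microbatches < 1 or virtual_stages < 1:
        raise ValueError("pp, num_microbatches and virtual_stages must be >= 1")
    return (pp - 1) / (virtual_stages * num_microbatches)


# Example shape: PP=4, 128 micro-batches per step, with and without interleaving
plain = bubble_fraction(pp=4, num_microbatches=128)
interleaved = bubble_fraction(pp=4, num_microbatches=128, virtual_stages=5)
print(f"1F1B bubble: {plain:.2%}, interleaved (v=5): {interleaved:.2%}")
```

The estimate shows why raising the micro-batch count or the number of virtual stages shrinks the bubble, and why a deep pipeline with few micro-batches is expensive.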

Reference and specifications#

Flag	Type	Default	Description
--num-layers *	int	(required)	Number of transformer blocks in the model.
--hidden-size *	int	(required)	Hidden dimension d_model.
--num-attention-heads *	int	(required)	Total attention heads (must divide hidden-size and be divisible by TP size).
--num-query-groups *	int	= heads	Number of KV heads for GQA / MQA; takes effect only together with --group-query-attention. 1 = MQA, < heads = GQA.
--ffn-hidden-size *	int	4 * hidden	Intermediate MLP dimension.
--seq-length *	int	(required)	Training sequence length in tokens.
--max-position-embeddings *	int	= seq-length	Position-embedding table size; relevant for absolute PE.
--position-embedding-type *	string	learned_absolute	learned_absolute | rope | alibi | none.
--rotary-percent *	float	1.0	Fraction of head_dim that gets RoPE rotation.
--swiglu *	bool	false	Use SwiGLU MLP (Llama / Mistral); raises FFN compute ~1.5x.
--normalization *	string	LayerNorm	LayerNorm | RMSNorm.
--tensor-model-parallel-size *	int	1	TP group size; shards each block's weights across N GPUs via NCCL.
--pipeline-model-parallel-size *	int	1	PP stage count; partitions layers across stages.
--num-layers-per-virtual-pipeline-stage	int	(off)	Layers per virtual stage for interleaved 1F1B; virtual stages per rank = num-layers / PP / this value. Reduces bubble at high PP.
--context-parallel-size *	int	1	Shards the sequence dimension across CP ranks for very long context.
--sequence-parallel *	bool	false	Shards LayerNorm/dropout activations along the sequence dimension within the TP group.
--expert-model-parallel-size *	int	1	MoE expert parallelism; shards experts across the EP group.
--num-experts *	int	(off)	Total MoE experts; presence enables Mixtral-style routing.
--moe-router-topk *	int	2	Top-K routing for MoE.
--micro-batch-size *	int	(required)	Per-rank batch within one forward; small (1-8) for big models.
--global-batch-size *	int	(required)	Global batch across DP; sets gradient-accumulation count.
--train-iters	int	(required)	Number of optimiser steps to train for.
--lr / --min-lr	float	(required)	Peak and floor learning rate for the schedule.
--lr-decay-style	string	linear	linear | cosine | inverse-square-root | constant.
--lr-warmup-iters	int	0	Linear warmup steps before the main schedule kicks in.
--clip-grad	float	1.0	Global gradient-norm clipping threshold.
--weight-decay	float	0.01	AdamW weight decay.
--adam-beta1 / --adam-beta2	float	0.9 / 0.999	Adam exponential decay rates.
--fp16 *	bool	false	Mixed-precision training with FP16 + loss scaling.
--bf16 *	bool	false	Mixed-precision training with BF16 (preferred on Ampere+).
--fp8-format *	string	(off)	hybrid | e4m3. Requires Transformer Engine + Hopper/Blackwell.
--fp8-amax-history-len	int	1024	Steps of amax history for FP8 scaling factors.
--use-flash-attn *	bool	false	Enable FlashAttention (FA2 on Ampere, FA3 on Hopper).
--use-distributed-optimizer *	bool	false	Distributed Adam (ZeRO-1 equivalent); shards optimiser state across DP.
--overlap-grad-reduce	bool	false	Overlap gradient ReduceScatter with backward compute.
--overlap-param-gather	bool	false	Overlap parameter AllGather with the next forward.
--recompute-granularity	string	(off)	selective | full. Selective recomputes only the attention-core activations; full recomputes whole layers.
--recompute-method	string	(off)	uniform | block; used with full granularity. Uniform checkpoints the input of every chunk of layers; block recomputes only a fixed number of layers per pipeline stage.
--recompute-num-layers	int	1	Chunk size for uniform, or number of recomputed layers per pipeline stage for block.
--data-path *	string	(required)	Prefix to the preprocessed indexed dataset .bin/.idx.
--tokenizer-type	string	GPT2BPETokenizer	GPT2BPETokenizer | SentencePieceTokenizer | HuggingFaceTokenizer | Llama3Tokenizer.
--save / --load	path	(required)	Checkpoint save and resume directories.
--save-interval	int	(required)	Steps between checkpoint writes.
--ckpt-format	string	torch	torch | torch_dist. Use torch_dist for sharded async writes.
--tensorboard-dir	path	(off)	Enables TensorBoard logging.
--wandb-project	string	(off)	Enables Weights & Biases logging.
--wandb-project	string	(off)	Enables Weights & Biases logging.

Workload patterns#

bash

# A — Llama-style 70B pretrain on 32 nodes x 8 H100 (256 GPUs)
#     TP=8 (intra-node NVLink) x PP=4 (cross-node IB) x DP=8
torchrun --nproc_per_node=8 --nnodes=32 \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:29500 \
    pretrain_gpt.py \
    --num-layers 80 --hidden-size 8192 --num-attention-heads 64 \
    --group-query-attention --num-query-groups 8 --ffn-hidden-size 28672 \
    --seq-length 8192 --max-position-embeddings 8192 \
    --position-embedding-type rope --swiglu --normalization RMSNorm \
    --micro-batch-size 1 --global-batch-size 1024 \
    --train-iters 480000 \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 4 \
    --num-layers-per-virtual-pipeline-stage 4 \
    --sequence-parallel --use-flash-attn \
    --bf16 --fp8-format hybrid \
    --use-distributed-optimizer \
    --overlap-grad-reduce --overlap-param-gather \
    --recompute-granularity selective \
    --lr 1.5e-4 --min-lr 1.5e-5 --lr-decay-style cosine --lr-warmup-iters 4000 \
    --data-path ${DATA_PREFIX} \
    --tokenizer-type Llama3Tokenizer \
    --save ${CKPT_DIR} --load ${CKPT_DIR} --save-interval 2000 \
    --ckpt-format torch_dist

# B — Continued pretrain of Llama-3.1 70B on domain corpus, 16 nodes
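#     Start from a HF -> Megatron-converted checkpoint
#     (recent llama_mistral loaders also expect --checkpoint-type, --model-size and
#     --tokenizer-model; see docs/llama_mistral.md in your checkout)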
python tools/checkpoint/convert.py \
    --model-type GPT --loader llama_mistral --saver megatron \
    --load-dir ${HF_DIR} --save-dir ${MEGATRON_INIT} \
    --target-tensor-parallel-size 8 --target-pipeline-parallel-size 2

torchrun --nproc_per_node=8 --nnodes=16 ... \
    pretrain_gpt.py \
    --load ${MEGATRON_INIT} --no-load-optim --no-load-rng \
    --finetune \
    --global-batch-size 512 --micro-batch-size 1 \
    --train-iters 100000 --lr 5e-5 --min-lr 5e-6 \
    --tensor-model-parallel-size 8 --pipeline-model-parallel-size 2 \
    --sequence-parallel --use-flash-attn --bf16 \
    --use-distributed-optimizer

# C — SFT of Llama-3.1 8B on instruction corpus, single 8x H100 node
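#     Use NeMo or NeMo-Aligner for the loss/packing surface; the script path below is
#     relative to a NeMo checkout, and script and config names vary by NeMo release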
torchrun --nproc_per_node=8 \
    examples/nlp/language_modeling/megatron_gpt_finetune.py \
    --config-path conf --config-name megatron_gpt_sft \
    model.restore_from_path=${LLAMA_8B_CKPT} \
    model.tensor_model_parallel_size=2 \
    model.pipeline_model_parallel_size=1 \
    model.data.train_ds.packed_sequence=True \
    model.data.train_ds.max_seq_length=8192 \
    model.optim.lr=2e-6 \
    trainer.max_steps=20000

Sizing and capacity planning#

Weak-scaling efficiency on H100 with the above recipe holds at 85-92 percent from 256 to 4,096 GPUs; above 4,096 it drops to 75-85 percent as DP AllReduce starts to dominate.
Move to Blackwell (B200) and MXFP4 for a roughly 2.7x throughput uplift on dense 70B-class training versus the same shape on H100 FP8.
MoE models add expert parallelism (EP), which Megatron Core carves out of the data-parallel ranks rather than the TP group; EP=8 is the typical choice for 8-expert top-2 routing. Expert AllToAll dominates at EP>8 unless on NVLink Switch.
Context parallelism becomes necessary above ~64K sequence length; below that, sequence parallel inside TP is the simpler choice.

Model size	Cluster	TP x PP x DP	Global batch (sequences)	Per-GPU tok/s	Days for 1T tokens
8B (Llama-style)	8x H100	1 x 1 x 8	1024	12,500	115.7
8B	64x H100	1 x 1 x 64	1024	11,800	15.3
70B	256x H100	8 x 4 x 8	1024	4,200	10.8
70B	512x H100	8 x 4 x 16	1024	4,000	5.6
175B (GPT-3 class)	1024x H100	8 x 8 x 16	1536	2,400	4.7
340B (Nemotron class)	2048x H100	8 x 16 x 16	2304	1,500	3.8
405B (Llama-3.1 class)	4096x H100	8 x 16 x 32	2304	1,250	2.3
405B on Blackwell	1024x B200	8 x 8 x 16	2304	3,400	3.3
1T (frontier)	8192x H100	8 x 32 x 32	4096	750	1.9
8x22B MoE (Mixtral class)	512x H100	8 x 4 x 16 EP=8	1024	2,800	8.0
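
The "Days for 1T tokens" column follows directly from cluster size and per-GPU throughput. As a sanity check, a minimal Python sketch (the function name is hypothetical, and it assumes throughput holds steady with no restarts or checkpoint stalls):

```python
SECONDS_PER_DAY = 86_400


def days_for_tokens(num_gpus, tok_per_gpu_per_s, tokens=1e12):
    """Wall-clock days to process `tokens` at a sustained per-GPU throughput."""
    if num_gpus <= 0 or tok_per_gpu_per_s <= 0:
        raise ValueError("num_gpus and tok_per_gpu_per_s must be positive")
    cluster_tok_per_s = num_gpus * tok_per_gpu_per_s
    return tokens / cluster_tok_per_s / SECONDS_PER_DAY


# 70B row: 256 GPUs at 4,200 tok/s per GPU
print(f"{days_for_tokens(256, 4200):.1f} days")
```

Plugging in any other row's GPU count and per-GPU tok/s reproduces its days figure to within rounding.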

Limits and quotas#

Constraint	Default / ceiling	How to manage
hidden_size divisible by TP	Required	Choose TP from {1,2,4,8} that divides hidden.
num_attention_heads divisible by TP	Required	Constrains TP for thin-head architectures.
num_query_groups divisible by TP	Required for GQA	If GQA=8, TP cannot exceed 8.
num_layers divisible by PP * VPP	Required	Choose PP so layers split evenly.
seq_length divisible by TP (with SP)	Required	Pad short batches; choose TP that divides seq.
seq_length divisible by CP	Required for CP	Choose CP from {1,2,4,8} that divides seq.
Micro-batch size	1 typical at 70B+	Larger MB inflates activations; PP needs >= PP micro-batches.
NCCL communicator count	World-size dependent	Set NCCL_COMM_ID, NCCL_NET_GDR_LEVEL; use NCCL >= 2.20.
Indexed dataset size	Per-file ~1TB	Split into multiple .bin/.idx files; Megatron concatenates.
Checkpoint size	Model + 12 bytes/param (Adam)	Use --ckpt-format torch_dist for sharded async writes.
FP8 amax history	1024 steps	Raise for very long training; storage cost negligible.
Single-step wallclock	30-90s typical at 70B	Iteration time should hold within ±3 percent in steady state.
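
The divisibility rows of this table are easy to check before submitting a job. The sketch below is a hypothetical pre-flight helper, not Megatron-LM's own argument validation; it encodes only the constraints listed in the table, and the function and parameter names are assumptions.

```python
def check_layout(hidden, heads, kv_groups, layers, seq_len,
                 tp=1, pp=1, vpp=1, cp=1, sequence_parallel=False):
    """Return a list of violated constraints (empty list = layout is OK)."""
    problems = []
    if hidden % tp:
        problems.append("hidden_size not divisible by TP")
    if heads % tp:
        problems.append("num_attention_heads not divisible by TP")
    if kv_groups % tp:
        problems.append("num_query_groups not divisible by TP")
    if layers % (pp * vpp):
        problems.append("num_layers not divisible by PP * VPP")
    if sequence_parallel and seq_len % tp:
        problems.append("seq_length not divisible by TP (sequence parallel)")
    if seq_len % cp:
        problems.append("seq_length not divisible by CP")
    return problems


# Llama-style 70B shape: 80 layers, hidden 8192, 64 heads, 8 KV groups
ok = check_layout(8192, 64, 8, 80, 8192, tp=8, pp=4, vpp=5, sequence_parallel=True)
bad = check_layout(8192, 64, 8, 80, 8192, tp=16, pp=4, vpp=5, sequence_parallel=True)
print(ok)
print(bad)
```

The second call illustrates the GQA row: with 8 query groups, TP=16 is rejected even though hidden size and head count divide cleanly.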

Observability#

iteration-time / samples-per-second: holds within ±3 percent in steady state; >10 percent dip means stragglers, IB packet loss, or thermal throttling.
TFLOPs-per-GPU: should hit 40-60 percent of the device peak; drops correlate with attention shape or recompute config.
grad-norm: spikes above 10x baseline indicate divergence; correlate with LR schedule.
loss curve: per-rank loss should match within numerical noise; divergence means a DP rank dropped or got bad data.
FP8 amax_history_max: persistent saturation indicates clip-and-rescale should fire; never-saturating means FP8 is wasted.
DCGM PIPE_TENSOR_ACTIVE: the most honest tensor-core utilisation signal; pair with SM_ACTIVE.
NCCL_DEBUG=WARN in steady state; flip to INFO when investigating slow steps.
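
The TFLOPs-per-GPU signal can be cross-checked against measured token throughput. The sketch below uses the common approximation of 6 FLOPs per parameter per token for a dense transformer (forward plus backward, ignoring the attention term and any recomputation), so it is a rough lower bound rather than what Megatron itself logs. The throughput in the example is hypothetical, and the 989 TFLOPs figure is an assumed dense BF16 peak for H100 SXM; substitute the peak for your own device and precision.

```python
def model_flops_utilisation(params, tok_per_gpu_per_s, peak_tflops):
    """Return (achieved TFLOPs per GPU, fraction of peak) using ~6 * N FLOPs/token."""
    if params <= 0 or tok_per_gpu_per_s <= 0 or peak_tflops <= 0:
        raise ValueError("all inputs must be positive")
    achieved_tflops = 6 * params * tok_per_gpu_per_s / 1e12
    return achieved_tflops, achieved_tflops / peak_tflops


# Hypothetical example: 70B dense model at 1,000 tok/s per GPU, assumed 989 TFLOPs peak
achieved, mfu = model_flops_utilisation(70e9, 1_000, peak_tflops=989)
print(f"{achieved:.0f} TFLOPs per GPU, {mfu:.1%} of peak")
```

A result above 100 percent of peak means the throughput figure or the parameter count is wrong, which makes this a quick plausibility test for any reported tok/s number.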

yaml

# Prometheus rules for a Megatron-LM pretraining job
# Metric names (megatron:*, nccl:*) are placeholders for your own exporter or recording rules.
groups:
  - name: megatron-training
    interval: 60s
    rules:
      - alert: MegatronIterationTimeRegression
        expr: |
          (avg_over_time(megatron:iteration_time_seconds[5m])
           / avg_over_time(megatron:iteration_time_seconds[1h] offset 30m)) > 1.10
        for: 10m
        labels: { severity: warning, team: training }
        annotations:
          summary: "Iteration time +10% vs 1h baseline on {{ $labels.job_name }}"

      - alert: MegatronGradNormSpike
        expr: megatron:grad_norm > 10 * avg_over_time(megatron:grad_norm[1h] offset 30m)
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Grad-norm spike — investigate LR or data corruption"

      - alert: MegatronTFLOPsCollapse
        expr: avg_over_time(megatron:tflops_per_gpu[10m]) < 0.5 *
              avg_over_time(megatron:tflops_per_gpu[1h] offset 30m)
        for: 15m
        labels: { severity: critical }
        annotations:
          summary: "Per-GPU TFLOPs halved — check NCCL, FA3 kernel, recompute config"

      - alert: MegatronNCCLAllReduceP99
        expr: histogram_quantile(0.99,
                rate(nccl:allreduce_seconds_bucket[5m])) > 0.2
        for: 10m
        annotations:
          summary: "NCCL AllReduce p99 > 200ms — fabric or topology regression"

      - alert: MegatronFP8AmaxSaturation
        expr: megatron:fp8_amax_history_max / megatron:fp8_amax_history_threshold > 0.95
        for: 30m
        annotations:
          summary: "FP8 amax persistently saturated — recompute scale or fall back to BF16"

Cost and FinOps#

Reserved capacity (1-3 year terms) drops the rate ~35-40 percent for predictable multi-month pretraining runs.
Blackwell B200 doubles to triples per-GPU tok/s on dense models; the rate premium pays back within the first month of training.
FP8 weights + FP8 activations cut wall-clock time by 1.5-1.8x on H100 versus pure BF16; FP4 on Blackwell gives roughly another 1.5x.
Checkpoint storage is a non-trivial line item: a 405B checkpoint with Adam state is ~5.7 TB (2 bytes/param of BF16 weights plus 12 bytes/param of FP32 master weights and Adam moments); sharded async writes (`--ckpt-format torch_dist`) and pruning intermediate checkpoints matter.
Failed/restarted iterations are the silent cost. A 10-percent restart rate doubles the schedule slip. Spend on storage, network, and DCGM monitoring before spending on more GPUs.
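
For budgeting checkpoint storage, a rough size estimate is enough. This sketch assumes BF16 weights (2 bytes/param) plus 12 bytes/param of optimiser state (FP32 master weights and two Adam moments); the helper name and the retention count in the example are hypothetical.

```python
def checkpoint_terabytes(params, weight_bytes=2, optimizer_bytes=12, keep=1):
    """Approximate storage in TB (10**12 bytes) for `keep` full checkpoints."""
    if params <= 0 or keep < 1:
        raise ValueError("params must be positive and keep must be >= 1")
    return params * (weight_bytes + optimizer_bytes) * keep / 1e12


# 405B model: one checkpoint, and a rolling window of the last five
print(f"{checkpoint_terabytes(405e9):.2f} TB per checkpoint")
print(f"{checkpoint_terabytes(405e9, keep=5):.1f} TB for five retained checkpoints")
```

Retention policy usually dominates the bill: keeping every intermediate checkpoint multiplies the per-checkpoint figure by the number of saves.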

Model	Cluster	Per-GPU tok/s	$/GPU-hour	USD per 1T tokens
70B on H100 on-demand	256x H100	4,200	$3.10	$205,000
70B on H100 reserved	256x H100	4,200	$1.95	$129,000
175B on H100 reserved	1024x H100	2,400	$1.95	$226,000
340B on H100 reserved	2048x H100	1,500	$1.95	$361,000
405B on H100 reserved	4096x H100	1,250	$1.95	$433,000
405B on B200 reserved	1024x B200	3,400	$3.45	$282,000
1T on H100 reserved	8192x H100	750	$1.95	$722,000

Migration and alternatives#

From	Trigger to migrate	Effort	What you gain / lose
FSDP / HSDP (PyTorch)	Model > 70B, throughput plateau on Hopper	Medium — config + data pipeline rewrite	Gain TP+PP, FP8 via TE; lose PyTorch-native ergonomics.
DeepSpeed ZeRO-3	Model > 100B or interconnect-bound at ZeRO-3	Low-medium — similar mental model	Gain explicit 3D parallelism; lose ZeRO-Infinity offload.
NeMo Framework	Need flag not exposed by NeMo Hydra config	Low — same engine underneath	Gain config flexibility; lose recipe catalogue + SFT/RLHF wrappers.
JAX + Pax / MaxText	Switching from TPU to GPU at scale	High — different framework family	Gain CUDA-optimised path; lose JAX functional purity.
Colossal-AI	Specific optimisation that Colossal exposes (e.g. heterogeneous training)	Low — Colossal vendors Megatron primitives	Gain Megatron's stability; lose Colossal's research features.

Troubleshooting#

Symptom	Cause	Fix
NCCL hang on init	Mismatched NCCL versions across nodes or /dev/shm too small.	Pin NCCL >= 2.20 cluster-wide; mount /dev/shm >= 16GB; export NCCL_DEBUG=INFO.
torch.cuda.OOM during forward at step 0	TP/PP shape allocates more activation than expected at peak.	Lower --micro-batch-size to 1; enable --recompute-granularity selective.
loss spikes to NaN after a few hundred steps	FP16 loss scaling collapsed (use --bf16) or gradient clipping off.	Switch to --bf16; set --clip-grad 1.0; check data for outliers.
TFLOPs drop suddenly mid-training	Background daemon stealing SMs, or thermal throttling on a rack.	Pin nvidia-smi clocks; check dmesg for ECC errors; correlate to DCGM thermals.
FP8 amax history all zeros	FP8 not actually enabled (TE not linked) or scale path bypassed.	Verify `--fp8-format hybrid` accepted; check TE version compatibility; confirm FP8 GEMM kernels appear in an Nsight Systems trace (nvprof does not support Hopper).
Pipeline bubble dominates step time	Too few micro-batches per pipeline fill (M < 4*K).	Raise --global-batch-size or lower --pipeline-model-parallel-size.
Per-rank loss diverges across DP group	DP rank received corrupted data shard or gradient AllReduce failed silently.	Enable --consistency-check; rerun preprocess; verify identical seeds across ranks.
Distributed Adam errors on resume	--use-distributed-optimizer flag flipped between save and load.	Always preserve optimiser flags across resumes; never mix sharded and replicated state.
Sequence parallel produces wrong gradients	Custom CUDA kernel assumes contiguous activation; SP shards it.	Rewrite kernel to take stride args, or disable SP for the affected block.
HuggingFace conversion produces gibberish	Tokenizer or weight-permutation mismatch (GQA, RoPE base, etc.).	Use the version-matched `tools/checkpoint/convert.py`; verify with a 10-token sanity prompt.
torchrun rendezvous fails on multi-node	MASTER_ADDR unreachable or firewall on port 29500.	Use c10d backend with explicit --rdzv_endpoint; open ports; verify hostname resolution.
DataLoader stalls at iteration boundary	Indexed dataset file on slow NFS; preprocess sharded for parallel reads.	Use Lustre or local NVMe; verify --num-workers; cache .idx files on each node.
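
For the rendezvous row, a quick TCP probe from each worker node separates network problems from launcher problems. This is a minimal sketch using only the Python standard library; `can_reach` is a hypothetical helper, and the port only answers once the rank-0 `torchrun` process is already listening.

```python
import os
import socket


def can_reach(host, port=29500, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and hostname resolution failures
        return False


if __name__ == "__main__":
    master = os.environ.get("MASTER_ADDR", "localhost")
    status = "reachable" if can_reach(master) else "NOT reachable"
    print(f"{master}:29500 is {status}")
```

If the probe fails while rank 0 is up, look at firewall rules and hostname resolution before changing any torchrun flags.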
