TL;DR
- Splits the parameter matrices of each transformer block (Q/K/V projections and MLP up/down weights) across N GPUs, with collective communication inside the layer to reassemble the full activation. Memory and compute scale O(1/N) inside the TP group.
- Introduced at scale by Megatron-LM (Shoeybi et al., 2019, arXiv:1909.08053) and now the standard intra-node parallelism strategy for every LLM training and inference stack — Megatron-LM, NeMo, FSDP2 + DTensor, vLLM, TensorRT-LLM, SGLang all use the same column-parallel / row-parallel decomposition.
- Communication-intensive: two AllReduces per transformer block per pass (or AllGather + ReduceScatter with sequence parallelism). Requires NVLink / NVSwitch bandwidth; TP=4-8 is the practical sweet spot. The Yobitel NeoCloud TP=8 reference fits inside one HGX H100/H200/B200 baseboard.
- Yobibyte schedules TP groups onto NVLink-attached node boundaries only, so inference latency stays within the NVLink island and customer recipes do not silently stretch TP across slower InfiniBand fabric.
Overview
Tensor parallelism shards the weights of a single layer across N GPUs, runs the layer collaboratively, and synchronises within the layer. Where data parallelism replicates the model, tensor parallelism partitions it — so a model whose parameters cannot fit on a single GPU can still be trained or served as a single logical replica. The forward and backward graph looks unchanged from outside the TP group; only the bytes-on-device picture changes.
The canonical formulation comes from the 2019 Megatron-LM paper. For a transformer block, the column-parallel and row-parallel decompositions of the MLP and attention projections are chosen so that each sub-block (attention or MLP) needs exactly one AllReduce in the forward pass and one in the backward pass, which gives two AllReduces per transformer block per pass. Pairing a column-parallel matmul with a row-parallel one removes any communication between the two matmuls; the closing AllReduce itself still has to be overlapped with compute to be hidden. Sequence parallelism (Megatron-LM v3, 2022) further replaces those two AllReduces with AllGather + ReduceScatter pairs at the same total byte cost, freeing activation memory along the sequence dimension.
Tensor parallelism is now ubiquitous: it lives inside every modern training framework (Megatron-LM, NeMo, DeepSpeed-Megatron, FSDP2 + DTensor, Colossal-AI) and every modern inference engine (vLLM, TensorRT-LLM, SGLang, Triton TensorRT-LLM backend). On Yobitel NeoCloud, the H100 SXM5 reference node exposes 8 GPUs connected by 900 GB/s NVLink and 3.6 TB/s NVSwitch — the topology TP was designed for. Yobibyte's managed inference and fine-tune services schedule TP groups onto NVLink-attached node boundaries only, so customers cannot accidentally configure TP=16 that straddles two physical nodes and pays InfiniBand latency on every transformer block.
This entry helps you decide when tensor parallelism is the right strategy, which TP size fits your model and hardware, how it composes with DP / PP / SP / CP / EP, and how to avoid the few but expensive mistakes (TP across InfiniBand, mismatched head counts, naive NCCL topology) that turn a working recipe into a bandwidth-bound disappointment.
How it works
Consider the standard transformer MLP block Y = GeLU(X · A) · B, where A is hidden -> intermediate (typically 4x hidden, or 8/3x for SwiGLU) and B is intermediate -> hidden. Megatron's column-parallel decomposition splits A column-wise: each TP rank holds A_i, a contiguous slice of A's columns. Each rank then computes X · A_i locally — the input X is the same on every rank, the partial output is sharded across the intermediate dimension. GeLU is element-wise, so it applies to the local slice without communication. B is split row-wise (each rank holds B_i, a slice of B's rows). The local matmul Y_i = GeLU(X · A_i) · B_i produces a partial sum of the full output Y; a single AllReduce across the TP group sums these partial sums to produce Y on every rank.
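The decomposition above can be checked numerically. The following is a minimal single-process sketch in plain Python, not Megatron's implementation: the TP ranks are simulated by a loop, the AllReduce by an element-wise sum, and the helper names (`tp_mlp`, `reference_mlp`) and the toy matrices are illustrative assumptions.

```python
import math

def gelu(x):
    # Exact GeLU via the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]

def reference_mlp(x, a, b):
    # Unsharded Y = GeLU(X @ A) @ B.
    h = [[gelu(v) for v in row] for row in matmul(x, a)]
    return matmul(h, b)

def tp_mlp(x, a, b, tp):
    """Simulate the same MLP with A column-sharded and B row-sharded over tp ranks."""
    inter = len(b)  # intermediate dimension
    assert inter % tp == 0, "intermediate size must divide by TP"
    width = inter // tp
    partials = []
    for rank in range(tp):
        lo, hi = rank * width, (rank + 1) * width
        a_i = [row[lo:hi] for row in a]  # column slice of A held by this rank
        b_i = b[lo:hi]                   # matching row slice of B
        h_i = [[gelu(v) for v in row] for row in matmul(x, a_i)]  # local, no comms
        partials.append(matmul(h_i, b_i))  # partial sum of the full output
    # Simulated AllReduce: element-wise sum of every rank's partial output.
    return [[sum(p[r][c] for p in partials) for c in range(len(b[0]))]
            for r in range(len(x))]

# Toy shapes: 2 tokens, hidden=3, intermediate=4.
x = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
a = [[0.1 * (i + j) - 0.3 for j in range(4)] for i in range(3)]
b = [[0.2 * (i - j) + 0.1 for j in range(3)] for i in range(4)]

def max_abs_diff(p, q):
    return max(abs(u - v) for rp, rq in zip(p, q) for u, v in zip(rp, rq))

err_tp2 = max_abs_diff(tp_mlp(x, a, b, tp=2), reference_mlp(x, a, b))
err_tp4 = max_abs_diff(tp_mlp(x, a, b, tp=4), reference_mlp(x, a, b))
```

Both errors come out at floating-point rounding level, because GeLU acts element-wise on each column slice and the row-sharded second matmul only changes the order in which the same products are summed.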
The attention block follows the same logic. Q, K, V projections are column-parallel, so each rank holds a subset of the attention heads (for the 64-head Llama 3.1 70B with TP=8, each rank holds 8 heads). Attention is computed entirely locally on each rank's head subset. The output projection is row-parallel and the closing AllReduce sums across the TP group. With grouped-query attention, the KV head count must also be divisible by TP: for Llama 3.1 70B's 8 KV heads, TP=8 is the maximum that keeps one KV head per rank.
Sequence parallelism is an extension that shards the activations of LayerNorm, dropout, and residual additions along the sequence dimension. Without SP, those activations are replicated on every TP rank (because LayerNorm needs the full hidden dimension). With SP, the AllReduce around each MLP/attention block is replaced by AllGather (before the matmul that needs the full sequence) and ReduceScatter (after). The total bytes moved are unchanged, but the resident activation for those layers drops by a factor of N, often making the difference between OOM and 'comfortable' at long context.
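The claim that AllGather + ReduceScatter costs the same as one AllReduce can be illustrated with a toy model. The sketch below simulates the three collectives on Python lists (one list per rank, one element per token) and uses the standard ring-algorithm cost model as an assumption; real NCCL may pick other algorithms, and the function names are illustrative.

```python
def all_reduce(shards):
    """Every rank ends up with the element-wise sum of all ranks' vectors."""
    total = [sum(vals) for vals in zip(*shards)]
    return [list(total) for _ in shards]

def reduce_scatter(shards):
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    n = len(shards)
    total = [sum(vals) for vals in zip(*shards)]
    chunk = len(total) // n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]

def all_gather(chunks):
    """Every rank ends up with the concatenation of all ranks' chunks."""
    full = [v for c in chunks for v in c]
    return [list(full) for _ in chunks]

def ring_bytes_sent_per_rank(payload_bytes, n, collective):
    """Bytes each rank sends under a ring algorithm (assumed cost model)."""
    step = payload_bytes * (n - 1) / n
    return {"all_reduce": 2 * step, "all_gather": step, "reduce_scatter": step}[collective]

# Two ranks, four tokens: each rank holds a partial sum for every token.
partials = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]

scattered = reduce_scatter(partials)   # each rank keeps 2 of the 4 tokens
regathered = all_gather(scattered)     # full sequence restored on demand
same_result = regathered == all_reduce(partials)

same_bytes = (
    ring_bytes_sent_per_rank(1 << 20, 8, "all_reduce")
    == ring_bytes_sent_per_rank(1 << 20, 8, "all_gather")
    + ring_bytes_sent_per_rank(1 << 20, 8, "reduce_scatter")
)
```

Between the ReduceScatter and the next AllGather each rank holds only `len(sequence) / N` tokens, which is where the activation saving for LayerNorm, dropout, and residuals comes from.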
The backward pass mirrors the forward pass. The column-parallel matmul's backward needs an AllReduce of the input gradient (because the input X was replicated and contributed to every rank's partial output); the row-parallel matmul's backward needs no collective, because the gradient of its already-AllReduced output is identical on every rank. NCCL handles these as AllReduce primitives, and how well they overlap with the next backward matmul determines whether a TP recipe sustains 50 percent of theoretical FLOPs or 30 percent.
- Column-parallel matmul: shard along output-feature dim; AllReduce gradient of input in backward.
- Row-parallel matmul: shard along input-feature dim; AllReduce output in forward.
- Pairing column-parallel up-projection with row-parallel down-projection means one AllReduce per MLP block — the magic that makes TP cheap.
- Attention heads shard naturally along the head dimension; GQA imposes the KV-head divisibility constraint on TP size.
- Sequence parallelism: shard LN / dropout / residual along sequence dim; AllReduce becomes AllGather + ReduceScatter at the same total cost.
- Vocabulary parallelism (sharding the output projection across vocab) is a separate but related optimisation; matters for >100k-token vocabularies (Llama 3 = 128K).
TP is kept intra-node because the AllReduce happens twice per transformer block. For Llama 3.1 70B with hidden=8192 and 80 blocks in BF16, that is ~5.2 GB of AllReduce payload per forward pass per rank at roughly 2,000 tokens in flight (2 AllReduces × 80 blocks × tokens × 8192 × 2 bytes); the figure scales linearly with token count. Over NVLink (900 GB/s) it costs ~6 ms; over InfiniBand NDR (50 GB/s effective) it costs ~100 ms, which is on the order of the entire forward pass.
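The arithmetic behind these figures can be reproduced with a short back-of-envelope sketch. It assumes about 2,000 tokens per forward pass (an assumption chosen because it reproduces the ~5.2 GB figure), counts each AllReduce payload once, and ignores per-message latency and the algorithm-dependent factor of the collective. The function names are illustrative.

```python
def tp_forward_allreduce_bytes(tokens, hidden, blocks,
                               bytes_per_elem=2, allreduces_per_block=2):
    """Total AllReduce payload per forward pass for one TP rank."""
    return allreduces_per_block * blocks * tokens * hidden * bytes_per_elem

def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Pure bandwidth time in milliseconds; latency is ignored."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3

# Llama 3.1 70B shape from the text: hidden=8192, 80 blocks, BF16 (2 bytes).
# tokens=2000 is an assumed batch x sequence size, not a value from the source.
payload = tp_forward_allreduce_bytes(tokens=2000, hidden=8192, blocks=80)

nvlink_ms = transfer_ms(payload, 900)  # NVLink figure used in the text
ib_ms = transfer_ms(payload, 50)       # effective InfiniBand NDR figure used in the text

print(f"{payload / 1e9:.2f} GB, NVLink {nvlink_ms:.1f} ms, IB {ib_ms:.1f} ms")
```

Doubling the token count doubles both times, so the gap between the two fabrics stays at the ratio of their bandwidths regardless of batch size.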
Variants and architectural choices
The original Megatron formulation has been extended in several ways. The table below summarises the common variants and when each makes sense.
| Variant | What it adds | When to use |
|---|---|---|
| Megatron TP (1909.08053) | Column + row parallel matmuls with AllReduce per block. | Default; every modern stack implements this. |
| Sequence parallel (2205.05198) | Shards LN/dropout/residual along sequence; AllReduce -> AG+RS. | Always pair with TP at long context; frees activation memory. |
| Async tensor parallel (2024) | Overlaps AllReduce with the next matmul via CUDA graph capture. | Megatron-LM `--tp-comm-overlap`; 5-10 percent throughput uplift. |
| Tensor parallel inference (vLLM/TRT-LLM) | Same decomposition, single forward pass, KV cache sharded along head dim. | Standard for serving any model > 1 GPU. |
| FSDP2 + DTensor TP | PyTorch-native TP composable with FSDP via 2D device mesh. | PyTorch-first stacks at 70B+ where FSDP alone is not enough. |
| TP with quantisation (FP8/INT8) | Per-tensor scaling propagates through the TP group. | Hopper FP8 training; AWQ/GPTQ inference at TP=4-8. |
| Expert parallelism (EP) | Shards MoE experts across an orthogonal axis to TP. | Mixtral / DeepSeek-V3 / Qwen-MoE — compose EP with TP inside attention. |
| Megatron vocab parallelism | Shards output projection across vocab dim. | Required for vocab > 128k to avoid OOM on the final logits matmul. |
When to use vs alternatives
Tensor parallelism is the right choice when a single GPU cannot hold the model weights plus its activations in the chosen precision, the workload runs inside one NVLink-attached node, and the model architecture has heads and hidden dims that divide cleanly by the TP size. The standard recipe for 70B-class training on H100 is TP=8 inside each HGX baseboard, with DP or PP across nodes. For 405B and beyond, TP=8 stays the intra-node choice and PP scales across IB-connected nodes.
For serving, TP is the default for any model that does not fit on a single GPU — vLLM and TensorRT-LLM both use the Megatron decomposition. The Yobitel NeoCloud H100 single-node serving reference for Llama 3.1 70B is TP=8 in BF16 (or TP=4 with FP8/AWQ quantisation), which fits comfortably with KV cache headroom for ~16K context. For Mixtral 8x22B or DeepSeek-V3, expert parallelism composes with TP — see the Yobibyte managed inference path for how recipe topology is enforced.
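A rough per-rank memory estimate shows why these two serving shapes are comparable. The sketch below assumes the public Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128) and a nominal 70e9 parameters; it ignores activations, fragmentation, and runtime overhead, so the numbers are a lower bound rather than a sizing guarantee. The function names are illustrative.

```python
def weight_gb_per_rank(params, bytes_per_param, tp):
    """Weight memory per TP rank, assuming weights shard evenly."""
    return params * bytes_per_param / tp / 1e9

def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """K and V each store kv_heads * head_dim values per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_cache_gb_per_rank(context_tokens, layers, kv_heads, head_dim, tp,
                         bytes_per_elem=2):
    """KV cache per rank for one sequence, sharded along the KV-head dimension."""
    assert kv_heads % tp == 0, "KV heads must divide by TP to shard the cache"
    per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem)
    return context_tokens * per_token / tp / 1e9

# Assumed model shape; 70e9 is a nominal parameter count.
PARAMS, LAYERS, KV_HEADS, HEAD_DIM = 70e9, 80, 8, 128

bf16_tp8 = weight_gb_per_rank(PARAMS, bytes_per_param=2, tp=8)  # BF16 weights
fp8_tp4 = weight_gb_per_rank(PARAMS, bytes_per_param=1, tp=4)   # FP8 weights
kv_16k_tp8 = kv_cache_gb_per_rank(16 * 1024, LAYERS, KV_HEADS, HEAD_DIM, tp=8)

print(f"BF16 TP=8: {bf16_tp8:.1f} GB/rank, FP8 TP=4: {fp8_tp4:.1f} GB/rank, "
      f"KV @16K: {kv_16k_tp8:.2f} GB/rank per sequence")
```

Under these assumptions both shapes hold about 17.5 GB of weights per rank, and a single 16K-token sequence adds well under 1 GB of KV cache per rank at TP=8, which leaves the remaining HBM for concurrent sequences.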
Alternatives have clear domains. FSDP / DeepSpeed ZeRO-3 are simpler and tolerate cross-node fabric but become AllGather-bound past ~70B. Pipeline parallelism is the right across-node strategy because P2P sends tolerate InfiniBand latency. Sequence parallelism is not an alternative to TP — it composes with it. Context parallelism is for the sequence length axis. Expert parallelism is for MoE.
Do not cross InfiniBand boundaries with tensor parallelism unless you have measured. The AllReduce latency over IB-NDR is 10-50x higher than NVLink and will dominate the step time. Use pipeline parallelism across nodes instead. Yobibyte's scheduler enforces this constraint automatically; if you self-operate, validate that your TP world fits inside the NVLink island before scaling out.
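For self-operated clusters, the check can be as simple as the sketch below. It is a hypothetical helper, not Yobibyte's scheduler logic: it assumes every node has the same GPU count, that all GPUs in a node share one NVLink island, and that ranks are laid out with TP as the fastest-varying axis.

```python
def plan_parallelism(world_size, gpus_per_node, tp, pp=1):
    """Derive the DP degree and reject TP groups that would leave the node."""
    if tp > gpus_per_node:
        raise ValueError(
            f"TP={tp} exceeds {gpus_per_node} GPUs per node; "
            "the TP group would span the inter-node fabric"
        )
    if gpus_per_node % tp != 0:
        raise ValueError(f"TP={tp} does not evenly divide {gpus_per_node} GPUs per node")
    if world_size % (tp * pp) != 0:
        raise ValueError(f"world size {world_size} is not a multiple of TP*PP={tp * pp}")
    return {
        "tp": tp,
        "pp": pp,
        "dp": world_size // (tp * pp),
        "tp_groups_per_node": gpus_per_node // tp,
    }

# 8 nodes of 8 GPUs: TP inside each node, PP and DP across nodes.
plan = plan_parallelism(world_size=64, gpus_per_node=8, tp=8, pp=2)
print(plan)
```

Calling `plan_parallelism(16, 8, tp=16)` raises `ValueError` instead of silently producing a layout whose TP collectives cross InfiniBand.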
Trade-offs and known limitations
- Divisibility constraints: hidden_size, num_attention_heads, num_query_groups (for GQA), ffn_hidden_size, and vocab_size (for vocab-parallel) must all divide cleanly by TP. For Llama 3.1 70B (GQA = 8 KV heads), TP can be at most 8.
- Interconnect-bound at TP > 8 even on the best NVSwitch fabric — going to TP=16 by spanning two HGX baseboards adds NVLink Switch hops that double the AllReduce latency.
- TP forces the same batch shape across all TP ranks — no per-rank batch independence, unlike DP.
- Combining TP with quantisation requires per-tensor scaling factors that propagate through the TP group; off-the-shelf quantised checkpoints sometimes assume TP=1 and need conversion.
- Async TP (overlapped AllReduce) requires CUDA graph capture, which conflicts with certain dynamic-shape recipes and selective recomputation patterns.
- TP changes the per-GPU communication pattern, which can interact badly with naively-configured NCCL topologies — set NCCL_TOPO_FILE on multi-node IB fabrics to keep the TP collectives rail-balanced.
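The divisibility constraint in the first bullet is easy to check before launching a job. The sketch below uses the field names from that bullet; the config values are the public Llama 3.1 70B shape and are stated here as an assumption, and the helper name is illustrative.

```python
def tp_divisibility_errors(config, tp, vocab_parallel=True):
    """Return one message per config field that does not divide by the TP size."""
    fields = ["hidden_size", "num_attention_heads", "num_query_groups", "ffn_hidden_size"]
    if vocab_parallel:
        fields.append("vocab_size")
    return [
        f"{name}={config[name]} is not divisible by TP={tp}"
        for name in fields
        if config[name] % tp != 0
    ]

# Assumed Llama 3.1 70B shape.
llama_70b = {
    "hidden_size": 8192,
    "num_attention_heads": 64,
    "num_query_groups": 8,
    "ffn_hidden_size": 28672,
    "vocab_size": 128256,
}

ok_at_8 = tp_divisibility_errors(llama_70b, tp=8)
bad_at_16 = tp_divisibility_errors(llama_70b, tp=16)
print(ok_at_8, bad_at_16)
```

At TP=16 only the KV-head count fails, which matches the statement that GQA is the binding constraint for this model.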
Practical implementation notes
PyTorch's native TP via `torch.distributed.tensor.parallel` (DTensor-based) reached production quality in PyTorch 2.x and is the recommended path for new PyTorch-native projects, especially when composing with FSDP2. Megatron-LM remains the production reference for large-scale training with TP+PP+SP+CP and FP8 via Transformer Engine. vLLM and TensorRT-LLM both implement TP for inference using the same Megatron decomposition — `tensor_parallel_size=8` on vLLM, `--tp_size 8` on TensorRT-LLM.
On Yobitel NeoCloud, the H100 SXM5 baseboard is the canonical TP=8 substrate; H200 SXM5 keeps the same NVLink topology with the same TP shapes but more HBM3e per rank (141 GB vs 80 GB), so TP=8 with KV cache for 32K context becomes practical without quantisation. B200 raises HBM3e capacity again (up to 192 GB per GPU) and adds NVLink 5 (1.8 TB/s), which makes TP=8 even more comfortable. Yobibyte's managed inference path selects TP automatically per model and per region; the Yobibyte managed fine-tune service uses Megatron primitives under the hood and schedules TP onto NVLink-attached nodes by construction.
- Megatron-LM: `--tensor-model-parallel-size 8 --sequence-parallel --tp-comm-overlap` is the canonical training flag set.
- FSDP2 + DTensor: build a 2D mesh via `init_device_mesh("cuda", (dp, tp), mesh_dim_names=("dp", "tp"))` and parallelize_module with column/row-parallel styles.
- vLLM: `tensor_parallel_size=N` on `LLM(...)`; pairs with quantization=awq/gptq for memory-tight serving.
- TensorRT-LLM: `trtllm-build --tp_size N` at engine build; the engine pins the TP shape at compile time.
- NCCL: NCCL >= 2.20 for sm90+; NCCL_BUFFSIZE=8388608 for large TP AllReduces; NCCL_NVLS_ENABLE=1 on NVSwitch for in-network reduction.
- DCGM signals to watch: NVLINK_TX_BYTES (TP collectives), PIPE_TENSOR_ACTIVE (matmul utilisation), SM_ACTIVE (overall busyness).
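As an illustration of the FSDP2 + DTensor bullet, the sketch below shows how a column/row-parallel plan might be applied to one transformer block. The submodule names in `TP_LAYOUT` are hypothetical and must be adapted to the actual model; the PyTorch imports are deferred into the function so the layout table can be inspected without a GPU. It is a sketch under those assumptions, not a complete training setup, and it has to be launched under `torchrun` with `dp * tp` processes.

```python
# Hypothetical submodule paths inside one transformer block.
TP_LAYOUT = {
    "attn.q_proj": "colwise",
    "attn.k_proj": "colwise",
    "attn.v_proj": "colwise",
    "attn.o_proj": "rowwise",    # closes the attention sub-block with an AllReduce
    "mlp.up_proj": "colwise",
    "mlp.down_proj": "rowwise",  # closes the MLP sub-block with an AllReduce
}

def build_mesh(dp, tp):
    """Create the 2D (dp, tp) device mesh; requires an initialised distributed job."""
    from torch.distributed.device_mesh import init_device_mesh
    return init_device_mesh("cuda", (dp, tp), mesh_dim_names=("dp", "tp"))

def apply_tp(block, mesh):
    """Shard one transformer block over the "tp" axis of the mesh."""
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )
    styles = {"colwise": ColwiseParallel, "rowwise": RowwiseParallel}
    plan = {name: styles[kind]() for name, kind in TP_LAYOUT.items()}
    # Only the "tp" sub-mesh is used here; the "dp" axis is left for FSDP.
    return parallelize_module(block, mesh["tp"], plan)
```

Depending on how the attention module is written, its local head count may also need to be divided by the TP size after sharding, since each rank then sees only its slice of the Q/K/V projections.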
Where tensor parallelism sits in the Yobitel stack
Tensor parallelism is the intra-node parallelism axis on every Yobitel NeoCloud training pod and every Yobibyte managed inference deployment. NeoCloud's H100 SXM5, H200 SXM5, and B200 reference nodes expose 8 GPUs connected by NVSwitch — the topology TP was designed for — and the canonical TP=8 fits inside one baseboard. The NeoCloud pre-built training and inference containers ship Megatron-LM, NeMo, FSDP2, vLLM, TensorRT-LLM, and SGLang with TP recipes pre-validated against the fabric.
Yobibyte's managed inference and fine-tune services schedule TP groups onto NVLink-attached node boundaries only — the scheduler refuses to span a TP group across InfiniBand, which removes the most expensive footgun. Customers consuming Yobibyte do not pick TP themselves; the recipe is selected per model and per region by the platform. NeoCloud customers who self-operate (running Megatron-LM, NeMo, or vLLM directly inside their NeoCloud tenancy) keep full control of the TP shape and have access to the same fabric-tuned NCCL configuration.
References
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism · arXiv (Shoeybi et al., 2019)
- Reducing Activation Recomputation in Large Transformer Models (sequence parallelism) · arXiv (Korthikanti et al., 2022)
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM · arXiv (Narayanan et al., 2021)
- Megatron-LM on GitHub · GitHub (NVIDIA)
- PyTorch Tensor Parallelism documentation · PyTorch
- vLLM tensor parallel guide · vLLM docs