TL;DR
- Layer Normalisation (Ba, Kiros & Hinton, 2016, arXiv:1607.06450) normalises an activation across its feature dimension to zero mean and unit variance, then applies learned scale γ and shift β.
- Unlike BatchNorm it is batch-size independent and works fine with batch size 1 and variable sequence lengths — the reason it became standard in RNNs and then Transformers.
- It is applied around the attention and FFN sub-blocks of every Transformer layer; pre-norm placement is now the usual choice for deep stacks.
- Modern decoder-only LLMs have largely replaced it with RMSNorm for a small efficiency gain, but encoder models (BERT, embedding models) and many vision Transformers still use LayerNorm.
What LayerNorm Does
Given an activation vector x of dimension d, LayerNorm computes the per-sample mean and variance across the d feature dimensions, normalises to zero mean and unit variance, then applies a learned per-channel affine transform: LN(x) = γ · (x − μ) / √(σ² + ε) + β, with μ = mean(x), σ² = var(x), both taken across the d axis.
The crucial property is that LN operates per-sample. There is no dependency on other samples in the batch, no running statistics to track, and no train/eval mode mismatch. That is what makes it work for sequence models where batch sizes and sequence lengths vary at every step.
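As a rough illustration of the formula above, here is a minimal pure-Python sketch that normalises a single activation vector. The function name layer_norm and the example values are chosen for illustration only; γ defaults to ones and β to zeros.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalise one activation vector x (a list of floats) across its features."""
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d  # learned scale, identity here
    beta = beta if beta is not None else [0.0] * d     # learned shift, zero here
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d            # biased (population) variance
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mu) * inv_std + b for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x)
print([round(v, 3) for v in y])  # [-1.342, -0.447, 0.447, 1.342]
```

Only the values inside x enter the computation, which is the per-sample property in concrete form.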
Why Not BatchNorm
BatchNorm normalises each feature across the batch dimension. It works well for image classification with large fixed batch sizes but is a poor fit for sequence models for three reasons: variable sequence lengths (and the padding they require) make the batch statistics unreliable, autoregressive decoding often runs at batch size 1, where batch statistics are degenerate, and the running statistics used at inference interact poorly with the variable shapes typical of NLP. LayerNorm avoids all of these by being a pure per-sample operation.
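The difference in normalisation axis can be sketched with a toy batch in plain Python. This is a simplified illustration, not a full BatchNorm: it uses training-mode batch statistics only, with no running averages and no learned affine parameters, and the helper names are made up for the example.

```python
import math

def normalise(values, eps=1e-5):
    """Zero-mean, unit-variance normalisation of a flat list of floats."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]

def layer_norm_batch(batch):
    # Statistics per row: each sample is normalised across its own features.
    return [normalise(row) for row in batch]

def batch_norm_batch(batch):
    # Statistics per column: each feature is normalised across the samples.
    cols = [normalise(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*cols)]

batch_a = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
batch_b = [[1.0, 2.0, 3.0], [-5.0, 0.0, 50.0]]  # same first sample, different neighbour

ln_same = layer_norm_batch(batch_a)[0] == layer_norm_batch(batch_b)[0]
bn_same = batch_norm_batch(batch_a)[0] == batch_norm_batch(batch_b)[0]
print(ln_same, bn_same)  # True False

# With a batch of one, every per-feature variance is zero and the output collapses.
print(batch_norm_batch([[1.0, 2.0, 3.0]]))  # [[0.0, 0.0, 0.0]]
```

The first sample's LayerNorm output is unaffected by its neighbour, whereas its batch-normalised output changes with the rest of the batch and degenerates entirely at batch size 1.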
Pre-Norm versus Post-Norm
The original Transformer placed LayerNorm after each sub-layer's residual addition: y = LN(x + SubLayer(x)). This 'post-norm' placement becomes hard to train as depth grows: every gradient flowing back along the residual path has to pass through a LayerNorm at each layer, and in deeper stacks training often diverges unless learning-rate warm-up is used.
Pre-norm, y = x + SubLayer(LN(x)), places LayerNorm inside the residual branch and leaves the residual stream itself unnormalised. The residual stream then provides an identity path from input to output, so the signal accumulates cleanly across layers and gradient propagation stays well-conditioned. Most modern deep Transformers, including GPT-style decoder-only LLMs, use pre-norm.
Switching a model from post-norm to pre-norm without retraining changes the function it computes — they are not equivalent. The choice has to be made at training time.
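The two wirings can be compared in a few lines of plain Python. This is only a sketch: toy_sublayer is a hypothetical stand-in for attention or an FFN, and layer_norm omits the learned γ and β.

```python
import math

def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or an FFN: any fixed function of x.
    return [0.5 * v + 1.0 for v in x]

def post_norm_block(x, sublayer):
    # y = LN(x + SubLayer(x)): the residual stream itself is normalised.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    # y = x + SubLayer(LN(x)): only the branch input is normalised.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
post, pre = x, x
for _ in range(4):  # stack four identical blocks
    post = post_norm_block(post, toy_sublayer)
    pre = pre_norm_block(pre, toy_sublayer)

mean_post = sum(post) / len(post)  # stays near 0: re-normalised at every layer
mean_pre = sum(pre) / len(pre)     # grows: the stream accumulates branch outputs
print(mean_post, mean_pre)
```

With this toy sub-layer, the post-norm output is re-centred at every block, while the pre-norm stream keeps the input and adds each branch's contribution on top. The two stacks therefore compute different functions from the same weights.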
Implementation
In practice the implementation is fused into a single CUDA kernel via cuDNN, Apex or torch.compile. Welford's algorithm is the standard for computing mean and variance in a single pass with good numerical stability.
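The single-pass update behind Welford's algorithm can be sketched in plain Python as follows. Real fused kernels run a parallel variant of this on the GPU; the function name welford_mean_var is illustrative only.

```python
def welford_mean_var(values):
    """Single-pass mean and biased variance using Welford's update."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)  # uses the mean both before and after the update
    if count == 0:
        raise ValueError("need at least one value")
    return mean, m2 / count  # divide by count: biased variance, as LayerNorm uses

mean, var = welford_mean_var([1.0, 2.0, 3.0, 4.0])
print(mean, var)  # 2.5 1.25

# A large common offset does not disturb the variance estimate.
_, var_offset = welford_mean_var([1e9 + 1.0, 1e9 + 2.0, 1e9 + 3.0, 1e9 + 4.0])
print(var_offset)  # 1.25
```

The second call hints at why this form is preferred over the textbook E[x²] − E[x]² formula, which subtracts two very large, nearly equal numbers when the values share a large offset.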
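# Unfused reference implementation in PyTorch
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, d, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d))   # learned scale γ
        self.beta = nn.Parameter(torch.zeros(d))   # learned shift β
        self.eps = eps

    def forward(self, x):
        # Per-sample statistics across the last (feature) axis
        mu = x.mean(-1, keepdim=True)
        var = x.var(-1, keepdim=True, unbiased=False)  # biased variance, as in the formula
        return self.gamma * (x - mu) / (var + self.eps).sqrt() + self.beta

Theoretical Picture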
LayerNorm is generally credited with smoothing the loss landscape, and it bounds the magnitude of activations independently at each layer. Mean subtraction makes the output invariant to adding a constant to every feature, and variance normalisation makes it invariant to rescaling the input: LN(αx + c) = LN(x) for any α > 0 and constant c, up to the effect of ε. Together these empirically remove most of the dynamic-range problems that otherwise plague deep networks.
Xiong et al. (2020) analyse gradients at initialisation. With post-norm, the expected gradient norm of the parameters near the output layer is O(1) in the depth L, so it stays large however deep the model is, which is why warm-up is needed. With pre-norm it scales as O(1/√L), i.e. gradients shrink gracefully with depth, which helps very deep stacks train stably.
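The shift and scale invariance of the normalisation step can be checked numerically. The sketch below uses plain Python, leaves out γ and β, and picks arbitrary example values; the agreement under rescaling is only approximate because ε is not rescaled with the input.

```python
import math

def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

x = [0.5, -1.0, 2.0, 3.5]
base = layer_norm(x)
scaled = layer_norm([10.0 * v for v in x])   # rescale the input
shifted = layer_norm([v + 7.0 for v in x])   # add a constant to every feature

scale_gap = max(abs(a - b) for a, b in zip(base, scaled))
shift_gap = max(abs(a - b) for a, b in zip(base, shifted))
print(scale_gap < 1e-4, shift_gap < 1e-9)  # True True
```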
Where It Still Lives
Modern decoder-only LLMs use RMSNorm (Llama, Mistral, Qwen, DeepSeek). But LayerNorm remains the default in BERT-family encoders, the embedding models built on top (E5, BGE, GTE), most Vision Transformers (ViT, DINO, SigLIP), Diffusion Transformers (DiT, FLUX) and the speech models built on Conformer architectures.
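For contrast, here is a minimal sketch of the two normalisers side by side in plain Python. RMSNorm, as usually defined, skips the mean subtraction and the shift β and divides by the root mean square instead of the standard deviation; the function names and test vectors are illustrative.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mu) * inv_std + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    # No mean subtraction and no shift: one reduction fewer than LayerNorm.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]

ones, zeros = [1.0] * 4, [0.0] * 4

centred = [-1.5, -0.5, 0.5, 1.5]  # already zero-mean
gap_centred = max(abs(a - b) for a, b in
                  zip(layer_norm(centred, ones, zeros), rms_norm(centred, ones)))

offset = [1.0, 2.0, 3.0, 4.0]     # non-zero mean
gap_offset = max(abs(a - b) for a, b in
                 zip(layer_norm(offset, ones, zeros), rms_norm(offset, ones)))

print(gap_centred < 1e-12, gap_offset > 0.1)  # True True
```

On zero-mean input with β = 0 the two coincide; they differ only when the activations carry a non-zero mean.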