TL;DR
- GELU (Gaussian Error Linear Unit), introduced by Hendrycks & Gimpel in 2016 (arXiv:1606.08415), is defined as f(x) = x · Φ(x), where Φ is the standard Gaussian CDF.
- It is smooth, non-monotonic for small negative inputs, and can be read as the expected output of a stochastic regulariser that drops an input x with probability 1 − Φ(x).
- BERT, GPT-2, GPT-3, ViT and many other Transformers of that period used GELU in their feed-forward blocks; the original 2017 Transformer used ReLU.
- Decoder-only LLMs from 2023 onward have largely moved to SwiGLU, but GELU remains common in encoder-only models, ViTs and embedding models.
Definition
GELU(x) = x · Φ(x), where Φ(x) is the cumulative distribution function of a standard normal distribution. Equivalently, GELU(x) = 0.5 · x · (1 + erf(x / √2)). Unlike ReLU's hard threshold at zero, GELU smoothly weights each input by Φ(x): large positive values pass through nearly unchanged, large negative values are crushed to near zero, and small negative values produce a small negative output.
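As a minimal sketch of the definition above, the erf form can be evaluated with nothing but the Python standard library. The function names `gelu` and `relu` are chosen here for illustration and are not taken from any particular framework.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi written in terms of erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x: float) -> float:
    """ReLU for comparison: hard threshold at zero."""
    return max(0.0, x)


# Print both activations side by side over a few sample inputs.
for x in (-4.0, -1.0, -0.5, 0.0, 0.5, 1.0, 4.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}")
```

Small negative inputs such as −0.5 give a small negative GELU output where ReLU returns exactly zero, while inputs far from zero on either side behave almost like ReLU.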
The original paper offers an interpretation: GELU(x) is the expected value of m · x, where m is a stochastic gate that equals 1 with probability Φ(x) and 0 otherwise. That makes GELU a kind of soft, input-aware dropout, applied in expectation rather than by sampling.
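The expectation view can be checked numerically. The sketch below simulates the gate directly and compares the sample mean with x · Φ(x). The helper `gated_mean`, the trial count and the seed are illustrative choices, not anything from the paper.

```python
import math
import random


def phi_cdf(x: float) -> float:
    """Standard normal CDF, written with erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    return x * phi_cdf(x)


def gated_mean(x: float, trials: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[m * x], where m ~ Bernoulli(Phi(x))."""
    rng = random.Random(seed)
    keep_prob = phi_cdf(x)
    total = 0.0
    for _ in range(trials):
        if rng.random() < keep_prob:  # gate open: keep x
            total += x
        # gate closed: contributes 0
    return total / trials


for x in (-0.5, 1.0):
    print(f"x={x:+.1f}  simulated={gated_mean(x):+.4f}  gelu={gelu(x):+.4f}")
```

The simulated mean should agree with the closed form up to sampling noise, which shrinks as `trials` grows.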
Why It Replaced ReLU in Transformers
ReLU's derivative jumps from 0 to 1 at zero, and for negative inputs its gradient is exactly zero (the 'dead' zone). In small networks this is fine; in deep Transformers the dead zone can contribute to neuron underutilisation. GELU is smooth and has a non-zero gradient for almost every input: the gradient vanishes only at the function's minimum near x ≈ −0.75, although it does decay towards zero for large negative x. Empirically, GELU is generally reported to give small perplexity improvements over ReLU at iso-compute.
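To make the gradient comparison concrete, here is a small sketch using the closed-form derivative GELU'(x) = Φ(x) + x · φ(x), where φ is the standard normal density. This follows from the product rule applied to x · Φ(x). The function names are illustrative.

```python
import math


def phi_pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def phi_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x: float) -> float:
    """d/dx [x * Phi(x)] = Phi(x) + x * phi(x)."""
    return phi_cdf(x) + x * phi_pdf(x)


def relu_grad(x: float) -> float:
    """ReLU subgradient: 0 for x <= 0, 1 for x > 0."""
    return 1.0 if x > 0 else 0.0


for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.1f}  gelu'={gelu_grad(x):+.4f}")
```

For negative inputs ReLU's gradient is exactly zero, while GELU's is small but generally non-zero. It changes sign once, at the function's minimum between x = −0.8 and x = −0.7.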
BERT (2018), GPT-2 (2019) and the Vision Transformer (2020) all used GELU. By 2021, GELU was the de-facto choice for Transformer FFNs.
Exact vs Approximate GELU
The exact formula uses erf, which has traditionally been treated as expensive on GPU. Two cheap approximations are common:
- Tanh approximation: 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³))).
- Sigmoid approximation (also called QuickGELU): x · sigmoid(1.702 · x). Used in OpenAI's CLIP models, including ViT-L/14.
Modern PyTorch defaults to the exact erf form, and fused CUDA kernels make it cheap enough; the approximations now matter mainly for compatibility with checkpoints that were trained with them.
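The two approximations can be compared against the exact form with a short standard-library sketch. The grid range [−6, 6] and step size are arbitrary choices made for illustration.

```python
import math


def gelu_exact(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def quick_gelu(x: float) -> float:
    """Sigmoid approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


# Largest absolute deviation from the exact form on a grid over [-6, 6].
xs = [i / 100.0 for i in range(-600, 601)]
max_err_tanh = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in xs)
max_err_quick = max(abs(quick_gelu(x) - gelu_exact(x)) for x in xs)

print(f"max |tanh  - exact| = {max_err_tanh:.2e}")
print(f"max |quick - exact| = {max_err_quick:.2e}")
```

On this grid the tanh form stays much closer to the exact curve than the sigmoid form, which is why swapping QuickGELU for standard GELU is more likely to produce visible differences.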
Why Decoder-Only LLMs Moved On
Noam Shazeer's 2020 paper 'GLU Variants Improve Transformer' showed that gated variants, including GeGLU (GELU-based) and SwiGLU (Swish-based), outperform plain GELU FFNs at similar parameter counts. GeGLU and SwiGLU scored within a small margin of each other in that paper; SwiGLU was the variant picked up by later large models, and from Llama onward it became the standard for decoder-only LLMs.
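The structural difference between a plain FFN and a gated one can be sketched in pure Python. This is an illustration under simplifying assumptions: a single input vector, no biases, weights as nested lists, and hypothetical names (`ffn`, `glu_ffn`, `W_gate`, `W_up`, `W_out`) that do not come from any specific codebase.

```python
import math


def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x: float) -> float:
    """Swish with beta = 1."""
    return x / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def ffn(x, W_in, W_out, act=gelu):
    """Plain FFN: W_out @ act(W_in @ x)."""
    hidden = [act(h) for h in matvec(W_in, x)]
    return matvec(W_out, hidden)


def glu_ffn(x, W_gate, W_up, W_out, act=silu):
    """Gated FFN: W_out @ (act(W_gate @ x) * (W_up @ x)).

    act=silu gives SwiGLU; act=gelu gives GeGLU.
    """
    gate = [act(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_out, [g * u for g, u in zip(gate, up)])


# Tiny example with hand-picked weights (illustrative values only).
x = [1.0, -2.0]
W_gate = [[1.0, 0.0], [0.0, 1.0]]
W_up = [[2.0, 0.0], [0.0, 3.0]]
W_out = [[1.0, 1.0]]

print("plain GELU FFN:", ffn(x, W_gate, W_out))
print("SwiGLU        :", glu_ffn(x, W_gate, W_up, W_out, act=silu))
print("GeGLU         :", glu_ffn(x, W_gate, W_up, W_out, act=gelu))
```

The gated block has three weight matrices instead of two, so comparisons at equal parameter count typically shrink the hidden width (by a factor of 2/3 in Shazeer's paper).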
GELU has not gone away, though. CLIP uses QuickGELU. DiT and FLUX use the tanh approximation of GELU in the MLPs of their Transformer blocks. BERT-family embedding models (E5, BGE) still use GELU. It remains one of the most common activations in modern Transformer code.
If you load an OpenAI CLIP checkpoint and your activations look slightly off, check whether your runtime is using QuickGELU (sigmoid form) or standard GELU (erf form). The two are similar but not identical.
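One way to run that check, sketched under the assumption that you can capture a handful of pre-activation and post-activation values from a reference runtime: score each candidate activation against the captured outputs and see which one fits. The helper `closest_activation` and the `CANDIDATES` table are hypothetical names for this illustration, not part of any library.

```python
import math


def gelu_erf(x: float) -> float:
    """Standard GELU (erf form)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def quick_gelu(x: float) -> float:
    """QuickGELU (sigmoid form)."""
    return x / (1.0 + math.exp(-1.702 * x))


CANDIDATES = {"gelu_erf": gelu_erf, "quick_gelu": quick_gelu}


def closest_activation(inputs, reference_outputs):
    """Return the candidate name with the smallest max abs error."""
    def max_err(name):
        fn = CANDIDATES[name]
        return max(abs(fn(x) - y) for x, y in zip(inputs, reference_outputs))
    return min(CANDIDATES, key=max_err)


# Stand-in for values captured from a reference run (here: generated
# with QuickGELU so that the example is self-contained).
inputs = [-2.0, -1.0, 0.5, 2.0]
reference = [quick_gelu(x) for x in inputs]
print(closest_activation(inputs, reference))
```

Inputs around |x| ≈ 2 are the most informative probes, since that is where the two forms differ most.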
Comparison with Swish/SiLU
Swish(x) = x · sigmoid(x), also called SiLU, is closely related. Like GELU, it is non-monotonic for small negatives and smooth everywhere, and on most benchmarks it performs within noise of GELU. SwiGLU's use of Swish rather than GELU as the gate activation is a small architectural detail that has held up empirically across many ablations.
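The similarity in shape can be seen by locating each function's dip numerically. The sketch below uses a simple grid search over [−5, 0]; the helper `argmin_on_grid` and its grid resolution are illustrative choices. The `beta` parameter follows the general Swish form x · sigmoid(β · x), of which SiLU is the β = 1 case.

```python
import math


def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x: float, beta: float = 1.0) -> float:
    """Swish: x * sigmoid(beta * x). beta = 1 is SiLU."""
    return x / (1.0 + math.exp(-beta * x))


def argmin_on_grid(fn, lo: float = -5.0, hi: float = 0.0, steps: int = 5000):
    """Return the grid point in [lo, hi] where fn is smallest."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(xs, key=fn)


x_gelu = argmin_on_grid(gelu)
x_silu = argmin_on_grid(swish)

print(f"GELU minimum near x={x_gelu:.3f}, value={gelu(x_gelu):.4f}")
print(f"SiLU minimum near x={x_silu:.3f}, value={swish(x_silu):.4f}")
```

Both functions dip below zero once and then return towards it; SiLU's dip sits further left and is somewhat deeper than GELU's.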