TL;DR
- ALiBi (Press et al., 2021, arXiv:2108.12409) replaces positional embeddings with a static, head-specific linear bias added to the pre-softmax attention scores.
- The bias for a head with slope m at query-key distance |i − j| is −m · |i − j|, penalising far-away tokens in proportion to their distance.
- Different heads use different slopes (geometric series 2^(−8/h), 2^(−16/h), …), so some heads focus locally and others can still reach far.
- ALiBi extrapolates to sequences much longer than those seen in training, without retraining. BLOOM, MosaicML's MPT models and Replit's code models adopted it.
The Idea in One Sentence#
Pre-softmax, add a fixed bias of −m · |i − j| to the attention score between query i and key j, where m is a constant slope chosen per attention head. No learned positional parameters, no embeddings added to the input — position is communicated entirely through that linear penalty.
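As a rough illustration of the sentence above, the sketch below applies the penalty to a single query row in plain Python. The helper name `alibi_row_weights`, the slope of 0.5 and the all-zero content scores are assumptions chosen for the example, and a causal setting (keys at or before the query) is assumed.

```python
import math

def alibi_row_weights(scores, query_pos, slope):
    """Softmax over one query's scores after adding -slope * |i - j|.

    `scores[j]` is the raw (content) score between the query at
    `query_pos` and key j. Hypothetical helper, for illustration only.
    """
    biased = [s - slope * abs(query_pos - j) for j, s in enumerate(scores)]
    peak = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Six keys with identical content scores; the query sits at position 5.
weights = alibi_row_weights([0.0] * 6, query_pos=5, slope=0.5)
print([round(w, 3) for w in weights])
```

With identical content scores, the only difference between keys is the distance penalty, so the weights rise monotonically toward the query's own position, and each step closer multiplies the weight by e^0.5.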
Why Linear and Why a Per-Head Slope#
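A linear penalty in distance is the simplest 'closer is better' inductive bias. Because it depends only on the offset i − j and not on absolute position, the bias is unchanged when the whole sequence is shifted, which gives translation invariance. Any bias that is a function of the offset alone shares this property; the linear form is simply the cheapest one, with a single constant per head.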
But a single slope is too rigid: some attention patterns need locality (the previous-token head), others need long-range mixing. ALiBi gives each head a different slope from a geometric series. With h heads, slope_i = 2^(−8 · i / h) for i = 1..h. The fast-decaying heads stay local; the slow-decaying heads can attend across the whole sequence.
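To make the slope schedule concrete, here is a minimal sketch that evaluates the formula above for 8 heads and shows the bias each head would apply to a key 100 tokens away. The function name `alibi_slopes`, the head count and the distance of 100 are illustrative assumptions.

```python
def alibi_slopes(num_heads):
    """Slopes 2^(-8*i/h) for i = 1..h, as given in the text."""
    return [2.0 ** (-8.0 * i / num_heads) for i in range(1, num_heads + 1)]

slopes = alibi_slopes(8)

# Bias each head adds to the score of a key 100 tokens from the query.
penalties = [-m * 100 for m in slopes]

for head, (m, p) in enumerate(zip(slopes, penalties), start=1):
    print(f"head {head}: slope={m:.6f}  bias at distance 100={p:.4f}")
```

With 8 heads the slopes run from 1/2 down to 1/256. At distance 100 the steepest head subtracts 50 from the score, which effectively removes that key after the softmax, while the shallowest head subtracts only about 0.39 and can still attend to it.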
Length Extrapolation#
Press et al.'s headline result is extrapolation: a model trained at 1,024 tokens and evaluated at 16,384 tokens degrades far less in perplexity with ALiBi than the same model with sinusoidal or learned positions. Because the bias is a closed-form function of distance, it applies unchanged to any sequence length.
This made ALiBi popular in the early long-context era (2022-2023), before RoPE scaling techniques (NTK-aware, YaRN) matured. Today, RoPE with YaRN is the more common choice for new frontier models, but ALiBi remains attractive for its zero learned parameters and clean extrapolation.
ALiBi's extrapolation works without any continued pretraining. RoPE generally needs a scaling method such as NTK-aware interpolation or YaRN, usually combined with at least a short period of continued training at the longer context. That simplicity is ALiBi's enduring advantage.
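The claim that the bias applies unchanged at any length can be checked with a small sketch. It builds the bias for one head at a short and a long length and confirms that the long matrix contains the short one as its top-left block. The helper name `alibi_bias_matrix`, the slope of 0.25 and the lengths 4 and 16 are assumptions standing in for real training and evaluation lengths.

```python
def alibi_bias_matrix(seq_len, slope):
    """(seq_len x seq_len) nested list of -slope * |i - j| for one head."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slope = 0.25                                 # arbitrary example slope
short_bias = alibi_bias_matrix(4, slope)     # stand-in for the training length
long_bias = alibi_bias_matrix(16, slope)     # stand-in for a longer eval length

# The top-left 4x4 block of the long matrix is exactly the short matrix.
top_left = [row[:4] for row in long_bias[:4]]
same_block = top_left == short_bias

# Largest query-key distance each matrix covers.
max_short = max(abs(v) for row in short_bias for v in row) / slope
max_long = max(abs(v) for row in long_bias for v in row) / slope
print(same_block, max_short, max_long)
```

Nothing has to be resized or relearned when the sequence grows; the model simply sees larger distances than it did in training. This only shows that the bias is well defined at any length. Whether quality holds at those lengths is the empirical result reported by Press et al.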
Implementation#
ALiBi amounts to a single tensor addition to the pre-softmax score matrix. The bias matrix is constant, with no learned parameters and no gradients flowing into it, so it can be precomputed once at model initialisation.

    import torch

    def alibi_bias(num_heads, seq_len, device):
        # One slope per head: 2^(-8/h), 2^(-16/h), ..., 2^(-8)
        slopes = torch.tensor(
            [2 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)],
            device=device,
        )  # (h,)
        pos = torch.arange(seq_len, device=device)
        # |i - j| for every query/key pair
        distance = (pos[None, :] - pos[:, None]).abs()  # (n, n)
        # Broadcast the per-head slopes over the distance matrix
        return -slopes[:, None, None] * distance  # (h, n, n)

    # scores shape: (h, n, n); h, n and scores come from the attention layer
    scores = scores + alibi_bias(h, n, scores.device)
    weights = scores.softmax(dim=-1)

Where It Is Used Today#
BLOOM (BigScience, 2022) — the largest open multilingual model of its era — used ALiBi. MPT-7B and MPT-30B (MosaicML, 2023) used ALiBi to support arbitrary context windows in deployment. Replit's code models adopted ALiBi for the same reason. Most new frontier models since 2024 use RoPE instead, but ALiBi is still chosen when ease of length extrapolation outweighs the small quality gap.
Trade-offs versus RoPE#
| Property | ALiBi | RoPE |
|---|---|---|
| Learned params | Zero | Zero |
| Extrapolation without training | Strong | Weak (needs YaRN/NTK) |
| Best-in-class short-context quality | Slightly behind | Slightly ahead |
| Implementation | One tensor add | Pairwise rotation |
| Used by frontier 2025+ models | Rare | Default |
References
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (Press et al., 2021) · arXiv:2108.12409
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model · arXiv
- MPT-7B Technical Report (MosaicML) · MosaicML / Databricks