ALiBi (Attention with Linear Biases)

TL;DR

ALiBi (Press et al., 2021, arXiv:2108.12409) replaces positional embeddings with a static, head-specific linear bias added to the pre-softmax attention scores.
The bias for a head with slope m at query-key distance |i − j| is −m · |i − j|, so the penalty on a token grows in proportion to its distance.
Different heads use different slopes (geometric series 2^(−8/h), 2^(−16/h), …), so some heads focus locally and others can still reach far.
ALiBi extrapolates to sequences much longer than those seen in training, without retraining. BLOOM, MosaicML's MPT models and Replit's code models adopted it.

The Idea in One Sentence

Pre-softmax, add a fixed bias of −m · |i − j| to the attention score between query i and key j, where m is a constant slope chosen per attention head. No learned positional parameters, no embeddings added to the input — position is communicated entirely through that linear penalty.
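
Written out for a single head, the rule can be sketched as follows. The 1/√d scaling of the dot product is the usual scaled dot-product convention and is an assumption here, not something stated above; a_{ij} denotes the attention weight that query i places on key j.

```latex
\[
a_{ij} =
\frac{\exp\!\left(\dfrac{q_i \cdot k_j}{\sqrt{d}} - m\,|i - j|\right)}
     {\sum_{j'} \exp\!\left(\dfrac{q_i \cdot k_{j'}}{\sqrt{d}} - m\,|i - j'|\right)}
\]
```

With m = 0 the expression reduces to attention with no positional signal at all; a larger m concentrates the weight on nearby keys.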

Why Linear and Why a Per-Head Slope

A linear penalty in distance is about the simplest 'closer is better' inductive bias. Because it depends only on the distance |i − j| and not on absolute positions, shifting the whole sequence leaves the bias unchanged, which gives the mechanism translation invariance.

But a single slope is too rigid: some attention patterns need locality (the previous-token head), others need long-range mixing. ALiBi gives each head a different slope from a geometric series. With h heads, slope_i = 2^(−8 · i / h) for i = 1..h. The fast-decaying heads stay local; the slow-decaying heads can attend across the whole sequence.
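
To make the slope formula concrete, here is a small sketch that evaluates it for h = 8 heads. The function name `alibi_slopes` is made up for this illustration; the formula is the one given above.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Slopes 2^(-8*i/h) for i = 1..h, following the formula in the text."""
    return [2.0 ** (-8.0 * i / num_heads) for i in range(1, num_heads + 1)]


slopes = alibi_slopes(8)
print(slopes)
# [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]

# Bias at distance 100 for the steepest and the shallowest head.
print(-slopes[0] * 100)   # -50.0
print(-slopes[-1] * 100)  # -0.390625
```

At a distance of 100 tokens the steepest head has effectively switched the key off, while the shallowest head applies a penalty of less than 0.4 and can still attend to it.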

Length Extrapolation

Press et al.'s headline result is extrapolation: a model trained at 1,024 tokens and evaluated at 16,384 tokens degrades far less in perplexity with ALiBi than the same model with sinusoidal or learned positions. Because the bias is a closed-form function of distance, it applies unchanged to any sequence length.
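
The closed-form property can be checked directly. The sketch below uses plain Python lists, a hypothetical helper `alibi_bias_matrix`, and an arbitrary slope of 0.5. It shows that the bias for a longer sequence contains the shorter one unchanged, so nothing learned at the training length has to be re-fit.

```python
def alibi_bias_matrix(slope: float, seq_len: int) -> list[list[float]]:
    """Bias -slope * |i - j| for one head, as a plain nested list."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


short = alibi_bias_matrix(0.5, 4)    # stands in for the training length
long = alibi_bias_matrix(0.5, 16)    # stands in for the evaluation length

# The top-left 4x4 block of the long matrix is exactly the short matrix.
assert [row[:4] for row in long[:4]] == short

# Distances never seen at the short length simply continue the same line.
print(long[15][0])  # -7.5
```

This only demonstrates that the bias itself is well defined at any length; how well a trained model copes with the longer input is the empirical question the paper measures.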

This made ALiBi popular in the early long-context era (2022-2023), before RoPE scaling techniques (NTK-aware, YaRN) matured. Today, RoPE with YaRN is the more common choice for new frontier models, but ALiBi remains attractive for its zero learned parameters and clean extrapolation.

ALiBi's extrapolation works without any continued pretraining. RoPE generally needs at least short continued training at the longer context. That simplicity is ALiBi's enduring advantage.

Implementation

ALiBi is a single tensor addition to the pre-softmax score matrix. The bias matrix is constant, with no gradients flowing through it, so it can be precomputed once at model initialisation.

python

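import torch


def alibi_bias(num_heads, seq_len, device):
    """Return the (h, n, n) ALiBi bias -slope * |i - j| for every head."""
    # Head k (1-based) gets slope 2^(-8k/h): a geometric series.
    slopes = torch.tensor(
        [2 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)],
        device=device,
    )                                              # (h,)
    pos = torch.arange(seq_len, device=device)
    distance = (pos[None, :] - pos[:, None]).abs() # (n, n)
    return -slopes[:, None, None] * distance       # (h, n, n)


# Random stand-in for real attention scores, shape (h, n, n).
h, n = 8, 16
scores = torch.randn(h, n, n)
scores = scores + alibi_bias(h, n, scores.device)
weights = scores.softmax(dim=-1)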
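
In a decoder the bias is combined with the usual causal mask, so a query at position i only sees keys at positions j ≤ i. The following dependency-free sketch (plain Python rather than torch, with a hypothetical helper `causal_alibi_weights`) computes the attention weights of the last query for one head, to show how the slope alone shapes the distribution when all content scores are equal.

```python
import math


def causal_alibi_weights(scores: list[float], slope: float) -> list[float]:
    """Attention weights of the last query over keys 0..i for one head.

    scores[j] is the raw query-key score for key j; the query sits at
    position i = len(scores) - 1, so every key satisfies j <= i (causal).
    """
    i = len(scores) - 1
    biased = [s - slope * (i - j) for j, s in enumerate(scores)]
    peak = max(biased)                       # subtract the max for stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


flat = [0.0] * 6                                  # identical content scores
steep = causal_alibi_weights(flat, 0.5)           # slope of head 1 when h = 8
shallow = causal_alibi_weights(flat, 2.0 ** -8)   # slope of head 8 when h = 8
print([round(w, 3) for w in steep])
print([round(w, 3) for w in shallow])
```

The steep head puts most of its weight on the most recent keys, while the shallow head stays close to uniform, which is the local versus long-range split described earlier.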

Where It Is Used Today

BLOOM (BigScience, 2022) — the largest open multilingual model of its era — used ALiBi. MPT-7B and MPT-30B (MosaicML, 2023) used ALiBi to support arbitrary context windows in deployment. Replit's code models adopted ALiBi for the same reason. Most new frontier models since 2024 use RoPE instead, but ALiBi is still chosen when ease of length extrapolation outweighs the small quality gap.

Trade-offs versus RoPE

Property	ALiBi	RoPE
Learned params	Zero	Zero
Extrapolation without training	Strong	Weak (needs YaRN/NTK)
Best-in-class short-context quality	Slightly behind	Slightly ahead
Implementation	One tensor add	Pairwise rotation
Used by frontier 2025+ models	Rare	Default

References

Train Short, Test Long: ALiBi (Press et al., 2021) · arXiv
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model · arXiv
MPT-7B Technical Report (MosaicML) · MosaicML / Databricks
