TL;DR
- SentencePiece (Kudo & Richardson, 2018, arXiv:1808.06226) is a tokeniser that treats input as a raw Unicode string, including whitespace, and produces subword tokens directly — no language-specific preprocessing required.
- It supports both BPE and Unigram language model tokenisation as backends.
- Llama 1/2, T5, mT5, ALBERT, XLNet and most multilingual models use SentencePiece. Llama 3 moved to a tiktoken-style byte-level BPE for efficiency.
- The hallmark sentinel '▁' (lower one-eighth block) represents a leading whitespace, making detokenisation reversible without manual rules.
Why a Language-Agnostic Tokeniser
Earlier tokenisers assumed a pre-segmentation step — split on whitespace, then run BPE on the resulting words. That breaks down for Chinese, Japanese and Thai, which lack whitespace, and for any pipeline that wants to operate on raw text. SentencePiece dispenses with the pre-segmentation entirely: it operates on the raw string, including spaces, and learns its own subword units.
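Whitespace is encoded with the meta-symbol ▁ (U+2581). 'Hello world' becomes, for example, ['▁Hello', '▁world']. Detokenisation is the reverse: concatenate the tokens, replace each ▁ with a space, and drop the leading space that the prefix ▁ introduces. The round trip is exact with respect to the normalised text; by default SentencePiece applies NFKC-based normalisation and collapses repeated whitespace before tokenising.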
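The meta-symbol convention can be illustrated without a trained model. The sketch below is a toy stand-in, not SentencePiece itself: the hypothetical helper `to_word_pieces` emits one piece per whitespace-delimited word, whereas a real model would split words further according to its learned vocabulary, and no normalisation is applied.

```python
META = "\u2581"  # the '▁' meta-symbol (LOWER ONE EIGHTH BLOCK)


def to_word_pieces(text: str) -> list[str]:
    """Toy segmentation: mark every space with META and cut before each META."""
    marked = META + text.replace(" ", META)  # prefix plays the role of the leading space
    pieces: list[str] = []
    for ch in marked:
        if ch == META or not pieces:
            pieces.append(ch)   # a meta-symbol always starts a new piece
        else:
            pieces[-1] += ch    # any other character extends the current piece
    return pieces


def detokenise(pieces: list[str]) -> str:
    """Concatenate pieces, turn META back into spaces, drop the prefix space."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text


pieces = to_word_pieces("Hello world")
print(pieces)              # ['▁Hello', '▁world']
print(detokenise(pieces))  # Hello world
```

Because the space travels inside the pieces, no language-specific rule is needed to decide where spaces go when the text is reassembled.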
BPE Mode vs Unigram Mode
SentencePiece supports two tokenisation algorithms:
- BPE: the same algorithm as Sennrich's, iterative pair merging. Deterministic and greedy at inference.
- Unigram language model: initialise a large candidate vocabulary, then iteratively prune the least useful pieces (under EM-style optimisation) until a target size is reached. At inference, it scores the possible segmentations and picks the highest-likelihood one (Viterbi).
Unigram is the library's default model type and the more common choice in SentencePiece deployments because it tends to produce slightly better tokenisations on multilingual data and allows subword regularisation: sampling alternative segmentations during training as a regulariser (Kudo, 2018).
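The Viterbi step of Unigram inference can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical hand-written vocabulary `TOY_LOG_PROBS` with made-up, unnormalised scores; a trained model would supply real log-probabilities, and the library's implementation differs in detail.

```python
import math


def viterbi_segment(text: str, log_probs: dict[str, float]) -> list[str]:
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any segmentation of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that segmentation
    best[0] = 0.0
    max_len = max(map(len, log_probs))
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Hypothetical toy vocabulary; the scores are illustrative only.
TOY_LOG_PROBS = {piece: math.log(p) for piece, p in {
    "▁": 0.05, "h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
    "▁he": 0.1, "ll": 0.1, "llo": 0.15, "▁hell": 0.1, "▁hello": 0.2,
}.items()}

print(viterbi_segment("▁hello", TOY_LOG_PROBS))  # ['▁hello']
```

Removing '▁hello' from the toy vocabulary changes the winner to ['▁he', 'llo'], since 0.1 × 0.15 beats every other remaining combination. Subword regularisation samples from such alternatives instead of always taking the best one.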
Adoption
| Model | Mode | Vocab size |
|---|---|---|
| T5 | Unigram | 32,000 |
| mT5 | Unigram | 250,112 |
| ALBERT | Unigram | 30,000 |
| XLNet | Unigram | 32,000 |
| Llama 1 | BPE | 32,000 |
| Llama 2 | BPE | 32,000 |
| Gemma 1/2 | BPE | 256,000 |
Why Llama 3 Moved Away
Llama 3 switched from SentencePiece-BPE (32k) to a tiktoken-style byte-level BPE (128k). The reasons cited in the technical report: better non-English tokenisation efficiency, better code tokenisation, and tighter integration with the byte-level workflows OpenAI and others had standardised on. The tokeniser is no longer SentencePiece-based but the underlying BPE algorithm is the same.
Most new frontier models since 2024 have followed this pattern. SentencePiece remains in active use for multilingual research models, T5 derivatives and Gemma.
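To make the "same algorithm, different base alphabet" point concrete, here is a rough sketch of a single BPE merge step applied to Unicode characters and to UTF-8 bytes. The helpers `most_frequent_pair` and `merge_pair` are hypothetical names for illustration. Production tokenisers such as tiktoken also apply regex pre-splitting and learn many thousands of merges, both of which are omitted here.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair of symbols (first seen wins ties)."""
    pairs = Counter(zip(seq, seq[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])  # works for both str and bytes symbols
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


text = "日本語の日本"
char_symbols = list(text)                                  # character-level base alphabet
byte_symbols = [bytes([b]) for b in text.encode("utf-8")]  # byte-level base alphabet

print(len(char_symbols), len(byte_symbols))  # 6 18
print(merge_pair(char_symbols, most_frequent_pair(char_symbols)))
# ['日本', '語', 'の', '日本']
```

The byte-level variant starts from at most 256 symbols and can represent any input, at the cost of needing more merges before common multi-byte characters become single tokens.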
Using SentencePiece in Practice
The C++ core is fast (the project README reports roughly 50k sentences per second), which matters for training-data preprocessing pipelines that may need to process trillions of tokens.
```python
import sentencepiece as spm

# Train a model; writes my_tokeniser.model and my_tokeniser.vocab.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="my_tokeniser",
    vocab_size=32000,
    model_type="bpe",  # or "unigram"
    character_coverage=0.9995,
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
)

# Load the trained model and round-trip a string.
sp = spm.SentencePieceProcessor(model_file="my_tokeniser.model")
ids = sp.encode("Hello, world!")  # list of integer token IDs
text = sp.decode(ids)  # "Hello, world!"
```

When fine-tuning a model whose tokeniser is SentencePiece, do not retrain or extend the tokeniser unless you also train embedding rows for the new tokens. Retraining reassigns token IDs, so the existing embeddings no longer match them; adding tokens without resizing the embedding matrix produces IDs the model has no rows for.
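If new tokens are unavoidable, the embedding matrix has to grow with the vocabulary. The sketch below shows the idea in plain Python under two assumptions: embeddings are held as a list of rows, and new rows are initialised to the mean of the existing ones, which is a common heuristic rather than anything SentencePiece prescribes. `extend_embeddings` is a hypothetical helper; deep-learning frameworks provide their own resizing utilities.

```python
def extend_embeddings(embeddings, num_new_tokens):
    """Return a copy of `embeddings` with extra rows set to the mean of the existing rows."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean_row = [sum(row[j] for row in embeddings) / n for j in range(dim)]
    extended = [list(row) for row in embeddings]  # existing token IDs keep their rows
    extended += [list(mean_row) for _ in range(num_new_tokens)]
    return extended


old = [[0.0, 2.0], [2.0, 4.0]]   # toy 2-token vocabulary, 2-dimensional embeddings
new = extend_embeddings(old, 1)
print(new)  # [[0.0, 2.0], [2.0, 4.0], [1.0, 3.0]]
```

The added rows are only a starting point; they still need to be trained before the new tokens carry useful meaning.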