SentencePiece Tokeniser

TL;DR

SentencePiece (Kudo & Richardson, 2018, arXiv:1808.06226) is a tokeniser that treats input as a raw Unicode string, including whitespace, and produces subword tokens directly — no language-specific preprocessing required.
It supports both BPE and Unigram language model tokenisation as backends.
Llama 1/2, T5, mT5, ALBERT, XLNet and many multilingual models use SentencePiece. Llama 3 moved to a tiktoken-style byte-level BPE for efficiency.
The hallmark sentinel '▁' (lower one-eighth block) represents a leading whitespace, making detokenisation reversible without manual rules.

Why a Language-Agnostic Tokeniser

Earlier tokenisers assumed a pre-segmentation step — split on whitespace, then run BPE on the resulting words. That breaks down for Chinese, Japanese and Thai, which lack whitespace, and for any pipeline that wants to operate on raw text. SentencePiece dispenses with the pre-segmentation entirely: it operates on the raw string, including spaces, and learns its own subword units.

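Whitespace is encoded with the meta-symbol ▁ (U+2581). 'Hello world' becomes ['▁Hello', '▁world']; the leading ▁ comes from a dummy prefix that SentencePiece adds to the start of the text by default. Detokenisation is the reverse: concatenate the tokens, replace ▁ with a space, and drop the leading space. The round trip is exact with respect to the normalised text (SentencePiece applies NFKC-based normalisation by default).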
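The whitespace convention can be illustrated without a trained model. The sketch below is a toy stand-in, not SentencePiece itself: `toy_pieces` and `detokenise` are hypothetical helper names, and the "model" simply cuts one piece per word, whereas a real model would split words further into subwords.

```python
META = "\u2581"  # the '▁' meta-symbol (U+2581)

def toy_pieces(text):
    """Escape spaces as META and cut one piece per word (toy stand-in for a trained model)."""
    # The leading META plays the role of the dummy prefix.
    escaped = META + text.replace(" ", META)
    pieces = []
    for ch in escaped:
        if ch == META:
            pieces.append(ch)   # every META starts a new piece
        else:
            pieces[-1] += ch    # other characters extend the current piece
    return pieces

def detokenise(pieces):
    """Concatenate pieces, turn META back into spaces, drop the leading space."""
    return "".join(pieces).replace(META, " ")[1:]

pieces = toy_pieces("Hello world")
print(pieces)              # ['▁Hello', '▁world']
print(detokenise(pieces))  # Hello world
```

Because the spaces travel inside the pieces, detokenisation needs no language-specific rules about where to put whitespace back.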

BPE Mode vs Unigram Mode

SentencePiece supports two subword tokenisation algorithms:

BPE: the same algorithm as Sennrich's, iterative pair merging. Deterministic; at inference the learned merges are applied greedily.
Unigram language model: initialise a large candidate vocabulary, then iteratively prune the least useful pieces (under EM-style optimisation) until a target size is reached. At inference, it scores all possible segmentations and picks the highest-likelihood one (Viterbi).

Unigram is the default model type and the usual choice in SentencePiece deployments because it tends to produce slightly better tokenisations on multilingual data and allows subword regularisation: sampling alternative segmentations during training as a regulariser (Kudo, 2018).
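To make the two inference procedures concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not SentencePiece's implementation: the function names, the merge list and the log-probabilities are all made up for the example, and training (learning merges or pruning the vocabulary) is left out.

```python
import math

def bpe_segment(word, merges):
    """Greedy BPE inference: repeatedly apply the earliest-learned merge present."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # (rank, position) for every adjacent pair; unknown pairs get rank inf.
        candidates = [
            (rank.get((a, b), math.inf), i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == math.inf:
            break  # no learned merge applies any more
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

def unigram_segment(text, log_probs):
    """Viterbi search: best[i] is the best score of any segmentation of text[:i]."""
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, end = [], n
    while end > 0:  # follow back-pointers from the end of the string
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Hypothetical toy parameters, chosen only for illustration.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
log_probs = {"l": -4, "o": -4, "w": -4, "e": -4, "r": -4,
             "low": -3, "er": -3, "lower": -7}

print(bpe_segment("lower", merges))         # ['low', 'er']
print(unigram_segment("lower", log_probs))  # ['low', 'er']
```

The BPE routine only ever looks at adjacent pairs and commits to each merge, while the Unigram routine compares complete segmentations by total log-probability. That global scoring is also what makes it possible to sample alternative segmentations for subword regularisation.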

Adoption

Model	Mode	Vocab size
T5	Unigram	32,000
mT5	Unigram	250,112
ALBERT	Unigram	30,000
XLNet	Unigram	32,000
Llama 1	BPE	32,000
Llama 2	BPE	32,000
Gemma 1/2	BPE	256,000

Why Llama 3 Moved Away

Llama 3 switched from SentencePiece-BPE (32k vocabulary) to a tiktoken-style byte-level BPE (128k vocabulary). The motivations usually given are better tokenisation efficiency (more characters per token) on non-English text and on code, and tighter integration with the byte-level workflows OpenAI and others had standardised on. The tokeniser is no longer SentencePiece-based, but the underlying BPE algorithm is the same.
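What "byte-level" means can be sketched in a few lines. This is only the base alphabet, written with hypothetical helper names. A real byte-level BPE tokeniser then learns merges over these byte symbols, and that step is omitted here.

```python
def to_byte_symbols(text):
    """Map text to its UTF-8 bytes: the 256-value base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))

def from_byte_symbols(symbols):
    """Invert the mapping; valid UTF-8 byte sequences decode back to the same text."""
    return bytes(symbols).decode("utf-8")

symbols = to_byte_symbols("é")
print(symbols)                     # [195, 169]
print(from_byte_symbols(symbols))  # é
```

Since every string reduces to values in the range 0 to 255, no input character can fall outside the base vocabulary, whatever the script.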

Most new frontier models since 2024 have followed this pattern. SentencePiece remains in active use for multilingual research models, T5 derivatives and Gemma.

Using SentencePiece in Practice

The C++ core is fast (the project README quotes a segmentation speed of around 50k sentences per second), which matters for training-data preprocessing pipelines that may need to process trillions of tokens.

```python
import sentencepiece as spm

# Train a model; this writes my_tokeniser.model and my_tokeniser.vocab.
spm.SentencePieceTrainer.train(
    input="corpus.txt",             # plain text, one sentence per line
    model_prefix="my_tokeniser",
    vocab_size=32000,
    model_type="bpe",               # or "unigram" (the default)
    character_coverage=0.9995,
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
)

# Load the trained model and round-trip a string.
sp = spm.SentencePieceProcessor(model_file="my_tokeniser.model")
ids = sp.encode("Hello, world!")            # list of token IDs
text = sp.decode(ids)                       # "Hello, world!"
```

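When fine-tuning a model whose tokeniser is SentencePiece, do not retrain or extend the tokeniser unless you also resize the embedding matrix and train the rows for the new tokens. Retraining changes the mapping from pieces to IDs, and newly added tokens have no trained embeddings, so changing the tokeniser while the model's embeddings stay frozen breaks the model.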
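The bookkeeping involved in extending a vocabulary can be sketched with a plain list-of-lists embedding table. Everything here is an assumption made for illustration: `extend_embeddings` is a hypothetical helper, and initialising new rows at the mean of the existing rows plus small noise is one common heuristic, not something the SentencePiece library does for you. The new rows still have to be trained afterwards.

```python
import random

def extend_embeddings(embeddings, num_new_tokens, seed=0):
    """Return a copy of the table with extra rows for newly added token IDs.

    New rows start at the mean of the existing rows plus small Gaussian noise
    (an assumed heuristic); existing rows keep their IDs and values.
    """
    dim = len(embeddings[0])
    mean = [sum(row[d] for row in embeddings) / len(embeddings) for d in range(dim)]
    rng = random.Random(seed)
    new_rows = [
        [m + rng.gauss(0.0, 0.01) for m in mean]
        for _ in range(num_new_tokens)
    ]
    return [list(row) for row in embeddings] + new_rows

old_table = [[1.0, 2.0], [3.0, 4.0]]       # 2 tokens, embedding dimension 2
new_table = extend_embeddings(old_table, 3)
print(len(old_table), len(new_table))      # 2 5
```

Appending rows keeps every existing ID pointing at its original vector, which is why extending a vocabulary is recoverable while retraining the tokeniser from scratch is not.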

References

SentencePiece: A simple and language independent subword tokenizer (Kudo & Richardson, 2018) · arXiv
Subword Regularization (Kudo, 2018) · arXiv
Llama 2 Technical Report · arXiv
