tiktoken Tokeniser

TL;DR

tiktoken is OpenAI's open-source byte-level BPE tokeniser, written in Rust with Python bindings, optimised for high-throughput encoding.
It ships pre-trained vocabularies used by GPT-2 (r50k_base), GPT-3.5/GPT-4 (cl100k_base) and GPT-4o (o200k_base).
The project's own benchmarks put encoding at roughly 3-6× faster than a comparable HuggingFace tokeniser on the same inputs, which matters for large-scale prompt counting and dataset preprocessing.
The cl100k_base vocab roughly doubled the vocabulary size and tokenises source code more compactly; o200k_base extended it to ~200k tokens with much better multilingual efficiency.

What tiktoken Is

tiktoken is a tokenisation library, not a tokenisation algorithm. Under the hood it runs byte-level BPE, the same algorithm GPT-2 uses. What makes it notable is the implementation: a Rust core that combines regex pre-tokenisation, hash-table lookups of merge ranks and SIMD-friendly byte handling to achieve very high throughput on long inputs.
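To make the byte-level BPE step concrete, here is a minimal pure-Python sketch of the merge loop. It is an illustration under stated assumptions, not tiktoken's optimised Rust code: the function name `bpe_encode` and the three-merge `toy_ranks` vocabulary are hypothetical, and a real vocabulary holds on the order of 100k entries.

```python
from typing import Dict, List


def bpe_encode(piece: bytes, ranks: Dict[bytes, int]) -> List[int]:
    """Encode one pre-tokenised piece with byte-level BPE.

    `ranks` maps a token's bytes to its id. A lower id means the merge was
    learned earlier, so it is applied first.
    """
    # Start from single bytes, so any input is representable.
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        # Find the adjacent pair whose concatenation has the lowest rank.
        best_rank, best_i = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break  # no adjacent pair is in the vocabulary
        # Merge that pair into a single part and repeat.
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return [ranks[p] for p in parts]


# Toy vocabulary (an assumption for illustration): the 256 single bytes,
# followed by three learned merges.
toy_ranks = {bytes([i]): i for i in range(256)}
toy_ranks.update({b"lo": 256, b"low": 257, b"er": 258})

print(bpe_encode(b"lower", toy_ranks))  # [257, 258]
```

The loop is quadratic in the length of the piece, which is acceptable here because regex pre-tokenisation keeps pieces short; production implementations use faster data structures for the same result.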

It is also the official source of OpenAI's tokeniser vocabularies. If you need to count GPT-4 prompt tokens accurately or replicate GPT-4's exact tokenisation, tiktoken with the cl100k_base encoding is the canonical implementation.

Pre-trained Encodings

Each successive encoding represents common inputs in fewer tokens. cl100k_base is roughly 30 per cent more efficient than r50k_base (fewer tokens for the same text) on English prose and code; o200k_base gains roughly another 20 per cent on multilingual and code-heavy inputs by allocating vocabulary entries to common patterns in those domains.

Encoding	Vocab size	Models
r50k_base / gpt2	50,257	GPT-2, GPT-3 (davinci)
p50k_base	50,281	Codex (code-davinci-002), text-davinci-002/003
cl100k_base	100,277	GPT-3.5, GPT-4, text-embedding-3
o200k_base	200,019	GPT-4o, GPT-4.1, o-series
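One way to check the efficiency figures above on your own data is to measure UTF-8 bytes per token for each encoding. The sketch below is library-agnostic: the helper names `bytes_per_token` and `compare_encoders` are hypothetical, and each encoder is simply a callable that maps a string to a list of token ids.

```python
from typing import Callable, Dict, List

Encoder = Callable[[str], List[int]]


def bytes_per_token(encode: Encoder, text: str) -> float:
    """Average number of UTF-8 bytes covered by one token (higher is better)."""
    n_tokens = len(encode(text))
    if n_tokens == 0:
        raise ValueError("text produced no tokens")
    return len(text.encode("utf-8")) / n_tokens


def compare_encoders(encoders: Dict[str, Encoder], text: str) -> Dict[str, float]:
    """Return bytes-per-token for every named encoder on the same text."""
    return {name: bytes_per_token(enc, text) for name, enc in encoders.items()}
```

With tiktoken installed, passing `{"cl100k_base": tiktoken.get_encoding("cl100k_base").encode, "o200k_base": tiktoken.get_encoding("o200k_base").encode}` together with a representative sample of your corpus gives a direct comparison. The result depends heavily on the sample, so use text that resembles the real workload.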

Using tiktoken

python

import tiktoken

# Load an encoding by name...
enc = tiktoken.get_encoding("cl100k_base")
# ...or look it up from a model name:
enc = tiktoken.encoding_for_model("gpt-4")

ids = enc.encode("Hello, world!")            # [9906, 11, 1917, 0]
text = enc.decode(ids)                        # "Hello, world!"

long_text = "tiktoken counts tokens. " * 1000  # stand-in for a real document
print(len(enc.encode(long_text)))             # token count

# Encode several strings in parallel across threads:
ids_batch = enc.encode_batch(["a", "b", "c"], num_threads=8)

Use Beyond OpenAI Models

Several open models have adopted tiktoken-compatible or tiktoken-style tokenisers. Llama 3 uses a byte-level BPE built with tiktoken's conventions, with a 128,256-token vocab. Qwen 2 and Qwen 3 use byte-level BPE with vocabs of around 150k tokens, and DeepSeek-V3 uses a vocab of about 129k. The interoperability is only partial, because the exact merge tables differ, but the conventions (byte alphabet, regex pre-tokenisation, vocab sizes of 100k+) are shared.
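The regex pre-tokenisation convention mentioned above can be sketched with the standard library. The pattern below is a simplified approximation in the style of the GPT-2 split pattern, written for Python's `re` module; it is not the exact pattern shipped by any of these models, and the names `SPLIT_PATTERN` and `pre_tokenise` are hypothetical.

```python
import re

# Simplified stand-in for a GPT-2-style split pattern (an approximation).
SPLIT_PATTERN = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"   # common English contractions
    r"| ?[^\W\d_]+"            # optional leading space + letters
    r"| ?\d+"                  # optional leading space + digits
    r"| ?(?:[^\s\w]|_)+"       # optional leading space + punctuation/symbols
    r"|\s+(?!\S)"              # whitespace not directly followed by text
    r"|\s+"                    # any remaining whitespace
)


def pre_tokenise(text: str) -> list:
    """Split text into pieces; BPE merges are then applied within each piece."""
    return SPLIT_PATTERN.findall(text)


print(pre_tokenise("Hello, world! It's 2024"))
# ['Hello', ',', ' world', '!', ' It', "'s", ' 2024']
```

Because merges never cross piece boundaries, two tokenisers that share this convention but ship different merge tables still split text at the same places, which is the sense in which interoperability is partial.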

Performance

On a single core, tiktoken encodes roughly 4-6 MB of text per second. That is several times faster than pure-Python tokenisers and meaningfully faster than HuggingFace tokenizers in most benchmarks. For dataset preprocessing, where trillions of tokens must be encoded, the difference compounds into a several-fold saving in total CPU time.
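A back-of-envelope calculation shows how throughput turns into CPU time at dataset scale. The sketch assumes an average of 4 bytes of text per token and compares the 5 MB/s midpoint of the range above with a hypothetical 1 MB/s baseline; both the 4-byte average and the baseline are illustrative assumptions, and `encode_core_hours` is a hypothetical helper name.

```python
def encode_core_hours(num_tokens: float, bytes_per_token: float,
                      throughput_mb_per_s: float) -> float:
    """Estimate single-core hours needed to encode a corpus of num_tokens tokens."""
    total_bytes = num_tokens * bytes_per_token
    seconds = total_bytes / (throughput_mb_per_s * 1e6)
    return seconds / 3600


# One trillion tokens at an assumed 4 bytes per token (about 4 TB of text).
print(round(encode_core_hours(1e12, 4.0, 5.0)))  # 222 core-hours at 5 MB/s
print(round(encode_core_hours(1e12, 4.0, 1.0)))  # 1111 core-hours at 1 MB/s
```

Encoding parallelises across cores, so wall-clock time is roughly these figures divided by the number of cores in use.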

Always count tokens with the same tokeniser the model uses. cl100k_base and o200k_base produce different counts for the same text, sometimes by as much as 20 per cent. For accurate billing predictions, look up the encoding with tiktoken.encoding_for_model() rather than hard-coding one.
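A small helper makes this advice hard to get wrong. The sketch below wraps `tiktoken.encoding_for_model()`, which raises `KeyError` for model names it does not recognise. The helper name `count_tokens` and its `fallback_encoding` parameter are hypothetical, and the fallback is opt-in so that an unknown model never silently produces a count from the wrong vocabulary.

```python
from typing import Optional


def count_tokens(text: str, model: str,
                 fallback_encoding: Optional[str] = None) -> int:
    """Count tokens in `text` using the encoding tiktoken maps to `model`."""
    # Imported here so the module still loads where tiktoken is absent.
    import tiktoken

    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: only guess if the caller explicitly allowed it.
        if fallback_encoding is None:
            raise
        enc = tiktoken.get_encoding(fallback_encoding)
    return len(enc.encode(text))
```

For example, `count_tokens(prompt, "gpt-4")` counts with cl100k_base, while `count_tokens(prompt, "my-finetune", fallback_encoding="o200k_base")` states the guess explicitly. Counts for chat requests also include a few tokens of per-message overhead that this helper does not model.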

References

tiktoken (GitHub) · OpenAI
Language Models are Unsupervised Multitask Learners (GPT-2) · OpenAI
Llama 3 Technical Report · arXiv
