TL;DR
- tiktoken is OpenAI's open-source byte-level BPE tokeniser, written in Rust with Python bindings, optimised for high-throughput encoding.
- It ships pre-trained vocabularies used by GPT-2 (r50k_base), GPT-3.5/GPT-4 (cl100k_base) and GPT-4o (o200k_base).
- OpenAI's README reports encoding roughly 3-6× faster than a comparable open-source tokeniser (HuggingFace's GPT2TokenizerFast), which matters for large-scale prompt counting and dataset preprocessing.
- The cl100k_base vocab introduced strong code-friendly tokenisation; o200k_base extended it to ~200k tokens with much better multilingual efficiency.
What tiktoken Is
tiktoken is a tokenisation library, not a tokenisation algorithm: under the hood it is byte-level BPE, the same algorithm GPT-2 uses. What makes it notable is the implementation, a Rust core that combines regex pre-tokenisation with hash-table merge lookups to achieve very high throughput on long inputs.
It is also the official source of OpenAI's tokeniser vocabularies. If you need to count GPT-4 prompt tokens accurately or replicate GPT-4's exact tokenisation, tiktoken with the cl100k_base encoding is the canonical implementation.
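To make "byte-level BPE" concrete, here is a minimal sketch of the merge loop in plain Python. It is the textbook lowest-rank-first algorithm, not tiktoken's Rust implementation, and `TOY_RANKS` is a hypothetical hand-made vocabulary, not a real one.

```python
def bpe_encode(piece: bytes, ranks: dict[bytes, int]) -> list[int]:
    """Encode one pre-tokenised piece by repeatedly applying the lowest-rank merge."""
    # Start from individual bytes: this is the "byte-level" part.
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        # Find the adjacent pair whose merged form has the lowest rank.
        best_rank, best_idx = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_idx = rank, i
        if best_idx is None:
            break  # no adjacent pair is in the vocabulary
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    # In this scheme a token's id is simply its rank.
    return [ranks[p] for p in parts]


# Hypothetical toy vocabulary: all 256 single bytes plus three learned merges.
TOY_RANKS: dict[bytes, int] = {bytes([i]): i for i in range(256)}
TOY_RANKS.update({b"lo": 256, b"low": 257, b"er": 258})

print(bpe_encode(b"lower", TOY_RANKS))  # [257, 258]
```

Because every single byte is in the vocabulary, any input can be encoded; unseen text simply falls back to more, shorter tokens.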
Pre-trained Encodings
Each successive encoding has been more efficient on common inputs, meaning it needs fewer tokens for the same text. cl100k_base is roughly 30 per cent more efficient than r50k_base on English and code; o200k_base adds roughly another 20 per cent on multilingual and code-heavy inputs by allocating tokens to common patterns in those domains.
| Encoding | Vocab size | Models |
|---|---|---|
| r50k_base / gpt2 | 50,257 | GPT-2, GPT-3 (davinci) |
| p50k_base | 50,281 | Codex (code-davinci-002), text-davinci-002/003 |
| cl100k_base | 100,277 | GPT-3.5, GPT-4, text-embedding-ada-002, text-embedding-3-small/large |
| o200k_base | 200,019 | GPT-4o, GPT-4.1, o-series |
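Efficiency figures like these depend heavily on the text being measured, so it is worth checking them on your own data. The sketch below assumes a hypothetical helper, `token_savings`, that takes any two encode callables (for example the `encode` method of two tiktoken encodings) and reports the fraction of tokens saved.

```python
from typing import Callable, Iterable, Sequence

EncodeFn = Callable[[str], Sequence[int]]


def token_savings(texts: Iterable[str], baseline: EncodeFn, candidate: EncodeFn) -> float:
    """Return the fraction of tokens `candidate` saves relative to `baseline`."""
    base_total = 0
    cand_total = 0
    for text in texts:
        base_total += len(baseline(text))
        cand_total += len(candidate(text))
    if base_total == 0:
        raise ValueError("baseline produced no tokens; cannot compute a ratio")
    return 1.0 - cand_total / base_total


# Usage with real vocabularies (requires tiktoken; vocab files download on first use):
#   import tiktoken
#   old = tiktoken.get_encoding("r50k_base")
#   new = tiktoken.get_encoding("cl100k_base")
#   print(f"{token_savings(corpus, old.encode, new.encode):.0%} fewer tokens")
```

Here `corpus` stands for whatever sample of documents you care about; a negative result would mean the candidate encoding uses more tokens on that sample.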
Using tiktoken
```python
import tiktoken

# Look up an encoding by name...
enc = tiktoken.get_encoding("cl100k_base")
# ...or by the model that uses it.
enc = tiktoken.encoding_for_model("gpt-4")

ids = enc.encode("Hello, world!")  # [9906, 11, 1917, 0]
text = enc.decode(ids)             # "Hello, world!"

# Token count for an arbitrary string.
long_text = "tiktoken is a fast BPE tokeniser. " * 1000
print(len(enc.encode(long_text)))

# Encode many strings in parallel across threads.
ids_batch = enc.encode_batch(["a", "b", "c"], num_threads=8)
```
Use Beyond OpenAI Models
Several open models have adopted tiktoken-compatible or tiktoken-style tokenisers. Llama 3 uses a byte-level BPE built with tiktoken's conventions, with a 128,256-token vocab. Qwen 2 and Qwen 3 use byte-level BPE with vocabs around 150k. DeepSeek-V3 uses ~129k. The interoperability is partial — exact merge tables differ — but the conventions (byte alphabet, regex pre-tokenisation, vocab sizes 100k+) are shared.
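The two shared conventions named above, regex pre-tokenisation and a byte alphabet, can be sketched with the standard library alone. The pattern below is a simplified stand-in in the spirit of the GPT-2 split, not the exact pattern of any model: real vocabularies rely on Unicode property classes that Python's `re` module does not support, and `SPLIT_PATTERN` and `pretokenize` are illustrative names.

```python
import re

# Simplified stand-in for a GPT-2-style split pattern (an assumption, not a real
# model's pattern): letters, digits and punctuation each absorb one leading space.
SPLIT_PATTERN = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"  # common English contractions
    r"| ?[^\W\d_]+"           # optional leading space + letters
    r"| ?\d+"                 # optional leading space + digits
    r"| ?(?:[^\s\w]|_)+"      # optional leading space + punctuation/symbols
    r"|\s+(?!\S)"             # whitespace run, minus its last char if a word follows
    r"|\s+"                   # any remaining whitespace
)


def pretokenize(text: str) -> list[bytes]:
    """Split text into pieces, then map each piece onto the 256-symbol byte alphabet."""
    return [piece.encode("utf-8") for piece in SPLIT_PATTERN.findall(text)]


print(pretokenize("Hello, world!"))  # [b'Hello', b',', b' world', b'!']
```

BPE merges are then learned and applied inside each piece, never across piece boundaries, which is why two tokenisers with the same conventions but different merge tables still produce different ids.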
Performance
On a single core, tiktoken encodes roughly 4-6 MB/s of text, several times faster than pure-Python tokenisers and meaningfully faster than HuggingFace tokenizers in most benchmarks. For dataset preprocessing, where trillions of tokens must be encoded, that throughput difference can separate hours of CPU time from days.
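Throughput varies with hardware, input length and library version, so it is worth measuring on your own machine. The following is a minimal sketch of such a measurement; `throughput_mb_per_s` is a hypothetical helper that accepts any encode callable, so the same harness can time tiktoken or any other tokeniser.

```python
import time
from typing import Callable, Sequence


def throughput_mb_per_s(
    encode: Callable[[str], Sequence[int]],
    texts: Sequence[str],
    repeats: int = 3,
) -> float:
    """Best-of-`repeats` throughput in megabytes of UTF-8 input per second."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for t in texts:
            encode(t)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from the timer on tiny workloads.
    return total_bytes / max(best, 1e-9) / 1e6


# Usage with tiktoken (requires the package to be installed):
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   print(f"{throughput_mb_per_s(enc.encode, docs):.1f} MB/s")
```

Taking the best of several runs reduces noise from caching and scheduling; `docs` stands for a list of representative documents.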
Always count tokens with the same tokeniser the model uses. cl100k_base and o200k_base produce different counts for the same text, sometimes differing by 20 per cent. For accurate billing predictions, use encoding_for_model() so the encoding follows the model name.
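One way to follow that advice is to route every count through the model name rather than a hard-coded encoding. The helper below is a sketch under that assumption: `count_tokens` and its `resolve` parameter are hypothetical names, and by default it defers to tiktoken's `encoding_for_model`.

```python
from typing import Callable, Optional


def count_tokens(text: str, model: str, resolve: Optional[Callable] = None) -> int:
    """Count the tokens in `text` using the encoding registered for `model`."""
    if resolve is None:
        # Imported lazily so this module loads even where tiktoken is absent.
        import tiktoken

        resolve = tiktoken.encoding_for_model
    # tiktoken raises KeyError here for model names it does not recognise.
    encoding = resolve(model)
    return len(encoding.encode(text))


# Usage (requires tiktoken):
#   n = count_tokens("Hello, world!", "gpt-4")
```

Letting the `KeyError` propagate is deliberate in this sketch: silently falling back to another encoding would reintroduce the mismatch the helper exists to prevent.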
References
- tiktoken (GitHub) · OpenAI
- Language Models are Unsupervised Multitask Learners (GPT-2) · OpenAI
- Llama 3 Technical Report · arXiv