TL;DR
- KV-cache memory-management algorithm introduced in vLLM by Kwon et al. (arXiv:2309.06180, September 2023) and presented at SOSP 2023.
- Splits each sequence's KV cache into fixed-size blocks (typically 16 or 32 tokens) allocated on demand from a global GPU memory pool, with a per-sequence block table mapping logical token positions to physical blocks — the exact mental model used by the operating-system virtual-memory page table.
- Eliminates the internal, external and reservation fragmentation that plagued pre-paging runtimes, lifting measured KV-cache memory utilisation from roughly 20-40% to above 95% and unlocking 2-4x higher serving throughput on the same hardware.
- Enables prefix sharing across sequences via content-addressed physical blocks (the foundation of prefix caching and SGLang's RadixAttention), and supports copy-on-write semantics for beam search and parallel sampling with negligible per-step overhead.
- Now standard across every major LLM serving runtime — vLLM, TensorRT-LLM (`paged_kv_cache`), SGLang (FlashInfer paged-KV), TGI, MLC-LLM and llama.cpp. The technique that made high-throughput LLM serving economically viable.
Overview
PagedAttention is the KV-cache memory-management algorithm that turned LLM serving from a memory-bound throughput problem into a manageable scheduling one. It was introduced by Woosuk Kwon, Zhuohan Li and collaborators in the original vLLM paper (arXiv:2309.06180, September 2023, presented at SOSP 2023) and has since become the universal baseline — every modern LLM serving runtime implements either PagedAttention directly or an equivalent block-paged KV scheme with the same operational properties.
The reason it matters: in autoregressive transformer decoding, the KV cache (the per-layer, per-head key and value tensors that record what the model attended to at each previous token) is the dominant non-weight consumer of GPU memory at production batch sizes. For Llama 3.1 70B with grouped-query attention at FP8, each token of KV occupies roughly 160 KB across the model (2 x 80 layers x 8 KV heads x 128 head dimensions x 1 byte), or about 40 KB per GPU under 4-way tensor parallelism; a single 32K-context request consumes ~5 GB in aggregate (~1.3 GB per GPU) before the request even starts producing tokens. How that memory is allocated, freed and shared dictates how many concurrent sequences a GPU can serve, which in turn dictates throughput and unit cost.
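The per-token figure can be checked with a few lines of arithmetic. The sketch below assumes the published Llama 3.1 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and 1 byte per element for FP8. The function name and the 4-way tensor-parallel split are illustrative choices, not part of any runtime API.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed Llama 3.1 70B shape: 80 layers, 8 KV heads (GQA), head_dim 128.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8,
                               head_dim=128, bytes_per_elem=1)  # FP8
tp = 4               # assumed tensor-parallel degree; KV heads split across GPUs
context = 32 * 1024  # tokens in a 32K-context request

print(f"per token, whole model: {per_token / 1024:.0f} KiB")
print(f"per token, per GPU at TP={tp}: {per_token / tp / 1024:.0f} KiB")
print(f"32K context, whole model: {per_token * context / 2**30:.2f} GiB")
print(f"32K context, per GPU: {per_token * context / tp / 2**30:.2f} GiB")
```

Under these assumptions the aggregate footprint is 160 KiB per token (5 GiB at 32K context), which works out to 40 KiB per token and about 1.25 GiB per GPU once the KV heads are sharded four ways.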
PagedAttention treats the KV cache the way an operating system treats process memory: as a logical address space mapped through a page table to physical frames that can live anywhere in physical memory. The result is the same elimination of fragmentation that made virtual memory the foundation of modern operating systems — applied to the GPU, at the granularity of attention blocks, with a per-sequence block table walked by the attention kernel itself. If you are deploying models on Yobibyte, this matters because Yobibyte's default inference engine (vLLM) and its opt-in variants (TensorRT-LLM, SGLang) all implement PagedAttention natively — the block-size, prefix-sharing and preemption-mode trade-offs discussed here are the same trade-offs the Yobibyte console exposes at workspace level.
This entry covers the algorithm, the variants and architectural choices that exist today, the runtimes that implement it, the trade-offs and known limitations, and the practical implementation notes that matter when you tune PagedAttention-based deployments in production. This entry helps you understand PagedAttention so you can pick runtimes, block sizes and KV-cache quantisation intelligently — whether you are tuning raw vLLM, TensorRT-LLM or SGLang on your own cluster, or comparing the engine variants Yobibyte exposes through its managed workspace.
How it works
Before PagedAttention, LLM serving runtimes pre-allocated a contiguous KV-cache region for each in-flight sequence sized to the maximum possible context length. The waste came in three flavours: internal fragmentation (a sequence that completed at 200 tokens still owned memory for thousands of unused positions), external fragmentation (the contiguous-block requirement left awkward gaps that could not be reassembled into a usable allocation), and reservation overhead (memory had to be set aside for tokens that might never be produced, locking it out of the running batch). Measured KV-cache utilisation in pre-paging runtimes typically sat at 20-40%; in other words, more than half the available KV memory was wasted at any given moment.
PagedAttention applies the textbook OS virtual-memory solution. Each sequence has a logical KV cache addressed by token index, divided into fixed-size logical blocks (16 tokens is the default; 32 is sometimes preferred for long-context workloads). A per-sequence block table maps each logical block ID to a physical block in a single global GPU memory pool. When a sequence is admitted, it gets only as many blocks as its prompt needs; each time it grows past a block boundary (every 16 tokens at the default size), it requests one more block from the pool; when it finishes, all its physical blocks return to the pool. There is no per-sequence pre-reservation and no contiguous-block requirement; the only overheads are the small block-table indirection per attention step and at most one partially filled block per sequence.
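The allocation scheme described above can be sketched in a few dozen lines. This is a minimal CPU-side model, not vLLM's actual block manager: the class names `BlockPool` and `Sequence` and their methods are hypothetical, and physical blocks are represented by integer IDs only.

```python
from collections import deque

BLOCK_SIZE = 16  # tokens per block, matching the default discussed above


class BlockPool:
    """Global pool of physical block IDs backed by an O(1) free list."""

    def __init__(self, num_blocks):
        self.free_blocks = deque(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("KV pool exhausted")
        return self.free_blocks.popleft()

    def release(self, block_id):
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one sequence's length and its logical-to-physical block table."""

    def __init__(self, pool):
        self.pool = pool
        self.num_tokens = 0
        self.block_table = []  # index = logical block, value = physical block

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        """Translate a logical token index to (physical block, offset)."""
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    def release_all(self):
        for block_id in self.block_table:
            self.pool.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


pool = BlockPool(num_blocks=8)
a, b = Sequence(pool), Sequence(pool)
for _ in range(17):
    a.append_token()   # 17 tokens need 2 blocks
for _ in range(5):
    b.append_token()   # 5 tokens need 1 block
```

Nothing is reserved up front: sequence `a` holds two blocks, `b` holds one, and the remaining five stay in the pool for whichever sequence grows next.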
The attention kernel itself takes the block table as an additional input. For each attention step, the kernel reads the per-sequence block table to locate the physical blocks for that sequence's K and V tensors, gathers them with strided memory access, and computes the attention output. The original vLLM implementation used a custom CUDA kernel for this gather; modern implementations layer the paged-KV gather on top of FlashAttention-2 (Ampere) and FlashAttention-3 (Hopper / Blackwell), or use the FlashInfer paged-attention library. The kernel cost of the gather is in the low single-digit percent of the overall attention cost, and is more than compensated for by the 2-4x larger achievable batch size.
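To make the block-table walk concrete, here is a pure-Python reference for a single head and a single query. It is only an addressing sketch under simplifying assumptions: a real kernel fuses the gather into the attention computation on the GPU, and the function names, pool layout (`pool[block][offset]`) and tiny block size are hypothetical.

```python
import math

BLOCK_SIZE = 4  # small block size so the example stays readable


def attention(query, keys, values):
    """Plain single-head softmax attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def paged_attention(query, k_pool, v_pool, block_table, seq_len):
    """Gather K/V rows through the block table, then run ordinary attention."""
    keys, values = [], []
    for token_idx in range(seq_len):
        block = block_table[token_idx // BLOCK_SIZE]
        offset = token_idx % BLOCK_SIZE
        keys.append(k_pool[block][offset])
        values.append(v_pool[block][offset])
    return attention(query, keys, values)


# Contiguous reference KV for a 6-token sequence with head dimension 2.
keys = [[0.1 * i, 0.2 * i] for i in range(6)]
values = [[1.0 * i, -1.0 * i] for i in range(6)]

# Scatter the same rows into a 4-block pool at non-adjacent physical blocks.
num_blocks, dim = 4, 2
k_pool = [[[0.0] * dim for _ in range(BLOCK_SIZE)] for _ in range(num_blocks)]
v_pool = [[[0.0] * dim for _ in range(BLOCK_SIZE)] for _ in range(num_blocks)]
block_table = [3, 1]  # logical block 0 -> physical 3, logical block 1 -> physical 1
for i in range(6):
    block, offset = block_table[i // BLOCK_SIZE], i % BLOCK_SIZE
    k_pool[block][offset] = keys[i]
    v_pool[block][offset] = values[i]

query = [0.5, -0.25]
dense = attention(query, keys, values)
paged = paged_attention(query, k_pool, v_pool, block_table, seq_len=6)
```

The paged result matches the contiguous one because the block table only changes where the rows live, not which rows participate or in what order.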
Prefix sharing across sequences falls out naturally from physical-block addressing. If two sequences begin with the same N-token system prompt, the first sequence prefills the prompt and writes the resulting KV blocks into the physical pool; the second sequence's block table can simply point to those same physical blocks and skip the prefill entirely. Reference counting on each physical block tracks how many sequences are using it; the block returns to the free pool only when its refcount drops to zero. This is the mechanism that makes prefix caching cheap and SGLang's cross-request RadixAttention possible.
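The sharing mechanism can be sketched with a chained content hash and a reference count per block. This is an illustrative model, not the code of any particular runtime: `SharedBlockPool`, `build_block_table` and the `prefilled` counter are hypothetical names, only full blocks are shared, and Python's built-in `hash` stands in for the collision-resistant hash a production system would want.

```python
BLOCK_SIZE = 4  # illustrative; small so the example is easy to trace


class SharedBlockPool:
    """Content-addressed pool: full blocks with equal prefixes map to one block."""

    def __init__(self):
        self.next_id = 0
        self.by_hash = {}    # chained content hash -> physical block ID
        self.hash_of = {}    # physical block ID -> chained content hash
        self.refcount = {}   # physical block ID -> number of sequences using it
        self.prefilled = 0   # blocks whose KV actually had to be computed

    def acquire(self, parent_hash, tokens):
        """Return (block_id, hash) for a full block, reusing a match if present."""
        key = hash((parent_hash, tuple(tokens)))
        block_id = self.by_hash.get(key)
        if block_id is None:
            block_id = self.next_id
            self.next_id += 1
            self.by_hash[key] = block_id
            self.hash_of[block_id] = key
            self.refcount[block_id] = 0
            self.prefilled += 1           # cache miss: prefill would run here
        self.refcount[block_id] += 1
        return block_id, key

    def release(self, block_id):
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:  # last user gone: block is free again
            del self.by_hash[self.hash_of.pop(block_id)]
            del self.refcount[block_id]


def build_block_table(pool, token_ids):
    """Map each full block of the prompt to a physical block, sharing prefixes."""
    table, parent = [], None
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block_id, parent = pool.acquire(parent, token_ids[start:start + BLOCK_SIZE])
        table.append(block_id)
    return table


pool = SharedBlockPool()
system_prompt = [11, 12, 13, 14, 15, 16, 17, 18]   # two full blocks
table_a = build_block_table(pool, system_prompt + [1, 2, 3, 4])
table_b = build_block_table(pool, system_prompt + [5, 6, 7, 8])
```

Both block tables start with the same two physical blocks, so the second request only has to prefill its own final block. Chaining the parent hash into each key means a block is shared only when everything before it matches as well.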
Copy-on-write completes the picture for beam search and parallel sampling. When a sequence forks into multiple beams (or n parallel samples), all beams initially share the prefix blocks; their block tables point at the same physical blocks. The moment any beam writes a new token into a last block that is still shared (reference count above one), that block is copied into a fresh physical frame and only that beam's block table is updated. Beams diverge only as they actually produce different tokens, not at the moment of fork.
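A minimal sketch of the fork-then-copy behaviour follows. It models block contents as token lists rather than KV tensors, and the names `CowPool`, `fork` and `append_token` are hypothetical helpers for illustration rather than any runtime's API.

```python
class CowPool:
    """Physical blocks with reference counts; block contents are token lists."""

    def __init__(self):
        self.blocks = {}     # physical block ID -> tokens stored in that block
        self.refcount = {}   # physical block ID -> number of block tables using it
        self.next_id = 0

    def new_block(self, contents=()):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = list(contents)
        self.refcount[block_id] = 1
        return block_id


def fork(pool, block_table):
    """Fork a sequence: copy the table, share every physical block."""
    for block_id in block_table:
        pool.refcount[block_id] += 1
    return list(block_table)


def append_token(pool, block_table, token, block_size=4):
    """Append a token, copying the last block first if it is shared."""
    last = block_table[-1] if block_table else None
    if last is None or len(pool.blocks[last]) == block_size:
        block_table.append(pool.new_block([token]))  # last block full: fresh one
        return
    if pool.refcount[last] > 1:                       # shared: copy on write
        pool.refcount[last] -= 1
        last = pool.new_block(pool.blocks[last])
        block_table[-1] = last
    pool.blocks[last].append(token)


pool = CowPool()
parent = []
for tok in [1, 2, 3, 4, 5, 6]:      # one full block plus a half-full block
    append_token(pool, parent, tok)
child = fork(pool, parent)           # both tables now point at blocks 0 and 1
append_token(pool, child, 70)        # child diverges: block 1 is copied
append_token(pool, parent, 60)       # parent is now sole owner: writes in place
```

Only the partially filled last block is duplicated. The full prefix block stays shared between parent and child for as long as both are alive.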
- Logical KV cache: per-sequence, addressed by token index, divided into fixed-size blocks (default 16 tokens).
- Physical pool: single global GPU memory pool of equally-sized blocks, allocated and freed on demand.
- Block table: per-sequence indirection mapping logical block IDs to physical block IDs.
- Paged-attention kernel: walks the block table at each step to gather K and V from physical blocks; built on FlashAttention-2/3 or FlashInfer.
- Reference counting: each physical block tracks how many sequences map to it; freed when refcount drops to zero.
- Content-addressed blocks: identical token sequences hash to the same physical block, enabling automatic prefix sharing.
- Copy-on-write: forked sequences share blocks until a divergent write triggers a per-block copy.
Block size is the central tuning knob. Smaller blocks (8) reduce internal fragmentation but multiply block-table walk overhead. Larger blocks (32, 64) shrink the block table but waste more memory at sequence end. 16 is the most common default; 32 wins for long contexts; nothing else is generally worth investigating.
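The two sides of that trade-off are easy to quantify for a single sequence. The helper below is a hypothetical illustration that counts block-table entries and unused token slots in the final block; it ignores kernel-level effects, which depend on the attention library.

```python
import math


def block_cost(seq_len, block_size):
    """Return (block-table entries, unused token slots in the last block)."""
    num_blocks = math.ceil(seq_len / block_size)
    wasted_slots = num_blocks * block_size - seq_len
    return num_blocks, wasted_slots


for block_size in (8, 16, 32, 64):
    entries, wasted = block_cost(seq_len=1000, block_size=block_size)
    print(f"block_size={block_size:>2}: {entries:>3} table entries, "
          f"{wasted:>2} unused slots")
```

The waste per sequence is always less than one block, so it grows with block size, while the number of table entries the kernel must walk shrinks in proportion.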
Variants and architectural choices
Although every modern runtime implements PagedAttention conceptually, the specific design choices vary. The five axes that matter in practice are block size, the attention kernel that walks the block table, the prefix-sharing mechanism (intra-sequence only vs cross-request), the eviction policy when the pool fills, and whether the pool is GPU-only or extends to CPU swap.
Block size: 16 tokens is the upstream default in vLLM, TGI and MLC-LLM; llama.cpp allocates KV by slot rather than by fixed-size block. TensorRT-LLM defaults to 64 with 16 / 32 selectable; FlashInfer's paged-attention library supports 8 / 16 / 32 / 64 / 128. For long-context workloads (32K+) on H200 or B200, block size 32 typically wins by reducing the block-table walk frequency at the cost of slightly higher internal fragmentation. For short-context, very high-concurrency workloads (sub-1K contexts, batch sizes above 256), block size 16 wins by keeping fragmentation low.
Attention kernel: the original vLLM paged-attention kernel was a custom CUDA implementation written for the paper. By 2024 the standard had shifted to layering paged-KV gathers on top of FlashAttention-2 (Ampere and earlier) and FlashAttention-3 (Hopper, Blackwell), with FlashInfer (used by SGLang) offering an alternative implementation tuned for grouped-query attention and FP8 KV. TensorRT-LLM uses its own `paged_context_fmha` plugin built on similar primitives. All four are functionally equivalent; the choice is which library is best supported on a given accelerator.
Prefix sharing: intra-sequence prefix sharing (beam search, parallel sampling) is universal. Cross-request prefix sharing — letting two unrelated requests share blocks because their prompts happen to overlap — is implemented by vLLM (`--enable-prefix-caching`), TGI, TensorRT-LLM (`enable_kv_cache_reuse`) and most aggressively by SGLang (RadixAttention, where the entire pool is indexed by a radix tree and sharing happens automatically across every in-flight and recently-completed request). The sharing mechanism is the same; the difference is how cleverly the runtime finds matches.
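As a rough illustration of tree-indexed matching, the sketch below keeps a trie whose edges are whole blocks of token IDs and returns the physical blocks for the longest cached prefix. It is a deliberate simplification and not SGLang's RadixAttention implementation: a real radix tree compresses edges and tracks reference counts and recency, and every name here (`PrefixTrie`, `insert`, `match`) is hypothetical.

```python
class PrefixTrieNode:
    def __init__(self):
        self.children = {}    # tuple of block tokens -> child node
        self.block_id = None  # physical block holding this node's KV


class PrefixTrie:
    """Block-granular prefix index: each edge is one full block of token IDs."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.root = PrefixTrieNode()

    def _chunks(self, token_ids):
        full = len(token_ids) - len(token_ids) % self.block_size
        for start in range(0, full, self.block_size):
            yield tuple(token_ids[start:start + self.block_size])

    def insert(self, token_ids, block_ids):
        """Record which physical block holds each full block of the sequence."""
        node = self.root
        for chunk, block_id in zip(self._chunks(token_ids), block_ids):
            node = node.children.setdefault(chunk, PrefixTrieNode())
            if node.block_id is None:
                node.block_id = block_id

    def match(self, token_ids):
        """Return the physical blocks for the longest cached prefix."""
        node, hits = self.root, []
        for chunk in self._chunks(token_ids):
            node = node.children.get(chunk)
            if node is None:
                break
            hits.append(node.block_id)
        return hits


index = PrefixTrie(block_size=2)
index.insert([1, 2, 3, 4, 5, 6], block_ids=[10, 11, 12])
reused = index.match([1, 2, 3, 4, 9, 9, 9])   # shares the first two blocks
missed = index.match([7, 8, 1, 2])            # no shared prefix
```

A new request walks the tree once and reuses whatever prefix it finds, which is why a tree index can share across any pair of requests without either one declaring its prefix in advance.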
Eviction policy: when the physical pool is full and a new sequence needs a block, something has to go. Least-recently-used (LRU) is the universal default. Some runtimes (SGLang in particular) bias eviction toward keeping high-fanout shared-prefix branches resident over single-sequence tails, which materially improves hit rates on multi-tenant workloads.
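Plain LRU over cached blocks can be sketched with an ordered dictionary. This is a minimal model with hypothetical names (`LruBlockCache`, `touch`, `insert`); it does not implement the fanout-aware bias described above, and a real runtime would only evict blocks whose reference count has dropped to zero.

```python
from collections import OrderedDict


class LruBlockCache:
    """Tracks cached blocks and evicts the least recently used one when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # block key -> physical block ID, oldest first

    def touch(self, key):
        """Look up a cached block; a hit marks it as most recently used."""
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)
        return self.blocks[key]

    def insert(self, key, block_id):
        """Cache a block, returning the evicted (key, block_id) pair if any."""
        evicted = None
        if key not in self.blocks and len(self.blocks) >= self.capacity:
            evicted = self.blocks.popitem(last=False)   # drop the oldest entry
        self.blocks[key] = block_id
        self.blocks.move_to_end(key)
        return evicted


cache = LruBlockCache(capacity=2)
cache.insert("system-prompt", 0)
cache.insert("user-a-tail", 1)
cache.touch("system-prompt")              # shared prefix is reused, so it is fresh
evicted = cache.insert("user-b-tail", 2)  # pool full: oldest entry must go
```

Because the shared system prompt was touched most recently, the single-user tail is the entry that gets evicted, which is the behaviour a prefix-heavy workload wants even from unmodified LRU.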
CPU swap: when the GPU pool fills up under pressure, the runtime can either swap (move blocks out to host memory and back when needed) or recompute (drop the sequence and re-run prefill when it gets scheduled again). vLLM supports both via `--preemption-mode swap|recompute` with `--swap-space` controlling the host budget; TGI defaults to recompute; SGLang prefers recompute. Swap is rarely the right choice on Hopper-class hardware where PCIe bandwidth is the bottleneck; recompute almost always wins.
Where it is used today
By 2026 PagedAttention is the universal baseline. Every production-grade open-source LLM serving runtime implements it, and most cloud and enterprise inference services run on one of those runtimes. The table below summarises which runtime implements PagedAttention under which name, with what default block size, and what level of cross-request sharing. Yobibyte's default inference engine (vLLM) and its opt-in variants (TensorRT-LLM, SGLang) all sit on the paged-KV foundation, so both production workloads on Yobibyte and Yobitel NeoCloud customers running their own engines on H100, H200 and B200 capacity rely on it.
| Runtime | Term used | Default block size | Cross-request sharing | Kernel |
|---|---|---|---|---|
| vLLM | PagedAttention | 16 | Optional via --enable-prefix-caching | FlashAttention-2/3 + custom paged kernel |
| TensorRT-LLM | Paged KV cache | 64 (configurable) | Via enable_kv_cache_reuse | paged_context_fmha plugin |
| SGLang | RadixAttention (cross-request) | 16 | Default: full radix-tree sharing | FlashInfer paged-KV |
| TGI (Hugging Face) | Paged attention | 16 | Optional | FlashAttention-2/3 |
| MLC-LLM | Paged KV cache | 16 | Limited | TVM-generated kernels |
| llama.cpp | KV cache slots | Slot-based (variable) | Within slot | GGML kernels |
| DeepSpeed-MII | Blocked KV cache | 16 | Within engine | FlashAttention |
| NVIDIA NIM | Paged KV cache | Inherited from TRT-LLM | Yes | TRT-LLM plugin |

Two things to note from the table. First, the universal adoption is what makes it safe to assume PagedAttention as the baseline: you do not need to argue for it any more than you need to argue for using virtual memory in an operating system. Second, the differentiation between runtimes is no longer the algorithm; it is the cross-request sharing strategy on top of it. SGLang's RadixAttention and vLLM's prefix caching are both built on the same paged-KV foundation; the difference is how the runtime indexes and reuses physical blocks across unrelated requests.
Trade-offs and known limitations
PagedAttention is almost free in absolute terms but the specific design choices matter in production. The per-step block-table walk adds a few microseconds of latency per attention layer, scaling with sequence length and the number of layers — measurable on synthetic micro-benchmarks but invisible at production batch sizes. The pool-allocation overhead can become noticeable at extreme concurrency (thousands of in-flight sequences) because the free-list lookup grows linearly with the number of allocations per step; vLLM and SGLang both moved to bucketed free lists to keep this O(1) above a few hundred concurrent sequences.
Block size interacts non-trivially with KV-cache quantisation. FP8 KV at block size 16 packs cleanly into the FlashAttention-3 paged kernel; FP4 KV at block size 16 currently has a small alignment penalty and benefits from block size 32 on Blackwell. Block size also interacts with the page-aligned strides used by FlashInfer — switching from 16 to 8 on H100 typically loses 8-12% throughput because the smaller blocks fall off the optimised path in the kernel.
Cross-request prefix sharing has security implications that are sometimes missed. Two tenants on the same engine can in principle observe each other's prefix existence through cache-hit timing: if a prefix is in tenant A's recent history, tenant B will see a faster first-token latency when they submit the same prefix. For workloads where the prefix itself is confidential (clinical notes, legal drafts, identifiable PII in the system prompt), either disable prefix caching (`--no-enable-prefix-caching` in recent vLLM releases, or simply omitting `--enable-prefix-caching` where it is off by default; `--disable-radix-cache` in SGLang) or shard tenants across separate engine processes.
Long-context workloads expose a different trade-off. At 128K context the block table itself becomes large (8,192 blocks at block size 16), and walking it every attention step is no longer free: the cost is real but still small (single-digit percent). The bigger issue is the underlying KV-cache size: 128K of Llama 3.1 70B FP8 KV is ~20 GB per sequence in aggregate (~5 GB per GPU under 4-way tensor parallelism), and even with PagedAttention's elimination of fragmentation, the KV pool left on a 4x H100 SXM5 box after weights bounds you to roughly ten concurrent 128K sequences (about 7 with FP16 weights, about 11 with FP8 weights at a 0.92 memory fraction) before the pool runs out. Paging buys efficiency, not magic.
Practical implementation notes
If you are operating runtimes that implement PagedAttention (which by 2026 is all of them), the tuning surface that matters is small. The five things to set in production are block size, the prefix-sharing toggle, the preemption mode, the KV-cache dtype and the GPU memory utilisation fraction. The Python snippet below shows the canonical configuration on vLLM; the equivalents in TensorRT-LLM, SGLang and TGI follow the same shape with renamed flags.
- Default block size 16 on Hopper; consider 32 on H200 / B200 for 32K+ context workloads.
- Always enable FP8 KV on H100+: it halves the per-token KV footprint relative to FP16 at <0.1 EM regression.
- Set GPU memory utilisation to 0.90-0.92; above 0.95 the KV pool leaves too little headroom for activations and CUDA workspace, and OOMs become unpredictable.
- Enable cross-request prefix sharing unless tenant isolation forbids it; the throughput win is free for workloads with prompt overlap.
- Preemption mode `recompute` beats `swap` on Hopper because PCIe bandwidth is the bottleneck for swap; only consider swap on Grace-Hopper or NVL-coherent systems where host memory is fast.
- Watch `gpu_cache_usage_perc` and `num_preemptions_total` as the paged-KV health metrics; preemptions in steady state mean the pool is undersized.
- For multi-tenant workloads with confidential prefixes, either run one engine per tenant or disable prefix caching and accept the throughput cost.
```python
from vllm import LLM, SamplingParams

# Canonical paged-KV configuration for Llama 3.1 70B on 4x H100 SXM5
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    # Block size: 16 default; bump to 32 for long-context workloads
    block_size=16,
    # KV dtype: fp8 halves the per-token KV footprint, doubles effective pool
    kv_cache_dtype="fp8",
    # GPU pool fraction: leaves headroom for activations and CUDA workspace
    gpu_memory_utilization=0.92,
    # Cross-request sharing: turn on for any workload with shared system prompts
    enable_prefix_caching=True,
    # Preemption mode: recompute almost always beats swap on Hopper
    preemption_mode="recompute",
    # Chunked prefill: interleave prefill chunks with decode; lower p99 latency
    enable_chunked_prefill=True,
    max_model_len=32768,
    max_num_seqs=256,
)

# At runtime, watch the KV-pool metrics. These are the paged-KV equivalent of
# "memory pressure" from an OS standpoint:
#   vllm:gpu_cache_usage_perc  - should sit below ~0.92 in steady state
#   vllm:num_preemptions_total - non-zero means the pool is full and sequences
#                                are being kicked out (recompute) or swapped
#   vllm:prefix_cache_hit_rate - only relevant if enable_prefix_caching=True;
#                                expect 20-50% on chat, 60-90% on agent loops
```

The single biggest mistake operators make is leaving `--enable-prefix-caching` off because they tested on a single-user benchmark and saw no improvement. Turn it on for any production workload with shared system prompts, few-shot examples or repeated tool scaffolds: the gain is 20-50% throughput with zero downside on those workloads.
Where this fits in the Yobitel stack
PagedAttention is the algorithmic foundation of every inference engine that Yobitel runs in production. vLLM, TensorRT-LLM and SGLang — the three engines offered through the Yobibyte platform — all implement it natively, and the Yobitel scheduling and routing layers assume paged-KV semantics throughout. The platform's cost model uses the achieved KV-pool utilisation as a key efficiency metric, and the live capacity plans surfaced on the Yobibyte console derive their concurrency ceilings from the paged-KV budget at the chosen quantisation level.
For sovereign workloads on Yobitel London-1 and Frankfurt-1, the cross-request prefix sharing behaviour built on PagedAttention is explicitly part of the tenant-isolation contract. Customers running confidential workloads against NCSC OFFICIAL handling caveats can elect either single-tenant engines (with full RadixAttention sharing within the tenant) or shared engines with cross-request sharing disabled. The choice is exposed at the workspace level, not buried in engine flags.
References
- Efficient Memory Management for Large Language Model Serving with PagedAttention · arXiv (Kwon et al., 2023)
- vLLM PagedAttention implementation · GitHub (vllm-project)
- FlashAttention with paged KV cache · GitHub (Dao-AILab)
- FlashInfer paged-attention library · GitHub (FlashInfer)
- SGLang RadixAttention (cross-request sharing built on paged KV) · arXiv (Zheng et al., 2023)
- TensorRT-LLM paged_kv_cache reference · NVIDIA