TL;DR
- Scheduling pattern where the batch boundary sits at the token / iteration level rather than the request level — sequences enter and leave the running batch between every forward pass, not only at request start and end.
- Introduced as 'iteration-level scheduling' in Orca (Yu et al., OSDI 2022) and popularised in open source by vLLM (UC Berkeley, June 2023) under the name 'continuous batching'; NVIDIA TensorRT-LLM ships the same technique under 'in-flight batching'.
- Throughput gain over static batching: typically 5-15x on chat-shaped workloads, occasionally 20x+ when output-length variance is extreme; pairs structurally with PagedAttention to make admit/evict-driven KV-cache fragmentation a non-issue.
- Modern variant — chunked prefill — interleaves prefill chunks with decode tokens in the same iteration, eliminating the head-of-line blocking that single-shot prefill would cause when a new request arrives mid-batch.
- Every Yobitel Yobibyte inference workload runs under continuous batching; Omniscient Compute sizes `max-num-seqs`, KV-cache pool size and chunked prefill chunk size from the workspace's stated SLO (p95 TTFT, p99 latency, target tokens/sec) rather than asking the customer to hand-tune flags.
Overview#
Continuous batching is the scheduling technique that turned open-source LLM serving from a research curiosity into a production runtime. Before mid-2023, the conventional batched-inference pattern (pack N requests into a batch, run them in lockstep, wait for the slowest, repeat) left modern GPUs idle for the bulk of every batch because output lengths vary wildly between requests. A 50-token summary batched with a 2,000-token essay holds its slot for a further 1,950 forward passes while producing nothing but padding. The Orca paper (Yu, Jeong, Kim, Chun; OSDI 2022) showed that moving the batch boundary inside the decoding loop, admitting and evicting sequences after every forward pass, recovers that wasted compute without any kernel-level change.
vLLM's June 2023 release was the inflection point: PagedAttention plus continuous batching shipped together as a single open-source runtime, and within six months every other serving stack (TGI, TensorRT-LLM, SGLang, MLC-LLM, MAX) had adopted the same pattern under one name or another. The naming is a quirk of history: Orca called it 'iteration-level scheduling', vLLM and SGLang say 'continuous batching', TensorRT-LLM says 'in-flight batching', and TGI has used 'continuous batching' from the start. The underlying technique is identical in every case: the scheduler runs every iteration, not every request.
As of mid-2026 continuous batching is the table-stakes default; static batching is essentially gone from production LLM serving. The interesting engineering decisions have moved up the stack to chunked prefill chunk sizes, KV-cache eviction policies, swap-versus-recompute preemption trade-offs, and fairness disciplines across multi-tenant workloads. This entry explains the technique well enough to reason about those decisions on a Yobitel Yobibyte workspace or on a self-hosted vLLM / TensorRT-LLM / SGLang deployment, and to size GPUs and concurrency limits correctly for it.
How it works: from request-level to iteration-level scheduling#
A Transformer LLM produces output autoregressively: at every step the model consumes the entire prompt-plus-generated-so-far through the KV cache and emits one next token. Two distinct phases exist within a single request — prefill, where the full prompt is processed in one large parallel forward pass to fill the KV cache; and decode, where each new token requires one forward pass over the existing KV cache plus the new token's embedding.
Static batching pads every request to the longest output in the batch. A naive scheduler picks 16 requests, runs prefill on all 16 simultaneously, then runs decode steps until the longest sequence finishes — by which point the shorter sequences have been emitting padding tokens (or sleeping on a per-position mask) for hundreds of forward passes. KV-cache memory for finished sequences is held until the batch ends. Throughput collapses as output-length variance grows; on real chat traffic, static batching often realises less than 30 percent of the GPU's peak Tensor Core FLOPs.
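The padding cost of static batching can be estimated with a few lines of arithmetic. The sketch below is a toy model, not a measurement: it counts decode slots only, ignores prefill, and the batch contents are hypothetical.

```python
def static_batch_utilisation(output_lengths):
    """Fraction of decode slots doing useful work when a static batch
    runs until its longest sequence finishes."""
    if not output_lengths:
        raise ValueError("need at least one sequence")
    steps = max(output_lengths)                 # batch runs this many decode passes
    total_slots = steps * len(output_lengths)   # one slot per sequence per pass
    useful_slots = sum(output_lengths)          # one useful slot per real token
    return useful_slots / total_slots


# Hypothetical batch: one 2,000-token essay and three shorter replies.
lengths = [2000, 50, 120, 300]
print(f"useful decode slots: {static_batch_utilisation(lengths):.3f}")
```

The wider the spread between the longest output and the rest, the lower this fraction falls, which is the variance effect described above.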
Continuous batching changes the answer in one structural move: the scheduler runs after every forward pass, not at request boundaries. After each iteration the scheduler examines every sequence in the running batch; any that emitted an end-of-sequence token, hit a stop sequence or exhausted their max-tokens budget is evicted, its KV-cache blocks are returned to the pool, and the freed slot is filled from the waiting queue. The newly admitted sequence runs its prefill in the same iteration or the next, depending on whether chunked prefill is on.
From the GPU's perspective every iteration runs whatever batch the scheduler hands it; the kernel is unchanged. From the request's perspective generation proceeds normally; the only observable difference is that the per-request latency does not depend on the slowest sequence in the batch. From the operator's perspective the throughput-versus-latency Pareto curve shifts upward by 5-15x on typical chat shapes, with the gain growing as output-length variance grows.
- Scheduler runs every iteration (every forward pass), not every request — this is the defining structural change.
- Sequences enter the running batch when slots are free and KV-cache budget allows; they leave when they emit EOS, hit a stop token, exceed max-tokens, or are preempted.
- KV cache is the scheduler's primary budget — total GPU-resident KV blocks bound max concurrent sequences. Modern runtimes (vLLM, TGI, TensorRT-LLM) all manage KV cache through PagedAttention-style block pools to make admit/evict cycles cheap.
- Preemption: when memory pressure forces eviction of a still-running sequence, the runtime either recomputes (re-runs prefill on resumption — vLLM default) or swaps KV blocks out to CPU memory and back in on resumption (vLLM `--preemption-mode swap`, TGI default).
- Prefill / decode interaction: chunked prefill interleaves prefill chunks (default 512-2048 tokens) with decode tokens in the same iteration, eliminating head-of-line blocking when a long-prompt request arrives during high decode load.
- Token budget per step is bounded by `max-num-batched-tokens` in vLLM and equivalents in other runtimes; this knob trades prefill responsiveness against decode latency.
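The admit/evict loop described above can be sketched as a small simulation. This is a toy model under stated assumptions, not any runtime's scheduler: prefill is ignored, the KV-cache budget is folded into a single `max_num_seqs` cap, and the function name `simulate` is made up for illustration.

```python
from collections import deque


def simulate(output_lengths, max_num_seqs, continuous=True):
    """Count decode iterations needed to finish every request.

    Each request needs output_lengths[i] decode steps (all assumed >= 1).
    """
    waiting = deque(output_lengths)
    running = []            # remaining tokens per running sequence
    iterations = 0
    while waiting or running:
        # Admission: continuous batching refills free slots every iteration;
        # static batching admits only once the whole batch has drained.
        if continuous or not running:
            while waiting and len(running) < max_num_seqs:
                running.append(waiting.popleft())
        # One forward pass: every running sequence emits one token.
        running = [r - 1 for r in running]
        iterations += 1
        if continuous:
            # Eviction: finished sequences leave immediately.
            running = [r for r in running if r > 0]
        elif all(r <= 0 for r in running):
            # Static batching frees slots only when the longest one is done.
            running = []
    return iterations


lengths = [2000, 50, 50, 50]
print("static    :", simulate(lengths, max_num_seqs=2, continuous=False))
print("continuous:", simulate(lengths, max_num_seqs=2, continuous=True))
```

With two slots, the static run has to start a second batch for the last two short requests, while the continuous run slots them in behind the first short request as it finishes.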
Continuous batching pairs structurally with PagedAttention. Without a paged KV cache, irregular admit/evict cycles leave variable-sized holes in contiguous per-sequence allocations, so memory fragments far faster than it ever did under static batching and usable concurrency on production traffic drops sharply. The two are sold as inseparable for a reason; see the paged-attention entry.
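To make the pairing concrete, here is a minimal sketch of a fixed-size block pool in the spirit of PagedAttention. The class and method names (`BlockPool`, `grow`, `release`) are hypothetical and do not correspond to any runtime's API; the point is only that any free block can serve any sequence, so eviction never leaves an unusable hole.

```python
class BlockPool:
    """Toy KV block pool: sequences own lists of fixed-size blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # any free block is usable
        self.tables = {}                      # seq_id -> list of block ids

    def blocks_needed(self, num_tokens):
        return -(-num_tokens // self.block_size)   # ceiling division

    def grow(self, seq_id, num_tokens):
        """Ensure seq_id owns enough blocks for num_tokens.

        Returns False when the pool is too full; the caller would then
        queue the request or preempt a running sequence.
        """
        table = self.tables.get(seq_id, [])
        extra = max(self.blocks_needed(num_tokens) - len(table), 0)
        if extra > len(self.free):
            return False
        table.extend(self.free.pop() for _ in range(extra))
        self.tables[seq_id] = table
        return True

    def release(self, seq_id):
        """Return every block of a finished or preempted sequence."""
        self.free.extend(self.tables.pop(seq_id, []))


pool = BlockPool(num_blocks=8)
pool.grow("a", 40)          # 3 blocks
pool.grow("b", 64)          # 4 blocks
print(pool.grow("c", 33))   # needs 3 blocks, only 1 free
pool.release("a")           # evict "a"; its blocks return to the pool
print(pool.grow("c", 33))   # now fits, using non-contiguous blocks
```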
Variants and architectural choices#
Three knobs define every continuous-batching deployment: max concurrent sequences, token budget per iteration, and the prefill / decode interleaving policy. Their interaction governs the throughput-versus-latency Pareto curve on which every serving stack operates.
| Knob | vLLM flag | Default | Effect | Tuning guidance |
|---|---|---|---|---|
| Max concurrent sequences | --max-num-seqs | 256 | Hard cap on running batch size | Raise until KV cache is the limiter; typical 256-1,024 for chat |
| Token budget per iteration | --max-num-batched-tokens | auto | Caps prefill + decode tokens per step | Lower to favour decode latency; raise to favour prefill throughput |
| Chunked prefill | --enable-chunked-prefill | true (v0.6+) | Interleave prefill chunks with decode | Almost always on; off only when prefill is the only workload |
| Chunk size | implicit in max-num-batched-tokens | ~512-2,048 tokens | Trades prefill latency for decode jitter | 1,024 is a good production default |
| Preemption mode | --preemption-mode | recompute | How memory pressure is handled | Recompute is simpler; swap helps when prefill is expensive |
| KV cache dtype | --kv-cache-dtype | auto | Halves cache memory at fp8 | fp8 on Hopper/Blackwell increases max concurrency 2x |
| Scheduling policy | (runtime-specific) | FCFS | Order in which waiting sequences are admitted | FCFS for chat; priority for multi-tenant SLOs |
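The link between KV cache dtype and maximum concurrency in the table can be checked with back-of-envelope arithmetic. The sketch below assumes a Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128); the 20 GiB aggregate pool size and the 2,048-token average sequence length are hypothetical inputs chosen for illustration, and real runtimes add block-granularity and other overheads.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_seqs(kv_pool_bytes, avg_seq_tokens, **model):
    """Upper bound on sequences the pool holds at a given average length."""
    return kv_pool_bytes // (kv_bytes_per_token(**model) * avg_seq_tokens)


# Llama 3.1 70B shape: 80 layers, 8 KV heads (GQA), head dimension 128.
LLAMA_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)
POOL = 20 * 1024**3    # hypothetical aggregate KV pool left after weights
AVG_TOKENS = 2048      # hypothetical average prompt + output length

for name, dtype_bytes in (("fp16", 2), ("fp8", 1)):
    per_token = kv_bytes_per_token(dtype_bytes=dtype_bytes, **LLAMA_70B)
    seqs = max_concurrent_seqs(POOL, AVG_TOKENS, dtype_bytes=dtype_bytes, **LLAMA_70B)
    print(f"{name}: {per_token} bytes/token, up to {seqs} sequences")
```

Halving the bytes per token doubles the sequence bound for the same pool, which is where the 2x concurrency figure for fp8 comes from.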
When to use continuous batching versus alternatives#
Continuous batching is the right default for almost every LLM serving workload. The cases where it should be disabled or replaced are narrow: offline batch scoring where every prompt is the same length and there is no live traffic to multiplex (static batching is marginally faster because it avoids scheduler overhead), and embedding inference where the model has no decode phase at all (batched encode is the entire workload).
For multi-tenant serving where different customers have different SLOs, the question is which scheduling policy to layer on top of continuous batching, not whether to use it. First-come-first-served (the vLLM default) is simple and works for single-SLO workloads. Priority-based scheduling, weighted fair queueing and per-tenant rate limits are all built on top of the same iteration-level scheduler.
For prefill-heavy workloads (large-document summarisation, code review over long contexts, RAG with very long retrieved chunks) chunked prefill is the make-or-break setting. Without it, a single 64,000-token prompt arriving during high decode load monopolises the GPU for a full prefill (several seconds), stalling every concurrent decoder so that p99 inter-token latency spikes, along with TTFT for any request still waiting in the queue. With chunked prefill at 1,024-token chunks, the prefill interleaves with decode and the stall added to any single iteration stays bounded.
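How a 64,000-token prompt is spread across iterations can be sketched as follows. This is an illustrative model, assuming each running decoder takes one token of the per-iteration budget first and the remainder goes to the pending prefill; the function name `chunked_prefill_plan` and the budget of 1,088 tokens with 64 decoders are hypothetical, picked so the prefill share comes out at 1,024 tokens.

```python
def chunked_prefill_plan(prompt_tokens, max_num_batched_tokens, num_decoding):
    """Return the prefill chunk size used in each iteration.

    Assumes decode tokens are scheduled first (one per running sequence)
    and whatever budget remains is spent on the pending prefill.
    """
    budget = max_num_batched_tokens - num_decoding
    if budget <= 0:
        raise ValueError("token budget is fully consumed by decode")
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(budget, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


plan = chunked_prefill_plan(64_000, max_num_batched_tokens=1_088, num_decoding=64)
print(len(plan), "iterations; first chunk", plan[0], "last chunk", plan[-1])
```

Every one of those iterations still advances all 64 decoders by one token, which is what keeps their latency bounded while the long prompt is absorbed.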
Yobitel's Yobibyte platform runs every inference workload under continuous batching. Customers do not pick the scheduling policy; they declare the workspace SLO (p95 TTFT, p99 latency, throughput target) and Omniscient Compute sizes max-num-seqs, max-num-batched-tokens, KV cache dtype and chunked prefill chunk size to meet it. The internal selection logic is not exposed, but the resulting OpenAI-compatible endpoint behaviour — including the SLO it meets — is contracted to the customer.
Trade-offs and known limitations#
Continuous batching is not free. The scheduler runs at every iteration, so per-step Python overhead is non-trivial; on small models (1B-7B) at very high throughput, scheduler overhead can rival kernel time. Modern runtimes mitigate this with multi-step scheduling (vLLM's `--num-scheduler-steps`, default 1, typical production 8-16) — the worker executes N forward passes per scheduler invocation, amortising the Python cost.
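The amortisation effect of multi-step scheduling is simple to quantify. The timings below are hypothetical placeholders, not measurements; the sketch only shows how the overhead share shrinks as more forward passes run per scheduler invocation.

```python
def scheduler_overhead_fraction(sched_ms, kernel_ms, num_scheduler_steps=1):
    """Share of wall-clock time spent in the scheduler when it runs once
    per num_scheduler_steps forward passes."""
    return sched_ms / (sched_ms + num_scheduler_steps * kernel_ms)


# Hypothetical small-model timings: 4 ms of scheduling, 8 ms per forward pass.
for steps in (1, 8, 16):
    frac = scheduler_overhead_fraction(4.0, 8.0, steps)
    print(f"steps={steps:2d}  overhead={frac:.3f}")
```

The trade-off is that admissions and evictions can only happen at scheduler invocations, so a larger step count delays both slightly.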
Tail latency is more variable than under static batching. A new request admitted into a partially full batch shares GPU resources with whatever is already running; if the existing batch is decode-heavy and the new request triggers a large prefill, the existing decoders see a one-iteration latency bump. Chunked prefill smooths this but does not eliminate it.
Preemption is a sharp edge. When KV-cache memory is exhausted under bursty load, the runtime must evict a still-running sequence. Recompute mode wastes the work already done on that sequence; swap mode adds PCIe round-trip latency on resume. Either way, the affected request's perceived latency spikes. Sizing for peak concurrency rather than steady-state is the operational lesson.
Fairness across tenants is non-trivial. Naive FCFS scheduling lets a chatty tenant monopolise the running batch by submitting many short requests; weighted fair queueing requires explicit tenant identification and quota tracking. Production multi-tenant stacks usually layer their own admission control on top of the runtime scheduler.
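A minimal sketch of an admission layer that sits in front of the runtime scheduler is shown below. It uses plain round-robin across per-tenant queues, which is a simpler stand-in for weighted fair queueing; the function name and the `(tenant, request_id)` tuple shape are assumptions for illustration.

```python
from collections import deque


def round_robin_admit(arrivals, slots):
    """Pick up to `slots` requests, taking one per tenant in turn.

    arrivals: list of (tenant, request_id) in arrival order.
    """
    queues = {}                                   # dict keeps first-seen tenant order
    for tenant, request_id in arrivals:
        queues.setdefault(tenant, deque()).append(request_id)
    admitted = []
    while len(admitted) < slots and any(queues.values()):
        for tenant, queue in queues.items():
            if queue and len(admitted) < slots:
                admitted.append((tenant, queue.popleft()))
    return admitted


# Tenant "a" floods the queue before tenant "b" submits a single request.
arrivals = [("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 5)]
print("FCFS       :", arrivals[:2])
print("round-robin:", round_robin_admit(arrivals, slots=2))
```

Under FCFS both free slots go to tenant "a"; with round-robin, tenant "b" gets one of them despite arriving last. Real weighted fair queueing would additionally track per-tenant weights and consumed tokens.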
Prefill-heavy workloads with very long contexts (>128k tokens) approach a different limit: each prefill chunk must attend over every token already processed, so the later chunks become expensive enough that even chunked prefill produces visible iteration jitter. At this scale, splitting the prefill across multiple GPUs via tensor or pipeline parallelism becomes necessary, and the scheduler's continuous-batching role becomes secondary to the distributed prefill execution.
If your p99 inter-token latency degrades under load while p50 looks fine, the most common cause is a chunked-prefill chunk size that is too large for the workload. Lower the chunk size (vLLM: lower `--max-num-batched-tokens`) until p99 inter-token latency recovers; the cost is lower prefill throughput in steady state, and with it somewhat higher TTFT for long prompts.
Practical implementation notes#
vLLM v0.8 ships continuous batching on by default; it is not a flag you turn on, it is the runtime. The flags above tune its behaviour rather than enable it. TensorRT-LLM exposes the same technique as `inflight_batching` in its engine build configuration. SGLang, TGI and MLC-LLM all expose equivalent knobs under slightly different names.
The snippet below shows the production-ready set of flags for a chat-shaped workload on 2x H100 SXM5 with FP8 KV cache. It is the same shape Yobibyte produces automatically for a customer who declares a chat SLO; the difference is that on Yobibyte the customer does not see, type or maintain these flags — they declare the SLO and the platform produces the configuration.
# Continuous batching, chunked prefill, FP8 KV cache: production chat shape
# Self-hosted; Yobibyte handles equivalent sizing from a workspace SLO
#
# Flags are grouped in this order. The group labels live up here because a
# comment placed between backslash-continued lines ends the command early.
#   1. model, tensor parallelism, context length
#   2. continuous batching knobs
#   3. PagedAttention / prefix cache / FP8 KV
#   4. network
# Flag availability varies between vLLM releases; check `vllm serve --help`.
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 4096 \
  --enable-chunked-prefill \
  --num-scheduler-steps 16 \
  --preemption-mode recompute \
  --enable-prefix-caching \
  --block-size 16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# Observability: watch these Prometheus metrics together:
# vllm:gpu_cache_usage_perc (KV pool occupancy)
# vllm:num_requests_running (running batch size)
# vllm:num_requests_waiting (admission queue depth)
# vllm:time_to_first_token_seconds (TTFT histogram)
# vllm:e2e_request_latency_seconds (end-to-end latency)
# If gpu_cache_usage_perc is consistently >95 percent and num_requests_waiting
# is non-zero, the system is KV-cache bound: reduce max-num-seqs, switch to
# fp8 KV, enable prefix caching, or scale horizontally with another replica.
Where continuous batching fits in the Yobitel stack#
Every inference workload on Yobitel's Yobibyte platform runs under continuous batching. The technique is not a customer-tunable flag on Yobibyte; it is structural to how the platform operates. Customers declare a workspace SLO (p95 TTFT, p99 latency, target tokens-per-second, peak concurrency) and Omniscient Compute selects the runtime configuration — including max-num-seqs, max-num-batched-tokens, KV cache dtype, chunked prefill chunk size and scheduling policy — to meet it. The customer's observable surface is the OpenAI-compatible endpoint and the workspace dashboard; the internal selection logic is not exposed.
Yobitel NeoCloud's H100 SXM5, H200 SXM5 and B200 SXM6 SKUs are sized in part around the continuous-batching working set: the inventory mix favours configurations where a single instance can host one or two large models with enough KV-cache budget to sustain hundreds of concurrent sequences. The sizing analysis in the kv-cache and paged-attention entries assumes continuous batching is on; under static batching the same SKU would serve a fraction of the concurrency.
Yobitel's InferenceBench publishes tokens-per-second, time-to-first-token and p99 latency for each model / GPU / runtime combination under realistic continuous-batching load shapes (varied prompt lengths, varied output lengths, mixed concurrency). The benchmark methodology is fixed; the configurations are reproducible; the resulting Pareto curves let teams selecting a self-hosted runtime see what continuous batching delivers in practice rather than in theory. For teams choosing between Yobibyte managed serving and self-hosting, InferenceBench is the empirical bridge between the two.
References
- Orca: A Distributed Serving System for Transformer-Based Generative Models · USENIX OSDI 2022 (Yu et al.)
- Efficient Memory Management for Large Language Model Serving with PagedAttention · arXiv (Kwon et al., 2023, vLLM)
- vLLM Documentation · vLLM
- TensorRT-LLM In-Flight Batching · NVIDIA
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills · arXiv (Agrawal et al., 2023)