TL;DR
- Server feature that collects multiple inbound requests and processes them as one batch on the accelerator.
- Distinct from continuous batching: dynamic batching is request-level (a batch is formed once and runs to completion as a unit), while continuous batching is token-level (requests are admitted and evicted between iterations).
- Standard in Triton Inference Server, TorchServe, BentoML, KServe and most general-purpose model servers.
- Particularly effective for vision and embedding workloads where each request runs to completion as a fixed-shape forward pass.
Overview
Dynamic batching is the classical model-server pattern: the server holds a small queue of inbound requests, gathers them into a batch when the queue is large enough (or after a short delay), and dispatches the batch to the model in a single forward pass. The result is amortised per-request overhead and higher GPU utilisation versus running every request individually.
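The gather step described above can be sketched with Python's standard `queue` module. This is a minimal illustration under stated assumptions, not any particular server's implementation: the function name `collect_batch` and its parameters are hypothetical, and a real server would hand the returned batch to the model in a single forward pass.

```python
import queue
import time


def collect_batch(q, max_batch_size, max_delay_s):
    """Gather requests until the batch is full or the delay budget is spent.

    Hypothetical helper: blocks for the first request, then keeps pulling
    from the queue until either limit is reached.
    """
    batch = [q.get()]  # wait for at least one request
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # delay budget spent: dispatch an undersized batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived in time
    return batch


# Five requests are already waiting; the batch limit is four.
q = queue.Queue()
for request_id in range(5):
    q.put(request_id)

first = collect_batch(q, max_batch_size=4, max_delay_s=0.01)
second = collect_batch(q, max_batch_size=4, max_delay_s=0.01)
print(first, second)
```

The first call returns as soon as the batch is full; the second waits out the delay and then returns the single leftover request as an undersized batch.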
The pattern predates LLM serving and is most associated with general-purpose servers like Triton Inference Server, TorchServe and BentoML. It remains the right tool for vision, embedding, classification, recommendation and other workloads where each request has a fixed shape and runs to completion.
Configuration Knobs
- Maximum batch size — upper bound; the server never collects more than this.
- Preferred batch sizes — the server tries to dispatch at one of these sizes (e.g. 1, 4, 8, 16, 32) which may be tuned for kernel performance.
- Maximum queue delay — the longest the server waits before dispatching an undersized batch.
- Priority / preemption — optional policies for prioritising certain request classes.
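One way to see how the first three knobs interact is a small dispatch-decision function. The policy below is a deliberately simplified sketch, not the scheduler of any specific server; `BatchingConfig` and `batch_size_to_dispatch` are hypothetical names introduced for illustration, and priority handling is left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BatchingConfig:
    # Hypothetical container for the knobs listed above.
    max_batch_size: int
    preferred_batch_sizes: Tuple[int, ...]
    max_queue_delay_us: int


def batch_size_to_dispatch(queued, oldest_wait_us, cfg):
    """Return how many requests to dispatch now, or 0 to keep waiting."""
    usable = min(queued, cfg.max_batch_size)  # never exceed the upper bound
    fitting = [s for s in cfg.preferred_batch_sizes if s <= usable]
    if fitting:
        return max(fitting)  # largest preferred size that can be filled
    if usable > 0 and oldest_wait_us >= cfg.max_queue_delay_us:
        return usable  # delay expired: send an undersized batch
    return 0  # keep collecting


cfg = BatchingConfig(
    max_batch_size=32,
    preferred_batch_sizes=(4, 8, 16),
    max_queue_delay_us=50_000,
)
print(batch_size_to_dispatch(3, 10_000, cfg))  # too few, delay not reached
print(batch_size_to_dispatch(3, 50_000, cfg))  # delay reached
print(batch_size_to_dispatch(9, 0, cfg))       # a preferred size fits
```

Under this simplified policy, the three calls return 0, 3 and 8 respectively.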
Configuration Example
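In Triton, dynamic batching is configured per model in `config.pbtxt`. A vision model with a 30 ms inference time might use a 50 ms max queue delay and preferred batch sizes of 4, 8 and 16: the server dispatches at one of those sizes when enough requests are queued, never exceeds the model's `max_batch_size` of 32, and waits at most 50 ms to fill a batch before dispatching whatever it has collected.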
```
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 50000
}
max_batch_size: 32
```
Versus Continuous Batching
Dynamic batching forms a batch once, and that batch runs to completion as a unit: no request joins or leaves while it is in flight. Continuous batching, used by LLM runtimes, admits and evicts requests at every iteration. For workloads where each request runs to completion in tens of milliseconds (vision, embeddings), dynamic batching is the right pattern; for autoregressive LLM workloads, where requests last seconds and have wildly different output lengths, continuous batching dominates.
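The difference can be illustrated with a toy step count. The sketch below assumes each request needs a known number of model iterations and that the accelerator has a fixed number of slots; both functions and the example lengths are hypothetical, and real schedulers account for much more (memory, prefill cost, priorities).

```python
def request_level_steps(lengths, slots):
    """Batches run to completion: each one lasts as long as its longest request."""
    return sum(
        max(lengths[i:i + slots]) for i in range(0, len(lengths), slots)
    )


def iteration_level_steps(lengths, slots):
    """A finished request frees its slot, and a waiting one is admitted next step."""
    waiting = list(lengths)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.pop(0))  # admit into a free slot
        steps += 1  # one model iteration for everything running
        running = [r - 1 for r in running if r > 1]  # evict finished requests
    return steps


varied = [2, 10, 3, 5]   # LLM-like: very different output lengths
uniform = [1, 1, 1, 1]   # vision-like: one forward pass each

print(request_level_steps(varied, slots=2), iteration_level_steps(varied, slots=2))
print(request_level_steps(uniform, slots=2), iteration_level_steps(uniform, slots=2))
```

With varied lengths the request-level schedule takes 15 steps against 10 for the iteration-level one, because short requests sit idle until the longest member of their batch finishes. With uniform single-pass requests both take 2 steps, which is why iteration-level scheduling brings nothing extra to fixed-shape workloads.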
When a model server fronts both an LLM and a vision model, dynamic batching is configured for the vision model while the LLM runtime handles its own iteration-level scheduling internally.
Tuning
Throughput rises with batch size up to the kernel's saturation point; latency rises with queue delay. The right balance is workload-specific. Useful starting points are preferred batch sizes that match the model's compute sweet spot (often powers of two) and a max queue delay short enough to keep p99 latency within its budget.
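The trade-off can be made concrete with a toy cost model. The sketch below assumes a batch takes a fixed overhead plus a per-item cost, which is a simplification rather than a measured property of any model; the function names and the numbers (8 ms overhead, 2 ms per item, 5 ms queue delay) are purely illustrative.

```python
def batch_time_ms(batch_size, fixed_ms, per_item_ms):
    """Assumed linear cost model: fixed overhead plus per-item work."""
    return fixed_ms + per_item_ms * batch_size


def throughput_rps(batch_size, fixed_ms, per_item_ms):
    """Requests per second when batches of this size run back to back."""
    return 1000.0 * batch_size / batch_time_ms(batch_size, fixed_ms, per_item_ms)


def worst_case_latency_ms(batch_size, max_queue_delay_ms, fixed_ms, per_item_ms):
    """A request that waits the full queue delay, then runs in the batch."""
    return max_queue_delay_ms + batch_time_ms(batch_size, fixed_ms, per_item_ms)


for size in (1, 4, 16):
    print(
        size,
        throughput_rps(size, fixed_ms=8, per_item_ms=2),
        worst_case_latency_ms(size, max_queue_delay_ms=5, fixed_ms=8, per_item_ms=2),
    )
```

Under these assumptions, moving from batch size 1 to 16 raises throughput from 100 to 400 requests per second while worst-case latency grows from 15 ms to 45 ms. Throughput in this model can never exceed `1000 / per_item_ms`, a crude stand-in for the saturation point.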
Profile under realistic load with the dispatch latency and batch-size histograms exposed by the server — those tell the truth about how well the configuration matches the actual traffic pattern.
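As a rough illustration of reading such a histogram, the sketch below summarises a list of dispatched batch sizes. The helper names and the sample data are hypothetical; how the sizes are actually obtained depends on the metrics each server exposes.

```python
from collections import Counter


def batch_size_histogram(dispatched_sizes):
    """Count how often each batch size was dispatched, sorted by size."""
    return dict(sorted(Counter(dispatched_sizes).items()))


def mean_batch_size(dispatched_sizes):
    """Average number of requests per dispatched batch."""
    return sum(dispatched_sizes) / len(dispatched_sizes)


# Hypothetical sample: sizes of eight consecutive dispatched batches.
observed = [1, 1, 2, 1, 4, 1, 1, 2]

print(batch_size_histogram(observed))
print(mean_batch_size(observed))
```

A histogram dominated by size-1 batches, as in this made-up sample, would suggest that traffic is too sparse for the configured delay to fill the preferred sizes, so the queue delay adds latency without buying much throughput.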
When to Use
Use dynamic batching for any vision, audio, embedding or classical-ML serving workload. For LLMs use continuous batching instead. Some servers (Triton in particular) let the two coexist on the same endpoint by routing each model through the appropriate scheduler.
References
- Triton Dynamic Batching Documentation · NVIDIA
- TorchServe Batch Inference · PyTorch
- BentoML Adaptive Batching · BentoML