TensorRT-LLM

TL;DR

Open-source LLM inference library from NVIDIA, first released October 2023 under Apache 2.0, that compiles Transformer architectures into TensorRT engines for the lowest latency and highest throughput achievable on NVIDIA GPUs.
Pairs hand-tuned CUDA, CUTLASS and FlashAttention-3 kernels with the FP8 Transformer Engine, FP4 on Blackwell, in-flight batching, paged KV cache, speculative decoding (draft, Medusa, EAGLE-2, lookahead) and custom AllReduce to push H100, H200 and B200 utilisation toward theoretical limits.
Not a server — an engine compiler. The build step (`trtllm-build`) bakes batch size, sequence length, parallelism and precision into a per-GPU engine binary; production deployments host that engine behind Triton Inference Server using the `tensorrtllm_backend`, which provides HTTP / gRPC, scheduling, metrics and model versioning.
Throughput uplift over vLLM on the same Hopper hardware is typically 1.4-1.8x at matched latency budgets — on Llama 3.1 70B FP8 with TP=4 on H100 SXM5, ~6,200 sustained output tok/s versus vLLM's ~3,800, at p50 TTFT under 180 ms. Translates to roughly 40 percent lower $/M tokens for stable, slow-rotating models.
Offered inside Yobitel's Yobibyte platform as an opt-in performance variant for stable production endpoints; continuously scored against vLLM and SGLang by Omniscient Compute on InferenceBench across H100 SXM5, H200 and B200 tenancies.

Overview

TensorRT-LLM is NVIDIA's official inference compiler and runtime for large language models. It sits on top of the long-standing TensorRT graph compiler and adds three things: a Python API specialised for Transformer blocks (attention, RMSNorm, RoPE, GQA, MoE), an extensive library of pre-built model definitions covering most production-relevant open weights, and a C++ batch manager that handles in-flight batching, paged KV cache, beam search and speculative decoding. The aim is to take a HuggingFace or NeMo checkpoint and emit a serialised engine that exploits every architectural feature the underlying NVIDIA silicon offers — FP8 on Hopper, FP4 and second-generation Transformer Engine on Blackwell, NVLink-aware collectives, tensor cores, MIG slices.

The crucial conceptual distinction from vLLM, SGLang or TGI is that TensorRT-LLM is not a server. It is a compiler that produces engine binaries plus a runtime that loads them. Production deployments almost always wrap that runtime in NVIDIA Triton Inference Server using the `tensorrtllm_backend`, which provides the HTTP and gRPC API surface, request queuing, model versioning, multi-instance support and Prometheus metrics. Separating compile and serve is the design choice that buys the last 30-40 percent of performance — kernel selection, autotuning and memory layout can all assume fixed shapes — and pays for it with operational complexity.

Where vLLM optimises for breadth and developer ergonomics (one Python command, every new model on day one), TensorRT-LLM optimises for the absolute floor of latency at the highest sustainable throughput on NVIDIA silicon. The library is Apache 2.0 but is tightly coupled to the CUDA toolchain (CUDA 12.4+, cuDNN 9.x, TensorRT 10.x, NCCL 2.21+) and is not portable to AMD ROCm, Intel Gaudi or AWS Neuron. By mid-2026 it ships as a Python wheel, an NGC container (`nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3`) and a stand-alone C++ runtime; the release cadence is roughly monthly with new model architectures lagging vLLM by one to four weeks. Yobibyte exposes TensorRT-LLM as an opt-in performance variant — Yobitel customers stand up a workload first on the vLLM default and promote stable production endpoints to a TensorRT-LLM-backed engine through the managed workspace, never touching `trtllm-build` or Triton config files directly.

This entry documents the production surface: the build CLI and Python API, the compilation pipeline and runtime mechanics, the parallelism strategies, the Triton deployment shape, the limits and quotas, the observability hooks, and the sizing, cost and migration models. It applies whether you operate raw upstream TensorRT-LLM on your own NVIDIA fleet or consume it as the Yobibyte opt-in performance variant.

Quick start

The example below takes Llama 3.1 70B from a HuggingFace checkpoint to an FP8 TensorRT engine sharded across 4x H100 SXM5, then serves it behind Triton with the TensorRT-LLM backend. Stage 1 converts the HF checkpoint into a TensorRT-LLM checkpoint with the desired tensor-parallel layout. Stage 2 runs `trtllm-build` to emit one engine binary per GPU rank with batch size, sequence length and precision baked in. Stage 3 packages the engines into a Triton model repository and launches `tritonserver`. Stage 4 sends a test request to Triton's HTTP generate endpoint for the `ensemble` model; an OpenAI-compatible API would instead go through the separate `openai_frontend`.

bash

# 0. Install TensorRT-LLM (or use the NGC container)
pip install "tensorrt-llm==0.14.0" --extra-index-url https://pypi.nvidia.com

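# 1. Quantise the HuggingFace checkpoint to an FP8 TRT-LLM checkpoint (TP=4).
#    FP8 calibration goes through examples/quantization/quantize.py rather than
#    the per-model convert_checkpoint.py; verify flag names against your release.
python examples/quantization/quantize.py \
    --model_dir ./llama-3.1-70b-instruct-hf \
    --output_dir ./ckpt/llama3-70b-fp8-tp4 \
    --dtype bfloat16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --tp_size 4 \
    --calib_dataset cnn_dailymail \
    --calib_size 512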

# 2. Build per-rank engines with shapes baked in
trtllm-build \
    --checkpoint_dir ./ckpt/llama3-70b-fp8-tp4 \
    --output_dir ./engines/llama3-70b-fp8-tp4 \
    --gemm_plugin fp8 \
    --use_fp8_context_fmha enable \
    --use_paged_context_fmha enable \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --use_fused_mlp enable \
    --max_batch_size 64 \
    --max_input_len 16384 \
    --max_output_len 2048 \
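    --max_num_tokens 16384 \
    --workers 4
# kv_cache_free_gpu_mem_fraction (for example 0.92) is a runtime setting: put it in
# the Triton config.pbtxt for the TensorRT-LLM model, not on the trtllm-build line.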

# 3. Stage into a Triton model repository and launch
cp -r ./engines/llama3-70b-fp8-tp4/* triton_models/llama3_70b/1/
tritonserver --model-repository=./triton_models \
    --grpc-port=8001 --http-port=8000 --metrics-port=8002

# 4. Send a test request to Triton's generate endpoint (ensemble model)
curl http://localhost:8000/v2/models/ensemble/generate \
    -H "Content-Type: application/json" \
    -d '{
      "text_input": "Summarise FlashAttention-3 in 2 lines.",
      "max_tokens": 128,
      "temperature": 0.0
    }'

Engine binaries are pinned to (model architecture, weight precision, tensor-parallel size, max batch, max input, max output, GPU SM, TensorRT version, plugin set). Change any of those and you rebuild. Treat engine builds as a first-class CI artefact, version them, and never hand-edit.
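One way to treat engines as versioned CI artefacts is to derive a build ID from the parameters an engine is pinned to. The sketch below is a minimal illustration and not part of TensorRT-LLM: the field names, the `engine_build_id` helper and the example values are all hypothetical.

```python
import hashlib
import json

# Hypothetical field names mirroring the tuple an engine is pinned to.
ENGINE_KEY_FIELDS = (
    "architecture", "precision", "tp_size", "max_batch_size",
    "max_input_len", "max_output_len", "gpu_sm", "tensorrt_version", "plugins",
)

def engine_build_id(params: dict) -> str:
    """Return a short, stable fingerprint of the engine-defining parameters."""
    missing = [f for f in ENGINE_KEY_FIELDS if f not in params]
    if missing:
        raise ValueError(f"missing engine key fields: {missing}")
    canonical = {f: params[f] for f in ENGINE_KEY_FIELDS}
    # Sort plugin names so that ordering differences do not change the ID.
    canonical["plugins"] = sorted(canonical["plugins"])
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]

# Placeholder values for illustration only.
EXAMPLE_PARAMS = {
    "architecture": "llama",
    "precision": "fp8",
    "tp_size": 4,
    "max_batch_size": 64,
    "max_input_len": 16384,
    "max_output_len": 2048,
    "gpu_sm": "sm90",
    "tensorrt_version": "10.x",
    "plugins": ["paged_kv_cache", "gemm_plugin=fp8"],
}

if __name__ == "__main__":
    print(engine_build_id(EXAMPLE_PARAMS))
```

Tagging each engine directory with such an ID makes it obvious when a requested configuration has no matching artefact and a rebuild is required.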

How it works

A TensorRT-LLM workflow has three stages, executed once per (model, precision, layout, target GPU) tuple. First, a HuggingFace or NeMo checkpoint is converted into a TensorRT-LLM checkpoint via the model-specific `convert_checkpoint.py` — a sharded directory of weights plus a `config.json` that records vocabulary size, hidden dimension, number of layers, GQA group count, RoPE parameters and the parallelism layout. Second, `trtllm-build` consumes that checkpoint and a set of build-time hyperparameters (max batch, max input length, max output length, beam width, quantisation mode, plugin selection) and emits one serialised engine binary per GPU rank. Third, the engines are loaded into the TensorRT-LLM runtime — typically inside a Triton model instance — which exposes the model behind a request queue managed by the C++ batch manager.

At runtime the batch manager implements in-flight batching: every iteration, it admits new requests from the queue if the paged KV cache has free blocks, advances all running sequences by one decoded token, evicts completed sequences, and triggers the next forward pass. Forward execution uses FlashAttention-3 on Hopper and Blackwell (via the `paged_context_fmha` plugin), with a paged-KV variant that gathers K and V from the block pool through per-sequence block tables. The KV pool is sized at engine load: `kv_cache_free_gpu_mem_fraction` x (device memory - weights - activation working set) gives the pool in bytes, and dividing that by the bytes per block (tokens per block x KV bytes per token) gives the number of blocks.
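The admit, advance, evict loop described above can be sketched as a toy scheduler. This is a simplified model under stated assumptions, not the C++ batch manager: each request reserves a fixed number of KV blocks for its lifetime, and the names `Seq` and `run_inflight_batching` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(eq=False)
class Seq:
    name: str
    kv_blocks: int     # KV blocks reserved on admission (toy: fixed for the lifetime)
    tokens_left: int   # decode steps remaining

def run_inflight_batching(requests, total_blocks, max_batch_size):
    """Toy iteration-level scheduler. Returns {name: iteration at which it finished}."""
    queue = deque(requests)
    running, free_blocks, finished_at, step = [], total_blocks, {}, 0
    while queue or running:
        # 1. Admit queued requests while batch slots and KV blocks are free.
        while (queue and len(running) < max_batch_size
               and queue[0].kv_blocks <= free_blocks):
            seq = queue.popleft()
            free_blocks -= seq.kv_blocks
            running.append(seq)
        if not running:
            raise RuntimeError("head-of-queue request can never fit in the KV pool")
        # 2. One forward pass advances every running sequence by one token.
        step += 1
        for seq in running:
            seq.tokens_left -= 1
        # 3. Evict completed sequences and return their blocks to the pool.
        for seq in [s for s in running if s.tokens_left <= 0]:
            running.remove(seq)
            free_blocks += seq.kv_blocks
            finished_at[seq.name] = step
    return finished_at
```

With a pool of 4 blocks and three 2-block requests, the third request is admitted as soon as the shortest one finishes rather than waiting for the whole batch to drain, which is the property that keeps the GPU busy.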

Tensor parallelism shards each weight matrix across GPUs within an NVLink island, using a custom AllReduce kernel optimised for small message sizes typical of decode steps (the upstream NCCL collective adds latency that matters when each step lasts a few hundred microseconds). Pipeline parallelism splits layers into stages across nodes, tolerating lower interconnect bandwidth. Expert parallelism handles MoE architectures by partitioning experts across ranks. The three can be composed: a DeepSeek-V3 671B deployment might run TP=8 + PP=2 + EP=8 across two H100 SXM5 nodes connected by 400 Gb/s InfiniBand.
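To see why every tensor-parallel layer ends in an AllReduce, the sketch below shards a matrix-vector product along the input dimension and sums the per-rank partial results. It is a pure-Python illustration with hypothetical helper names; real engines run the local products as fused GEMM kernels and the sum as a GPU collective.

```python
def matvec(x, w):
    """x: [d_in], w: [d_in][d_out] -> [d_out]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def row_parallel_matvec(x, w, tp_size):
    """Shard the input dimension across tp_size ranks, then all-reduce (sum)."""
    d_in = len(x)
    if d_in % tp_size:
        raise ValueError("d_in must be divisible by tp_size")
    shard = d_in // tp_size
    partials = []
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        # Each rank holds only its slice of the weights and of the activations.
        partials.append(matvec(x[lo:hi], w[lo:hi]))
    # All-reduce: every rank ends up with the element-wise sum of all partials.
    return [sum(vals) for vals in zip(*partials)]

if __name__ == "__main__":
    x = [1, 2, 3, 4]
    w = [[1, 0], [0, 1], [2, 2], [1, 3]]
    print(matvec(x, w), row_parallel_matvec(x, w, tp_size=2))
```

Each partial vector is tiny during decode (one token per sequence), which is why the latency of the reduction matters more than its bandwidth.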

Speculative decoding integrates at the batch-manager layer. The runtime supports four families: draft-target speculation (a smaller model proposes k tokens, the target verifies in one parallel forward), Medusa heads (additional decoder heads attached to the target predict multiple future tokens), EAGLE-2 (a learned draft tree that adapts to acceptance rate), and lookahead decoding (the target generates and verifies its own n-gram candidates, with no separate draft model). On chat workloads with Llama 3 8B drafting for 70B, end-to-end latency typically drops 1.6-2.4x at unchanged output quality.
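The draft-target scheme can be sketched with greedy verification, where a drafted token is accepted only if it matches what the target would have produced. This is a minimal sketch under that assumption (production runtimes also support sampling-based acceptance); the function names and the toy "models" are hypothetical stand-ins.

```python
def speculative_step(target_next, draft_next, context, k):
    """One draft-target step; returns between 1 and k + 1 accepted tokens."""
    # Draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # Target checks each position; a real engine does this in one batched forward.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token when all k drafts are accepted
    return accepted

def generate(target_next, draft_next, prompt, n_tokens, k=4):
    """Generate n_tokens; returns (tokens, number of target verification steps)."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target_next, draft_next, out, k))
        steps += 1
    return out[len(prompt):len(prompt) + n_tokens], steps

# Toy stand-ins: the target counts upward mod 10; the draft is wrong after a 4.
def toy_target(ctx):
    return (ctx[-1] + 1) % 10

def toy_draft(ctx):
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10
```

Because every accepted token is one the target itself would have emitted, the output is identical to plain greedy decoding; only the number of target forward passes changes.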

In-flight batching — NVIDIA's term for continuous batching; admits and evicts sequences between iterations to keep tensor cores busy.
Paged KV cache — block-structured KV memory with prefix sharing, equivalent in spirit to vLLM's PagedAttention; enabled with `--paged_kv_cache enable`.
FP8 and FP4 quantisation — leverages Hopper Transformer Engine and Blackwell FP4 cores via per-tensor scaling factors learned during a calibration pass.
Speculative decoding — supports draft-target, Medusa heads, EAGLE-2 and lookahead decoding, configured at build time and tuned at runtime.
Tensor, pipeline and expert parallelism — partitions large models across NVLink islands (TP), nodes (PP) and MoE expert sets (EP).
Custom AllReduce — a latency-optimised collective for small tensor-parallel groups within an NVLink island; bypasses NCCL to cut per-step reduction latency on small messages.
Plugin system — fused kernels (`gemm_plugin`, `gpt_attention_plugin`, `rmsnorm_plugin`, `lookup_plugin`) compiled into the engine, selected per architecture.

Turn on `--paged_kv_cache enable`, `--use_paged_context_fmha enable`, `--remove_input_padding enable` and `--use_fused_mlp enable` together as your baseline. The combined uplift over a naive build on Llama-class architectures is typically 35-60 percent at the same SLO.

Reference and specifications

Every long-lived TensorRT-LLM deployment is parameterised by the `trtllm-build` command line and the runtime config passed into the batch manager. The table below is the canonical reference for the build CLI surface as of TensorRT-LLM v0.14 (June 2026). Most flags accept `enable` / `disable` literals rather than booleans; that is intentional and mirrors the TensorRT plugin naming convention. Flags not listed here are either internal tuning knobs that the defaults handle correctly or experimental features documented in the upstream reference.

Flag	Type	Default	Description
--checkpoint_dir	path	(required)	Input TRT-LLM checkpoint produced by convert_checkpoint.py.
--output_dir	path	(required)	Output directory for per-rank engine files (one .engine per GPU).
--tp_size	int	1	Tensor-parallel degree; shards each weight matrix across N ranks within an NVLink island.
--pp_size	int	1	Pipeline-parallel degree; splits layers into stages across nodes.
--max_batch_size	int	1	Hard ceiling on concurrent sequences. Baked into the engine; cannot be raised at runtime.
--max_input_len	int	1024	Maximum prompt length in tokens. Drives KV cache and activation memory sizing.
--max_output_len	int	1024	Maximum generated length per sequence.
--max_num_tokens	int	auto	Tokens per iteration ceiling; controls prefill / decode mix under in-flight batching.
--max_beam_width	int	1	Maximum beam width for beam search; set to 1 for sampling-only workloads.
--gemm_plugin	string	auto	auto | fp16 | bf16 | fp8 | fp4 | int8 | int4_weight_only. Selects the fused matmul kernel.
--gpt_attention_plugin	string	auto	auto | fp16 | bf16 | fp8. Selects the fused attention kernel.
--use_fp8	bool	false	Enables FP8 weights and activations via the Transformer Engine (Hopper / Blackwell).
--use_fp8_context_fmha	enable/disable	disable	Runs the prefill attention kernel in FP8 — major Hopper throughput lever.
--use_paged_context_fmha	enable/disable	disable	Pages the prefill attention to support sequences longer than the batched-tokens budget.
--paged_kv_cache	enable/disable	enable	Block-structured KV cache with prefix sharing. Effectively mandatory for production.
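--kv_cache_free_gpu_mem_fraction	float	0.9	Fraction of free GPU memory dedicated to the KV pool after weights and activations. Applied at engine load through the Triton config.pbtxt rather than parsed by trtllm-build; listed here because it drives sizing.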
--kv_cache_type	string	paged	paged | continuous. Use paged unless instructed otherwise.
--use_inflight_batching	enable/disable	enable	Enables continuous (iteration-level) batching in the C++ batch manager.
--remove_input_padding	enable/disable	enable	Packs variable-length prompts into a single tensor; large prefill speedup.
--use_fused_mlp	enable/disable	disable	Fuses gate / up projections in SwiGLU MLPs; ~10-15 percent speedup on Llama-class models.
--multi_block_mode	enable/disable	enable	Splits long-context attention across multiple SM blocks; improves long-prompt prefill.
--enable_xqa	enable/disable	auto	Enables XQA (cross-query attention) kernel for GQA models at decode.
--speculative_decoding_mode	string	none	none | draft_target | medusa | eagle | lookahead. Selects the speculation strategy.
--max_draft_len	int	0	Maximum tokens per speculative step; pair with the draft-model engine.
--medusa_num_heads	int	0	Number of Medusa heads attached to the target model.
--workers	int	1	Parallel build workers; set to the number of engine ranks (TP x PP) for fastest engine compilation.
--profiling_verbosity	string	layer_names_only	none | layer_names_only | detailed. Higher values inflate engine size.
--strongly_typed	enable/disable	enable	Strongly-typed network mode required for FP8 / FP4 builds.
--logits_dtype	string	float32	float16 | float32. float32 logits cost memory but improve sampling stability.
--gather_generation_logits	enable/disable	disable	Required for log-prob outputs; disable unless needed.
--lora_plugin	string	(off)	Enables multi-LoRA at runtime; pair with --max_lora_rank and --lora_target_modules.
--max_lora_rank	int	8	Maximum supported LoRA rank; 64 is the practical ceiling.

Runtime configuration (batch scheduler policy, max queue depth, chunked context size) is passed via the Triton `config.pbtxt` for the `tensorrtllm_backend` model, not via `trtllm-build`. Build-time flags shape the engine; runtime flags shape the batch manager. Mixing the two up is the most common operator error.

Workload patterns

Three workload shapes cover the bulk of TensorRT-LLM production deployments: high-throughput chat on a 70B-class model with in-flight batching, long-context summarisation with paged context attention, and ultra-low-latency endpoints accelerated by EAGLE-2 speculative decoding. Each has its own preferred build configuration and Triton runtime policy. These are also the three promotion paths Yobibyte automates when a customer upgrades a workload from the vLLM default to the TensorRT-LLM opt-in variant — the build matrix, engine validation against the vLLM baseline and rollout behind the same OpenAI-compatible URL are what a team running raw upstream on Kubernetes signs up to operate themselves.

Pattern A: Llama 3.1 70B FP8 TP=4 chat endpoint with in-flight batching. Targets p50 TTFT under 180 ms and 6,000+ sustained output tok/s on a single 4-GPU H100 SXM5 node.
Pattern B: long-context endpoint with a 128K context window using paged context FMHA. Trades peak throughput for the ability to process 100K-token prompts without blowing the activation budget.
Pattern C: speculative decoding via EAGLE-2 for chat workloads where p99 inter-token latency matters more than raw throughput. Pairs a small EAGLE head with the target 70B engine.

bash

# A — Llama 3 70B FP8 TP=4, in-flight batching, chat-shaped
trtllm-build \
    --checkpoint_dir ./ckpt/llama3-70b-fp8-tp4 \
    --output_dir ./engines/llama3-70b-chat \
    --gemm_plugin fp8 \
    --use_fp8_context_fmha enable \
    --paged_kv_cache enable \
    --use_inflight_batching enable \
    --remove_input_padding enable \
    --use_fused_mlp enable \
    --max_batch_size 128 \
    --max_input_len 8192 \
    --max_output_len 2048 \
    --max_num_tokens 16384 \
    --workers 4
# Runtime, not build: set kv_cache_free_gpu_mem_fraction (0.92 here) in the Triton config.pbtxt.

# B — 128K long-context with paged context FMHA (H200 preferred)
trtllm-build \
    --checkpoint_dir ./ckpt/llama3-70b-fp8-tp4 \
    --output_dir ./engines/llama3-70b-128k \
    --gemm_plugin fp8 \
    --use_fp8_context_fmha enable \
    --use_paged_context_fmha enable \
    --paged_kv_cache enable \
    --multi_block_mode enable \
    --max_batch_size 16 \
    --max_input_len 131072 \
    --max_output_len 4096 \
    --max_num_tokens 8192
# Runtime, not build: set kv_cache_free_gpu_mem_fraction (0.95 here) in the Triton config.pbtxt.

# C — EAGLE-2 speculative decoding for low p99 latency
# C1. Convert the EAGLE-2 draft head checkpoint
python examples/eagle/convert_checkpoint.py \
    --model_dir ./llama-3-eagle-head \
    --output_dir ./ckpt/llama3-70b-eagle \
    --tp_size 4

# C2. Build the target engine with EAGLE support
trtllm-build \
    --checkpoint_dir ./ckpt/llama3-70b-fp8-tp4 \
    --output_dir ./engines/llama3-70b-eagle-target \
    --gemm_plugin fp8 \
    --use_fp8_context_fmha enable \
    --paged_kv_cache enable \
    --use_inflight_batching enable \
    --speculative_decoding_mode eagle \
    --max_draft_len 5 \
    --max_batch_size 64 \
    --max_input_len 4096 \
    --max_output_len 1024

Pattern B's KV cache cost scales linearly with both context length and concurrency, so the two multiply. For Llama 3.1 70B with FP8 KV (80 layers, 8 KV heads, head dimension 128), each token costs 2 x 80 x 8 x 128 bytes = 160 KiB, so 16 concurrent sequences that each fill the 128K window need roughly 340 GB of KV. Keep `max_batch_size` modest and prefer H200 (141 GB per GPU) over H100 (80 GB) for long-context endpoints.
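The KV budget for a long-context endpoint can be estimated before building anything. The sketch below computes an upper bound in which every sequence fills the full window; the Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128) is the published architecture, while the 70 GB weight footprint and 20 GB activation working set are illustrative assumptions, and the helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """KV bytes stored per token; the factor 2 covers the separate K and V tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def kv_cache_gb(seq_len, concurrency, bytes_per_token):
    """Worst-case KV footprint in GB (1e9 bytes) when every sequence is full."""
    return seq_len * concurrency * bytes_per_token / 1e9

def max_full_sequences(total_gpu_gb, weights_gb, activations_gb,
                       free_fraction, seq_len, bytes_per_token):
    """How many full-length sequences fit in the KV pool left after weights."""
    pool_bytes = free_fraction * (total_gpu_gb - weights_gb - activations_gb) * 1e9
    return int(pool_bytes // (seq_len * bytes_per_token))

LLAMA_70B_FP8_KV = kv_bytes_per_token(num_layers=80, num_kv_heads=8,
                                      head_dim=128, bytes_per_elem=1)

if __name__ == "__main__":
    print(LLAMA_70B_FP8_KV)                                  # bytes per token
    print(kv_cache_gb(131072, 16, LLAMA_70B_FP8_KV))         # GB for 16 full sequences
    # Assumed footprints: 70 GB of FP8 weights, 20 GB of activations.
    print(max_full_sequences(4 * 141, 70, 20, 0.95, 131072, LLAMA_70B_FP8_KV))  # 4x H200
    print(max_full_sequences(4 * 80, 70, 20, 0.95, 131072, LLAMA_70B_FP8_KV))   # 4x H100
```

Real traffic rarely fills every window, so the paged pool usually supports more concurrent sequences than this bound suggests; it is still the number to check before raising `max_batch_size`.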

Sizing and capacity planning

TensorRT-LLM throughput is bounded first by KV-cache memory, then by tensor-core FLOPs, then by NVLink AllReduce bandwidth at TP > 4. The planning model below assumes Llama-family architectures with grouped-query attention, FP8 weights and KV, and the canonical chat / RAG / long-context / batch mix at 4K input / 256 output. Tokens-per-second figures are mid-range observed values from InferenceBench v3 sustained runs; treat them as planning anchors, not contractual.

Engine builds are the operational cost most teams under-estimate. A 70B FP8 TP=4 engine takes 15-25 minutes to build on the same hardware it serves; long-context variants can take 45 minutes. CI pipelines that maintain a heterogeneous fleet (H100 SXM5, H200, B200) need a build matrix per (model, precision, TP, max_input, max_output, GPU SM, TRT-LLM version). Plan for the build farm capacity, the registry storage, and the rebuild churn that follows every CUDA / driver upgrade.
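The size of the build matrix, and the build-farm time it implies, can be estimated by enumerating the combinations. This is a rough planning sketch: the dimension values, the shape-profile names and the per-profile build minutes are illustrative assumptions taken from the ranges quoted above, and the helpers are hypothetical.

```python
from itertools import product

MATRIX_KEYS = ("model", "precision", "tp", "shape", "gpu_sm", "trtllm_version")

def build_matrix(models, precisions, tp_sizes, shapes, gpu_sms, versions):
    """Enumerate one build job per combination of engine-defining dimensions."""
    combos = product(models, precisions, tp_sizes, shapes, gpu_sms, versions)
    return [dict(zip(MATRIX_KEYS, combo)) for combo in combos]

def total_build_minutes(matrix, minutes_by_shape):
    """Sum the estimated build time, keyed by each job's shape profile."""
    return sum(minutes_by_shape[job["shape"]] for job in matrix)

if __name__ == "__main__":
    jobs = build_matrix(
        models=["llama-3.1-70b"],
        precisions=["fp8"],
        tp_sizes=[4],
        shapes=["chat", "long-context"],   # hypothetical shape-profile names
        gpu_sms=["sm90", "sm100"],         # assumed: Hopper and Blackwell targets
        versions=["0.14.0"],
    )
    # Assumed build times: 25 min for chat shapes, 45 min for long-context.
    print(len(jobs), total_build_minutes(jobs, {"chat": 25, "long-context": 45}))
```

Adding a second model or a second TensorRT-LLM version doubles the job count, which is why the rebuild churn after a toolchain upgrade is worth budgeting explicitly.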

Workload	Model	Recommended SKU	Concurrency	Output tok/s	Notes
Chat, low latency	Llama 3.1 8B FP8	1x H100 SXM5 80GB	64-128	6,400-8,200	TP=1, in-flight batching, fused MLP.
Chat, balanced	Llama 3.1 70B FP8	4x H100 SXM5	128-256	5,800-6,800	TP=4, p50 TTFT 160-180 ms.
Chat, high QPS	Llama 3.1 70B FP8	8x H100 SXM5	256-512	9,200-12,500	TP=8 within NVLink island.
Long context (128K)	Llama 3.1 70B FP8	4x H200 141GB	16-32	2,200-3,400	Paged context FMHA, multi-block mode.
MoE chat	Mixtral 8x22B FP8	8x H100 SXM5	192-384	7,800-10,200	TP=8 with expert parallelism.
MoE chat	DeepSeek-V3 671B FP8	16x H100 SXM5 (2 nodes)	256-512	4,200-6,400	TP=8 + PP=2 + EP=8, 400Gb IB.
EAGLE-2 speculative	Llama 3.1 70B FP8 + EAGLE	4x H100 SXM5	64-192	9,600-13,800	1.7x uplift over non-speculative.
Blackwell next-gen	Llama 3.1 70B FP4	4x B200	256-512	12,400-16,800	FP4 weights, FP8 KV, second-gen TE.
Offline batch	Llama 3.1 70B FP8	4x H100 SXM5	1024+	13,500-18,000	Disable streaming, max_batch 1024.
Edge inference	Llama 3.1 8B INT4 AWQ	1x L40S 48GB	16-32	1,800-2,400	AWQ INT4, FP16 KV.

When the same model needs to serve both short-interactive and long-context traffic, build two engines (one with `max_input_len=8192`, one with `max_input_len=131072`), host both behind Triton, and route at the gateway by prompt length. Single oversized engines waste KV budget on every request.
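Routing by prompt length at the gateway amounts to picking the smallest engine whose `max_input_len` covers the prompt. The sketch below assumes the gateway already knows the prompt's token count; the Triton model names and the `pick_engine` helper are hypothetical.

```python
# (max_input_len, hypothetical Triton model name)
ENGINES = [
    (8192, "llama3_70b_chat"),
    (131072, "llama3_70b_128k"),
]

def pick_engine(prompt_tokens, engines=ENGINES):
    """Return the smallest engine that can hold the prompt."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    for max_input_len, name in sorted(engines):
        if prompt_tokens <= max_input_len:
            return name
    raise ValueError(f"prompt of {prompt_tokens} tokens exceeds every engine's max_input_len")

if __name__ == "__main__":
    for n in (500, 8192, 8193):
        print(n, pick_engine(n))
```

Rejecting over-length prompts at the gateway is cheaper than letting the engine refuse them after the request has already queued.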

Limits and quotas

TensorRT-LLM enforces hard limits at the engine boundary (baked in at build time) and soft limits at the runtime batch manager (configurable per Triton model instance). Operational ceilings (GPU memory, NCCL groups, file descriptors, /dev/shm) come from the host OS and CUDA runtime, identical to any other CUDA workload.

Limit	Default	Hard ceiling	How to raise
max_batch_size	1	Memory-bounded (typically 256-512)	Rebuild engine with higher --max_batch_size; revalidate KV budget.
max_input_len	1024	RoPE-limited (e.g. 128K Llama 3.1)	Rebuild with higher --max_input_len; pair with rope scaling and paged context FMHA.
max_output_len	1024	Memory-bounded	Rebuild with higher --max_output_len; budget KV growth.
max_num_tokens	auto	Activation memory	Rebuild with higher --max_num_tokens; raises prefill throughput cap.
max_beam_width	1	8 in practice	Rebuild; beam search cost grows linearly.
max_lora_rank	8	64	Rebuild with higher --max_lora_rank; activation matmul cost rises ~5 percent.
TP size (intra-node)	1	8 (NVLink)	Bounded by GPUs per NVLink island.
PP size (cross-node)	1	~32 in practice	Bounded by pipeline-bubble overhead and IB topology.
EP size (MoE)	1	Total expert count	Bounded by model architecture (e.g. 8 for Mixtral).
Engine binary size	n/a	~Half free GPU memory	Profile with --profiling_verbosity layer_names_only.
Concurrent inference requests / engine	max_batch_size + queue	Memory-bounded	Scale by adding Triton model instances or replicas.
Shared memory (NCCL)	/dev/shm	Container-defined	Mount /dev/shm >= 1 GB per worker; required for TP > 1.
File descriptors	1024	ulimit	ulimit -n 65536 inside the Triton container.

Every engine-level limit in this list (the max_* values and the TP / PP / EP layout) requires an engine rebuild to change; only runtime queue depth, replica count and the host-level settings (/dev/shm, file descriptors) can be adjusted without one. That is the central operational trade-off of TensorRT-LLM: build cost up front in exchange for runtime determinism. Plan capacity with a safety margin or accept a rebuild on every limit change.

Observability

TensorRT-LLM does not expose Prometheus metrics directly; observability flows through the Triton Inference Server hosting the engine. Triton's `/metrics` endpoint emits standard request counters, latency histograms and GPU utilisation per model instance, with the `tensorrtllm_backend` adding TRT-LLM-specific gauges for KV cache utilisation, paused requests, and active in-flight batch size. For deeper kernel-level analysis, NVTX markers embedded in the runtime allow Nsight Systems to capture per-iteration traces showing prefill, attention, MLP and AllReduce phases in microsecond detail.

The metrics worth alerting on in production are: time-to-first-token p95, inter-token latency p95, KV cache utilisation, paused request count (sequences evicted from the running batch), Triton inference queue duration, and GPU SM occupancy from DCGM. The following Prometheus rules cover the common failure modes for a TRT-LLM deployment behind Triton.

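nv_inference_request_duration_us — cumulative Triton request time in microseconds, including queueing; a counter, so divide its rate by the request rate to get a mean latency.
nv_inference_queue_duration_us — cumulative time in microseconds that requests spent waiting before scheduling; also a counter.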
nv_trt_llm_kv_cache_block_fraction — fraction of paged KV blocks in use; 0.95+ means capacity headroom is gone.
nv_trt_llm_paused_requests — sequences evicted from the running batch under KV pressure; non-zero is a capacity smell.
nv_trt_llm_active_request_count — current in-flight batch size; should track max_batch_size at steady-state load.
nv_trt_llm_time_to_first_token_ms — prefill latency; correlate with prompt-length histogram.
nv_trt_llm_inter_token_latency_ms — decode latency; should approach 1 / theoretical tok/s when batch is full.
DCGM_FI_PROF_SM_OCCUPANCY and DCGM_FI_DEV_GPU_UTIL — pair with TRT-LLM metrics to distinguish compute vs memory vs idle bottlenecks.

yaml

# Prometheus rules for a TensorRT-LLM deployment behind Triton
groups:
  - name: trtllm-sla
    interval: 30s
    rules:
      - alert: TRTLLMHighTimeToFirstToken
        expr: histogram_quantile(0.95,
                sum by (le, model) (
                  rate(nv_trt_llm_time_to_first_token_ms_bucket[5m]))) > 400
        for: 5m
        labels: { severity: warning, team: inference }
        annotations:
          summary: "TRT-LLM TTFT p95 above 400ms on {{ $labels.model }}"

      - alert: TRTLLMKVCachePressure
        expr: nv_trt_llm_kv_cache_block_fraction > 0.95
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Paged KV cache >95 percent full — pauses imminent"

      - alert: TRTLLMPausedRequestsSpike
        expr: increase(nv_trt_llm_paused_requests[5m]) > 10
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Pause rate climbing — capacity insufficient or runaway request"

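      # nv_inference_queue_duration_us is a cumulative counter, not a histogram,
      # so this rule alerts on mean queue time per successful request.
      - alert: TRTLLMQueueBuildup
        expr: sum by (model) (rate(nv_inference_queue_duration_us[5m]))
                / sum by (model) (rate(nv_inference_request_success[5m])) > 500000
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Triton mean queue time above 500ms: under-provisioned or stuck batch"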

      # DCGM and Triton series carry different labels, so join with on().
      - alert: TRTLLMSMUnderutilised
        expr: avg_over_time(DCGM_FI_PROF_SM_OCCUPANCY[5m]) < 0.30
              and on() (sum(rate(nv_inference_request_success[5m])) > 0)
        for: 15m
        labels: { severity: info }
        annotations:
          summary: "GPU SM occupancy under 30 percent while serving: investigate pipeline bubble or kernel selection"

Capture an Nsight Systems trace (`nsys profile -t cuda,nvtx,osrt tritonserver ...`) on a representative production run once per release. The NVTX markers around prefill, attention, MLP and AllReduce make the source of any latency regression obvious within minutes — kernel selection, NCCL collective, or Python overhead in the Triton frontend.
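Triton's request and queue duration metrics are cumulative counters, so a latency figure comes from the difference between two scrapes rather than from a single reading. The sketch below derives mean queue time per request from two `/metrics` snapshots; it assumes the plain Prometheus text format without trailing timestamps, and the helper names are hypothetical.

```python
def scrape_value(metrics_text, metric_name):
    """Sum all samples of metric_name across label sets in Prometheus text format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP / TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.split("{", 1)[0] == metric_name:
            total += float(value)
    return total

def mean_queue_ms(prev_text, curr_text):
    """Mean queue time per successful request between two scrapes, in milliseconds."""
    d_queue_us = (scrape_value(curr_text, "nv_inference_queue_duration_us")
                  - scrape_value(prev_text, "nv_inference_queue_duration_us"))
    d_requests = (scrape_value(curr_text, "nv_inference_request_success")
                  - scrape_value(prev_text, "nv_inference_request_success"))
    if d_requests <= 0:
        return 0.0  # no completed requests in the window
    return d_queue_us / d_requests / 1000.0
```

This is the same calculation a PromQL expression performs when it divides the rate of the duration counter by the rate of the success counter.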

Cost and FinOps

TensorRT-LLM cost economics are dominated by three levers: GPU rental rate (identical to vLLM on the same hardware), achieved tokens-per-second-per-GPU (typically 1.4-1.8x vLLM at matched latency), and engine-build amortisation. The 40 percent lower $/M tokens versus vLLM holds for stable production models where the engine survives weeks or months; for rapidly rotating models the build-and-validate cost erodes the saving. The table below uses Yobitel UK list pricing (June 2026) and InferenceBench v3 throughput anchors at 4K input / 256 output.

TensorRT-LLM wins versus vLLM on $/M tokens for any model that lives in production unchanged for more than two weeks. Below that horizon, vLLM's build-free deployment loop typically wins on total cost of ownership.
FP8 weights + FP8 KV + FP8 context FMHA is the highest-impact $/M-tokens lever on Hopper; BF16 is roughly 1.7x more expensive at the same SLO.
Engine-build cost on a 70B FP8 TP=4 model is ~25 minutes on the same hardware that serves it. Build once, amortise across the engine lifetime; do not rebuild for trivial config changes.
FOCUS-conformant billing exports from Yobitel include `inference_engine=tensorrt-llm`, `model_name` and `engine_build_id` resource tags so $/M tokens can be sliced by tenant, model and engine generation.
Spot capacity is harder to operate with TRT-LLM than with vLLM: engine load takes 30-90 seconds on a fresh GPU, so pre-emption costs more wall-clock time. Reserve a small on-demand floor for spot-pre-empted traffic.

Configuration	GPU rate ($/h)	Sustained tok/s	$/M output tokens	Notes
1x H100 SXM5, Llama 3.1 8B FP8	$3.20	7,200	$0.12	TP=1, in-flight batching.
4x H100 SXM5, Llama 3.1 70B FP8	$12.40	6,200	$0.56	TP=4, fused MLP, FP8 context FMHA.
8x H100 SXM5, Llama 3.1 70B FP8	$24.80	11,400	$0.60	TP=8, intra-NVLink.
4x H200, Llama 3.1 70B 128K ctx	$16.80	2,800	$1.67	Long context tax; paged context FMHA.
4x B200, Llama 3.1 70B FP4	$22.00	14,800	$0.41	Blackwell FP4, second-gen TE.
4x H100, EAGLE-2 speculative	$12.40	10,800	$0.32	1.74x vs non-speculative same SKU.
4x H100 spot, Llama 3.1 70B	$6.20	5,000	$0.34	Spot interruption averaged in; engine load time per restart.
Hosted SaaS reference (GPT-4o mini class)	n/a	n/a	$0.60	List API price; comparison only.
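The $/M output tokens column follows directly from the hourly rate and the sustained throughput. The short sketch below reproduces that arithmetic so the table can be re-derived for other rates; the function name is hypothetical and the inputs are the figures quoted in this entry.

```python
def usd_per_million_tokens(gpu_rate_per_hour, tokens_per_second):
    """Cost of one million output tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_rate_per_hour / tokens_per_hour * 1_000_000

if __name__ == "__main__":
    trtllm = usd_per_million_tokens(12.40, 6200)  # 4x H100 SXM5, Llama 3.1 70B FP8
    vllm = usd_per_million_tokens(12.40, 3800)    # same node at the vLLM throughput quoted above
    print(round(trtllm, 2), round(vllm, 2), round(1 - trtllm / vllm, 2))
```

At an identical GPU rate the saving depends only on the throughput ratio, which is why the figure holds only while the engine stays in service long enough to amortise its build.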

Security and compliance

TensorRT-LLM inherits Triton Inference Server's security surface: bearer-token or mTLS auth at the gateway (Envoy, NGINX, AWS ALB), per-tenant quotas enforced upstream of Triton, and network isolation following the standard pattern — inference pods have no public-internet egress, engines are pulled from a private artefact registry, and per-replica NetworkPolicy locks ingress to the gateway service account. Engine binaries are immutable artefacts; signing them at build time and verifying signatures at load time is the recommended supply-chain control.

Hardware-rooted confidentiality flows through NVIDIA Confidential Compute on H100, H200 and B200: the same TEE mode that protects training workloads protects TRT-LLM inference. Weights and KV data are encrypted in transit between CPU and GPU and held inside the GPU's protected memory region, where the host kernel and even the hypervisor cannot read them. Yobitel sovereign tenancies in London-1 and Frankfurt-1 enable Confidential Compute by default for regulated workloads; the latency cost is roughly 3-5 percent versus non-confidential mode at matched throughput.

Regulatory implications are model- and data-specific. For UK public-sector workloads the canonical path is Yobitel sovereign tenancies operating under NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat; the engine itself is a control-plane component on those tenancies. For EU GDPR, the engine holds prompt and completion data only in volatile GPU memory, and the on-disk engine binary contains weights rather than user data; ensure any logged inputs are masked at the Triton frontend. For US HIPAA workloads, run inside a BAA-covered VPC and disable request logging at the Triton layer; for FedRAMP-equivalent profiles, pin to the FIPS-validated CUDA build and use NIAP-approved cipher suites at the ingress.

Multi-tenant TRT-LLM deployments share the paged KV cache and prefix-cache hit table across tenants by default. If tenants must not see each other's system prompts via any side-channel, host one Triton model instance per tenant or disable prefix sharing at the batch-manager configuration and accept the throughput hit.

Migration and alternatives

Most production migrations to TensorRT-LLM come from one of three origins: vLLM (chasing the last 30-40 percent of throughput on a stabilised model), raw HuggingFace `transformers.generate()` (often a 10-15x uplift), or a hosted SaaS API (when sovereignty, custom models or unit-cost economics demand self-hosting). The migration effort is dominated not by code but by engine-build CI, kernel autotuning per SKU, and validation that the compiled engine produces output equivalent to the source checkpoint within tolerance.
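Output validation can start with a token-level comparison between the source checkpoint's outputs and the compiled engine's outputs on a fixed prompt set. The sketch below assumes greedy decoding (temperature 0) on both sides and already-tokenised outputs; the 0.98 threshold and the helper names are hypothetical, and quantised engines may need a looser tolerance or a task-level metric.

```python
def token_match_rate(baseline, candidate):
    """Fraction of positions where two token sequences agree, over the longer length."""
    n = max(len(baseline), len(candidate))
    if n == 0:
        return 1.0
    same = sum(1 for a, b in zip(baseline, candidate) if a == b)
    return same / n

def validate_engine(baseline_outputs, engine_outputs, min_match=0.98):
    """Return (passed, worst_prompt_id, worst_rate) across all baseline prompts."""
    worst_id, worst_rate = None, 1.0
    for prompt_id, baseline in baseline_outputs.items():
        # A prompt missing from the engine's outputs counts as a full mismatch.
        rate = token_match_rate(baseline, engine_outputs.get(prompt_id, []))
        if rate < worst_rate:
            worst_id, worst_rate = prompt_id, rate
    return worst_rate >= min_match, worst_id, worst_rate
```

Reporting the worst prompt rather than the average makes regressions on a single long or unusual prompt visible instead of being diluted by the rest of the set.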

If you are already running vLLM on Kubernetes with the OpenAI-compatible API surface, the migration target is a Triton deployment with the OpenAI frontend enabled. The code block below shows the equivalent deployments side by side; you can roll TRT-LLM behind the same `Service` and shift traffic at the gateway once the engine build pipeline is green.

From	Migration effort	Throughput change	Operational notes
HuggingFace transformers.generate	Medium — build pipeline + API swap	10-15x faster	Eliminates Python serving loop; requires CI for engine builds.
vLLM	Medium — engine compile + Triton	1.4-1.8x faster	Gain min-latency; lose fast model rotation.
TGI (Text Generation Inference)	Medium — same OpenAI API	1.3-1.6x faster	Same compile-time discipline as TRT-LLM; gain NVIDIA kernel depth.
SGLang	Medium — compile + Triton	Comparable at chat; TRT-LLM wins long context	Lose RadixAttention prefix sharing; gain absolute-min latency.
OpenAI / Bedrock / Anthropic API	High — model substitution + ops	Variable	Gain control, sovereignty; absorb engine-build CI overhead.
NVIDIA NeMo Inference	Low — same stack family	Comparable	TRT-LLM is the NeMo Inference engine under the hood.

bash

# TRT-LLM behind Triton on Kubernetes with NVIDIA GPU Operator
kubectl apply -f - <<'YAML'
apiVersion: apps/v1
kind: Deployment
metadata: { name: llama3-70b-trtllm }
spec:
  replicas: 2
  selector: { matchLabels: { app: llama3-70b-trtllm } }
  template:
    metadata: { labels: { app: llama3-70b-trtllm } }
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
          args:
            - "tritonserver"
            - "--model-repository=/models"
            - "--grpc-port=8001"
            - "--http-port=8000"
            - "--metrics-port=8002"
          resources:
            limits: { nvidia.com/gpu: 4 }
          ports:
            - { containerPort: 8000, name: http }
            - { containerPort: 8001, name: grpc }
            - { containerPort: 8002, name: metrics }
          volumeMounts:
            - { name: engines, mountPath: /models }
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: engines
          persistentVolumeClaim: { claimName: llama3-70b-engines }
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 8Gi }
YAML

# Equivalent vLLM deployment for the same model (for migration comparison)
# vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
#     --tensor-parallel-size 4 --quantization fp8 --max-model-len 32768 \
#     --enable-prefix-caching --enable-chunked-prefill --port 8000

# Equivalent on AWS (bare p5 with NGC TRT-LLM container)
aws ec2 run-instances \
    --instance-type p5.48xlarge \
    --image-id "$(aws ec2 describe-images --owners amazon \
        --filters 'Name=name,Values=Deep Learning OSS Nvidia Driver AMI GPU PyTorch*' \
        --query 'sort_by(Images,&CreationDate)[-1].ImageId' --output text)" \
    --user-data "$(cat <<'EOF'
#!/bin/bash
# Docker's default /dev/shm is too small for NCCL when TP > 1, so size it explicitly
docker run --gpus all --shm-size=8g -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v /opt/engines:/models \
    nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3 \
    tritonserver --model-repository=/models
EOF
)"

Migration teams routinely under-plan two costs: the engine-build CI farm (allocate GPUs of the target SKU for builds, not just for serving) and the output-equivalence validation suite (a few thousand prompts replayed through both engines, compared at a configurable token-level divergence threshold). Budget both before committing to the migration.
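As a rough illustration of the output-equivalence suite, the sketch below compares token IDs from a baseline run and a candidate run prompt by prompt and reports how many diverge too early. It assumes greedy decoding (temperature 0) on both sides and that you have already collected the generated token IDs. `first_divergence`, `compare_runs` and `min_matched_tokens` are hypothetical names, and the threshold is a placeholder to tune per workload.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Index of the first differing token, or None if the sequences are identical."""
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return index
    if len(a) != len(b):
        return min(len(a), len(b))  # one output is a strict prefix of the other
    return None


@dataclass
class EquivalenceReport:
    total: int
    diverged: int
    mean_matched_prefix: float

    @property
    def divergence_rate(self) -> float:
        return self.diverged / self.total if self.total else 0.0


def compare_runs(
    baseline: List[Sequence[int]],
    candidate: List[Sequence[int]],
    min_matched_tokens: int = 32,
) -> EquivalenceReport:
    """Flag a prompt as diverged when outputs split before min_matched_tokens."""
    if len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same prompts in the same order")
    diverged = 0
    matched_total = 0
    for base_tokens, cand_tokens in zip(baseline, candidate):
        index = first_divergence(base_tokens, cand_tokens)
        matched = len(base_tokens) if index is None else index
        matched_total += matched
        if index is not None and index < min_matched_tokens:
            diverged += 1
    total = len(baseline)
    mean_matched = matched_total / total if total else 0.0
    return EquivalenceReport(total, diverged, mean_matched)


# Tiny worked example with made-up token IDs.
baseline_outputs = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 9]]
candidate_outputs = [[1, 2, 3, 4], [5, 6, 0, 8], [9, 9, 9]]
report = compare_runs(baseline_outputs, candidate_outputs, min_matched_tokens=3)
```

Exact token equality is a strict criterion. FP8 engines can legitimately split from a BF16 baseline late in a long generation, which is why the check counts only early divergence and leaves the threshold configurable.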

Troubleshooting#

The error table below covers the failure modes that account for roughly 80 percent of production TensorRT-LLM incidents observed on Yobitel-operated fleets and the upstream community tracker. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix; for build-time errors the fix usually means a fresh `trtllm-build` run rather than a runtime config change.

Symptom / Error	Cause	Fix
Engine load fails after driver upgrade with 'Plugin not found'	Plugin library ABI changed across CUDA / TRT versions.	Rebuild engines against the new TRT-LLM container; pin driver and TRT-LLM versions in CI.
torch.cuda.OutOfMemoryError on first request	kv_cache_free_gpu_mem_fraction too high; the KV pool leaves too little memory for activations.	Lower the fraction (for example to 0.88); verify engine activation working set; ensure no other CUDA process on GPU.
NCCL hang on Triton startup with TP > 1	/dev/shm too small or NVIDIA_VISIBLE_DEVICES misordered.	Mount /dev/shm >= 8 GB; pin device ordering; set NCCL_DEBUG=INFO; verify NVLink topology with nvidia-smi topo -m.
Very slow first token after pod restart	Engine load and CUDA-graph capture on cold start.	Expected for first 30-90 s; pre-warm with a synthetic request before flipping traffic at the gateway.
FP8 accuracy regression vs FP16 baseline	Calibration set unrepresentative of production traffic.	Recalibrate on a 512-sample slice of real prompts; rebuild engine; revalidate against the equivalence suite.
Paused requests climbing in steady state	max_batch_size too high for KV-cache budget at observed prompt lengths.	Rebuild engine with lower max_batch_size, or raise kv_cache_free_gpu_mem_fraction; never exceed 0.95.
TTFT p95 spikes when long prompts arrive	Long prefill monopolises a forward pass; chunked context not enabled.	Rebuild with --use_paged_context_fmha enable, turn on chunked context in the Triton backend config (enable_chunked_context), and tune --max_num_tokens to 8192-16384.
Throughput drops after upgrading TRT-LLM minor version	Kernel autotuner picked a regression path.	Pin plugin selection explicitly instead of relying on auto (for example --gemm_plugin fp8, plus an explicit --gpt_attention_plugin dtype that your version accepts); capture Nsight trace; file upstream issue.
EAGLE-2 throughput worse than non-speculative	Draft acceptance rate too low for the workload.	Halve --max_draft_len; switch draft to EAGLE head trained on closer distribution; measure accept rate via metric.
Multi-node deployment never reaches steady state	Pipeline bubble too large or NCCL over InfiniBand misconfigured.	Lower --pp_size; set NCCL_IB_HCA, NCCL_SOCKET_IFNAME; verify GPUDirect RDMA with nccl-tests.
LoRA adapter activation latency dominates	--max_lora_rank set too high, or too many adapters resident.	Cap rank at 16-32 on H100; benchmark adapter activation matmul; consider per-tenant engine if uplift is large.
Engine binary refuses to load on a different GPU SKU	Engine is pinned to (model, precision, TP, GPU SM, TRT version).	Rebuild against the target SKU; engines do not cross H100 -> B200 or H100 -> L40S.
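The cold-start row above recommends pre-warming before the gateway flips traffic. A minimal sketch of that gate follows, assuming Triton's standard `/v2/health/ready` endpoint and the ensemble `generate` endpoint. The helper names, the payload values and the 180-second deadline are illustrative assumptions, so adjust them to your model repository.

```python
import json
import time
import urllib.request
from typing import Callable


def http_ready(base_url: str, timeout_s: float = 2.0) -> bool:
    """True once Triton reports ready on its health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts and refused connections
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    deadline_s: float = 180.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it succeeds or the deadline passes."""
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)


def send_warmup(base_url: str, model: str = "ensemble", timeout_s: float = 120.0) -> int:
    """Send one synthetic request so engine load and graph capture happen off the hot path."""
    payload = json.dumps({"text_input": "warm-up", "max_tokens": 8}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v2/models/{model}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return resp.status


def prewarm(base_url: str) -> bool:
    """Gate for a rollout script: ready check first, then a single warm-up generation."""
    if not wait_until_ready(lambda: http_ready(base_url)):
        return False
    return send_warmup(base_url) == 200
```

A rollout job could call `prewarm("http://localhost:8000")` against each new pod and only register the pod with the gateway when it returns `True`.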

Where this fits in the Yobitel stack#

TensorRT-LLM is the opt-in low-latency variant inside Yobibyte, Yobitel's AI-native platform. Models in the Yobibyte catalogue land first as vLLM-backed endpoints — the fast path to a working API — and a workload can be promoted to a TensorRT-LLM variant once the model, precision and shape have stabilised in production. The promotion runs through Yobibyte's managed engine-build pipeline: customers point at the model and target SLO, the platform compiles the engine on the right SKU, validates output equivalence against the vLLM baseline, and rolls the new engine behind the same OpenAI-compatible URL. No build flags, registries or Triton config to manage by hand.

Omniscient Compute scores every TensorRT-LLM engine release continuously against vLLM and SGLang on InferenceBench across H100 SXM5, H200 and B200 tenancies. On MI300X tenancies (available through the Yobitel-AMD partnership) only vLLM and SGLang are scored, because TensorRT-LLM is NVIDIA-only. The benchmark mix covers chat (4K input / 256 output), RAG (16K shared prefix / 512 output), long context (96K input / 4K output) and offline batch. Customers see live $/M tokens, p50 / p95 latency and sustained tok/s for every supported model in their region, so the choice between vLLM and TensorRT-LLM becomes a numbers-driven trade-off rather than a vendor pitch.
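The $/M tokens figure is simple arithmetic on the hourly GPU rate and sustained throughput. The sketch below works it through using the TL;DR throughput numbers (about 6,200 tok/s for TensorRT-LLM versus about 3,800 tok/s for vLLM on 4x H100 SXM5). The hourly rate of $12.40 is an illustrative assumption, and the relative saving does not depend on it.

```python
def usd_per_million_tokens(gpu_rate_usd_per_hour: float, sustained_tok_per_s: float) -> float:
    """Cost of one million output tokens at a given hourly rate and sustained throughput."""
    if sustained_tok_per_s <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = sustained_tok_per_s * 3600
    return gpu_rate_usd_per_hour / tokens_per_hour * 1_000_000


RATE_USD_PER_HOUR = 12.40  # assumed rate for a 4x H100 SXM5 node, for illustration only

trtllm_cost = usd_per_million_tokens(RATE_USD_PER_HOUR, 6200)
vllm_cost = usd_per_million_tokens(RATE_USD_PER_HOUR, 3800)

# On the same hardware the rate cancels out: saving = 1 - 3800 / 6200.
relative_saving = 1 - trtllm_cost / vllm_cost

print(f"TensorRT-LLM: ${trtllm_cost:.2f}/M, vLLM: ${vllm_cost:.2f}/M, saving: {relative_saving:.0%}")
```

With these inputs the saving comes out just under 39 percent, which is where the "roughly 40 percent lower $/M tokens" figure comes from.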

For UK and EU sovereign workloads, TensorRT-LLM runs on Yobitel London-1 and Frankfurt-1 inside tenancies aligned to NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat, with NVIDIA Confidential Computing enabled by default. The combination of a transparent open-source engine, sovereign hardware, hardware-rooted weight confidentiality and continuous published benchmarks is what lets Yobitel customers run sub-200 ms LLM endpoints in country without ceding either visibility or control to a hosted SaaS API.

References

TensorRT-LLM on GitHub · GitHub (NVIDIA)
TensorRT-LLM Documentation · NVIDIA
NVIDIA Transformer Engine · NVIDIA Developer
Triton Inference Server with TensorRT-LLM Backend · GitHub (Triton)
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision · arXiv (Shah et al., 2024)
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees · arXiv (Li et al., 2024)
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads · arXiv (Cai et al., 2024)
NVIDIA Confidential Computing on Hopper · NVIDIA Developer
