Triton Inference Server

TL;DR

General-purpose inference server from NVIDIA, originally TensorRT Inference Server, open-sourced 2018 and BSD 3-Clause licensed. Serves models from TensorRT, TensorRT-LLM, vLLM, ONNX Runtime, PyTorch, TensorFlow, OpenVINO, FIL (forest models), Python and custom C++ backends through one server that speaks the KServe (formerly KFServing) v2 inference protocol over both HTTP and gRPC.
Provides dynamic batching, model ensembles, Business Logic Scripting, concurrent model execution, hot model load/unload, A/B and shadow routing, MIG-aware instance groups, Prometheus metrics and OpenTelemetry tracing — the production-grade glue around your engines.
Distinct from TensorRT-LLM: Triton is the SERVER, TensorRT-LLM is one possible BACKEND. The `tensorrtllm_backend` makes Triton the recommended production deployment for NVIDIA-native LLM workloads; the `vllm_backend` does the same for vLLM engines.
Distributed via the NVIDIA NGC container (`nvcr.io/nvidia/tritonserver:25.06-py3`), a community Helm chart, and as a supported serving runtime for the KServe `InferenceService` CRD on Kubernetes.
Standard production server in Yobitel's GPU Cloud and Yobibyte multi-model serving paths; the layer that lets a single GPU host vision, LLM, embedding and tabular models behind one endpoint with MIG-isolated tenancy.

Overview

Triton Inference Server is NVIDIA's open-source serving runtime for production model inference. It began life in 2018 as TensorRT Inference Server, a thin serving wrapper around TensorRT engines, and has grown into a polyglot multi-model server. A single Triton process can today host a TensorRT engine, a TensorRT-LLM engine, a vLLM engine, a PyTorch TorchScript model, an ONNX classifier, a Python pre/post-processor and an XGBoost tabular model side by side, all behind one inference API that speaks the KFServing v2 protocol over HTTP and gRPC.

Its design centres on the model repository: a directory layout where each subdirectory is a model with a `config.pbtxt` describing its backend, inputs, outputs, instance groups and scheduling policy, plus one numbered subdirectory per version. With `--model-control-mode=poll` Triton re-scans this directory and hot-loads, hot-unloads and hot-updates models as files change; with `--model-control-mode=explicit` the same operations are driven through the `/v2/repository` admin API. Per-model `instance_group` declarations pin replicas to GPU IDs, MIG slices, CPU cores or specific compute capabilities; per-model `dynamic_batching` declarations control how Triton coalesces requests; the `ensemble_scheduling` block expresses pipelines server-side as a static DAG of model calls.

The critical mental model is that Triton is a server, not an engine. It does not implement attention kernels, weight loading or autoregressive decoding directly — it delegates that work to backends. For LLM serving in production, the recommended split is to compile your model with TensorRT-LLM, deploy it behind Triton with the `tensorrtllm_backend`, and let Triton handle the HTTP / gRPC surface, request queuing, model versioning, MIG-aware concurrency and metrics. The same pattern applies for vLLM engines via the `vllm_backend`, and for ONNX / PyTorch / TensorFlow models via their respective backends.

By mid-2026 Triton ships from NVIDIA on a roughly monthly release cadence aligned with the NGC container index. Container tags follow a `YY.MM` scheme, so 24.10 is the October 2024 cut and 25.06 is the June 2025 cut in the same image namespace. Backends move at their own cadence (TensorRT-LLM is monthly, vLLM is two- to three-weekly, ONNX Runtime is quarterly) and Triton pins compatible backend versions per release. Yobibyte exposes Triton as the opt-in serving runtime for multi-model and ensemble workloads: when a Yobitel customer needs to host an LLM, a vision encoder and a tabular ranker behind one endpoint, or wants MIG-isolated tenancy on shared H100 / H200 / B200 hardware, Triton is the layer the platform reaches for. The model repository and `config.pbtxt` are generated from the workspace definition rather than authored by hand.

This entry documents the production surface: the `tritonserver` CLI, the model-repository layout and `config.pbtxt` fields, the request lifecycle and dynamic batcher mechanics, the workload patterns where Triton wins, deployment, sizing, limits, observability hooks, costs and the migration story. The aim is to help you stand up Triton for production multi-model serving with the right flags, sizing and operational practices, whether you are operating raw upstream on your own NVIDIA fleet or consuming Triton as the Yobibyte opt-in for multi-model and MIG-tenant workloads.

Quick start

The example below builds a multi-model repository with a Llama 3.1 70B TensorRT-LLM engine, a CLIP ViT-L/14 image encoder served via ONNX Runtime, and a small Python pre-processor that resizes images, then launches Triton serving all three behind the same HTTP endpoint. The final step (5) queries the CLIP and Llama models with `curl` to show the unified surface.

bash

# 0. Model repository layout
mkdir -p models/llama3_70b/1 models/clip_image/1 models/preprocess/1

# 1. Llama 3.1 70B via the TensorRT-LLM backend
cat > models/llama3_70b/config.pbtxt <<'EOF'
name: "llama3_70b"
backend: "tensorrtllm"
max_batch_size: 64

model_transaction_policy { decoupled: true }

parameters: { key: "gpt_model_type"  value: { string_value: "inflight_fused_batching" } }
parameters: { key: "gpt_model_path"  value: { string_value: "/models/llama3_70b/1" } }
parameters: { key: "max_beam_width"  value: { string_value: "1" } }

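# With KIND_GPU, count applies per listed GPU (here one instance on each of GPUs 0-3).
# Upstream tensorrtllm_backend templates typically declare KIND_CPU and let the
# backend place tensor-parallel ranks itself; match this to how the engine was built.
instance_group [{ count: 1, kind: KIND_GPU, gpus: [0,1,2,3] }]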
EOF
# (then drop the pre-built TensorRT-LLM engine files into models/llama3_70b/1/)

# 2. CLIP image encoder via ONNX Runtime
cat > models/clip_image/config.pbtxt <<'EOF'
name: "clip_image"
backend: "onnxruntime"
max_batch_size: 32

input  [{ name: "pixel_values", data_type: TYPE_FP16, dims: [3, 224, 224] }]
output [{ name: "image_embeds", data_type: TYPE_FP16, dims: [768] }]

dynamic_batching {
  preferred_batch_size: [ 4, 8, 16, 32 ]
  max_queue_delay_microseconds: 2000
}

instance_group [{ count: 2, kind: KIND_GPU, gpus: [4] }]
EOF

# 3. Python pre-processor (resize + normalise)
cat > models/preprocess/config.pbtxt <<'EOF'
name: "preprocess"
backend: "python"
max_batch_size: 32

input  [{ name: "raw_image",   data_type: TYPE_UINT8, dims: [-1, -1, 3] }]
output [{ name: "pixel_values", data_type: TYPE_FP16, dims: [3, 224, 224] }]

instance_group [{ count: 4, kind: KIND_CPU }]
EOF

# 4. Launch Triton
docker run --gpus all --rm --shm-size 8g \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v $PWD/models:/models \
    nvcr.io/nvidia/tritonserver:25.06-py3 \
    tritonserver --model-repository=/models \
                 --model-control-mode=explicit \
                 --load-model=llama3_70b --load-model=clip_image --load-model=preprocess \
                 --metrics-port=8002

# 5. Inference (KFServing v2 / Triton REST)
curl http://localhost:8000/v2/models/clip_image/infer \
    -H "Content-Type: application/json" \
    -d '{
      "inputs": [{ "name": "pixel_values", "shape": [1,3,224,224],
                   "datatype": "FP16", "data": [/* tensor floats */] }],
      "outputs": [{ "name": "image_embeds" }]
    }'

curl http://localhost:8000/v2/models/llama3_70b/generate \
    -H "Content-Type: application/json" \
    -d '{ "text_input": "Summarise Triton in 2 lines.", "max_tokens": 128 }'

Mount `/dev/shm >= 8GiB` (`--shm-size 8g` on docker or an `emptyDir { medium: Memory }` of equivalent size on Kubernetes). Triton uses shared memory aggressively for the Python and BLS backends and for NCCL collectives in TP>1 LLM engines; the default 64MiB will OOM on the first big request.
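For clients that do not shell out to `curl`, the same v2 request body can be assembled programmatically. The sketch below uses only the Python standard library; the model name `toy_scores` and the tensor names `features` and `scores` are hypothetical, and FP32 is used because it maps directly onto JSON numbers (FP16 tensors are typically sent through Triton's binary tensor data extension instead).

```python
import json
import urllib.request


def build_infer_request(input_name, shape, datatype, data, output_names):
    """Return a v2 inference request body for a single input tensor."""
    expected = 1
    for dim in shape:
        expected *= dim
    if len(data) != expected:
        raise ValueError(f"shape {shape} needs {expected} values, got {len(data)}")
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(data),  # flattened, row-major
            }
        ],
        "outputs": [{"name": name} for name in output_names],
    }


def post_infer(base_url, model_name, body, timeout_s=10.0):
    """POST the body to /v2/models/{name}/infer and return the decoded reply."""
    request = urllib.request.Request(
        f"{base_url}/v2/models/{model_name}/infer",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical model with one FP32 input of shape [1, 4].
body = build_infer_request("features", [1, 4], "FP32", [0.1, 0.2, 0.3, 0.4], ["scores"])
print(json.dumps(body, indent=2))
# With a server running: post_infer("http://localhost:8000", "toy_scores", body)
```

Validating the element count against the shape on the client side catches the most common malformed-request error before it reaches the server.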

How it works

A Triton deployment has three concentric concepts: the model repository, the backend, and the request lifecycle. The model repository is a versioned directory; each model subdirectory contains a `config.pbtxt` (protobuf describing schema and scheduling) plus numbered version directories (`1/`, `2/`, …) containing the actual model artefacts (a `.plan` for TensorRT, an engine directory for TensorRT-LLM, a `.onnx` for ONNX Runtime, a `model.py` for the Python backend, and so on). Triton reads the config on load, validates the artefacts against it, and routes inference requests to the appropriate backend.
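The repository layout described above is easy to inspect before Triton ever loads it. The following is a minimal sketch, not Triton's own loader: it walks a directory, reports whether each model has a `config.pbtxt`, and lists its numbered version directories. The helper name `scan_repository` and the throwaway models built in the temporary directory are illustrative assumptions.

```python
import tempfile
from pathlib import Path


def scan_repository(root):
    """Map each model directory to its config presence and numeric versions."""
    models = {}
    for model_dir in sorted(Path(root).iterdir()):
        if not model_dir.is_dir():
            continue
        versions = sorted(
            int(child.name)
            for child in model_dir.iterdir()
            if child.is_dir() and child.name.isdigit()
        )
        models[model_dir.name] = {
            "has_config": (model_dir / "config.pbtxt").is_file(),
            "versions": versions,
        }
    return models


# Build a small throwaway repository and scan it.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    for name, versions in {"clip_image": [1, 2], "preprocess": [1]}.items():
        for version in versions:
            (repo / name / str(version)).mkdir(parents=True)
        (repo / name / "config.pbtxt").write_text(f'name: "{name}"\n')
    layout = scan_repository(repo)
    print(layout)
```

A check like this fits naturally in CI, where a missing `config.pbtxt` or an empty version directory is cheaper to catch than at server start.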

Backends are dynamically loaded shared objects implementing the Triton backend API. The TensorRT backend executes serialised TensorRT engines; the TensorRT-LLM backend wraps the TensorRT-LLM C++ runtime and bypasses the generic dynamic batcher in favour of in-flight (continuous) batching managed inside the backend itself; the vLLM backend wraps the vLLM Python engine; the ONNX Runtime backend executes ONNX graphs; the Python backend (and its PyTriton helper) runs arbitrary user Python code; the FIL backend executes XGBoost, LightGBM and scikit-learn forest models on GPU; the OpenVINO backend runs Intel-optimised CPU and integrated-GPU graphs.

The request lifecycle for an ordinary (non-LLM) model goes as follows. The client sends an HTTP / gRPC inference request, and Triton validates it against the model's input schema. The dynamic batcher queues the request until either the configured queue delay elapses or the queue reaches a preferred batch size, then hands the assembled batch to a free model instance from the `instance_group` (`count` instances per listed GPU or MIG slice). The backend executes the forward pass, and Triton splits the batched output back into per-request responses and returns them. For LLMs the path is different: TensorRT-LLM and vLLM backends operate in decoupled mode (`model_transaction_policy { decoupled: true }`), where Triton streams partial responses back as tokens are produced and the in-flight batching loop inside the backend manages concurrency directly.
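The two dispatch triggers in the dynamic batcher (preferred batch size reached, or queue delay elapsed) can be illustrated with a toy simulation. This is a deliberately simplified sketch and not Triton's scheduler: it assumes a model instance is always free, uses a single preferred batch size, and the function name `simulate_batches` is hypothetical.

```python
def simulate_batches(arrivals_us, preferred_batch_size, max_queue_delay_us):
    """Group request arrival times (microseconds) into dispatched batches.

    A batch is dispatched as soon as the queue reaches preferred_batch_size,
    or when the oldest queued request has waited max_queue_delay_us.
    Returns a list of (dispatch_time_us, batch_size) tuples.
    """
    batches = []
    queue = []  # arrival times of requests still waiting
    for arrival in sorted(arrivals_us):
        # The oldest request timed out before this one arrived: flush first.
        if queue and arrival > queue[0] + max_queue_delay_us:
            batches.append((queue[0] + max_queue_delay_us, len(queue)))
            queue = []
        queue.append(arrival)
        if len(queue) == preferred_batch_size:
            batches.append((arrival, len(queue)))
            queue = []
    if queue:
        # Whatever is left goes out when its delay budget expires.
        batches.append((queue[0] + max_queue_delay_us, len(queue)))
    return batches


busy = simulate_batches([0, 500, 900, 1200, 5000, 5100], 4, 2000)
idle = simulate_batches([0, 3000], 4, 2000)
print(busy)
print(idle)
```

Under load the batch fills before the delay expires, so the delay costs nothing; under light traffic every request pays the full delay and still runs at batch size 1, which is why the delay is usually kept to a few milliseconds.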

Ensembles compose models server-side. The `ensemble_scheduling` block in `config.pbtxt` describes a static DAG of `step`s, each step calling another model in the repository with input tensors mapped from prior steps or from the original request. A common RAG ensemble is preprocess -> embed -> retrieve -> rerank -> generate, all executed inside Triton with no client round-trips between steps. Business Logic Scripting (BLS) extends this with dynamic Python control flow via the Python backend — useful when you need conditional branches, loops or calls to external services in the pipeline.
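Because an ensemble is a static DAG wired together by tensor names, the execution order can be derived from the `input_map` and `output_map` entries alone. The sketch below shows that derivation with the standard-library `graphlib` module (Python 3.9+); it is an illustration of the idea rather than Triton's scheduler, and the step and tensor names are hypothetical stand-ins for a RAG chain.

```python
from graphlib import TopologicalSorter


def execution_order(ensemble_inputs, steps):
    """Order ensemble steps so each runs after the steps that feed it.

    Each step is a dict with "model_name", "input_map" and "output_map";
    map values are ensemble-level tensor names, as in ensemble_scheduling.
    """
    producer = {}
    for step in steps:
        for tensor in step["output_map"].values():
            producer[tensor] = step["model_name"]

    graph = {}
    for step in steps:
        upstream = set()
        for tensor in step["input_map"].values():
            if tensor in ensemble_inputs:
                continue  # fed directly by the client request
            if tensor not in producer:
                raise ValueError(f'{step["model_name"]} reads unknown tensor "{tensor}"')
            upstream.add(producer[tensor])
        graph[step["model_name"]] = upstream
    # static_order() raises graphlib.CycleError if the steps form a cycle.
    return list(TopologicalSorter(graph).static_order())


# Steps listed in reverse on purpose: order comes from the tensor wiring.
steps = [
    {"model_name": "llama3_70b", "input_map": {"text_input": "ctx2"},
     "output_map": {"text_output": "answer"}},
    {"model_name": "rerank", "input_map": {"Q": "query", "CTX": "ctx"},
     "output_map": {"CTX2": "ctx2"}},
    {"model_name": "vector_search", "input_map": {"VEC": "q_vec"},
     "output_map": {"CTX": "ctx"}},
    {"model_name": "embed_query", "input_map": {"TOKENS": "q_tok"},
     "output_map": {"EMBED": "q_vec"}},
    {"model_name": "tokenize_query", "input_map": {"TEXT": "query"},
     "output_map": {"TOKENS": "q_tok"}},
]
order = execution_order({"query"}, steps)
print(" -> ".join(order))
```

The same pass doubles as a lint step: a misspelled tensor name in one `input_map` surfaces as a `ValueError` here instead of a load failure on the server.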

Model repository: versioned directory of `config.pbtxt` plus per-version artefacts; watched for hot-load / hot-unload / hot-update.
Backends: TensorRT, TensorRT-LLM, vLLM, ONNX Runtime, PyTorch (TorchScript and `torch.compile`), TensorFlow, OpenVINO, FIL, Python, custom C++.
Dynamic batcher: per-model `preferred_batch_size`, `max_queue_delay_microseconds`, priority queues, request timeouts.
Instance groups: pin model replicas to GPU IDs, MIG slices or CPU cores; `count` controls per-GPU concurrency.
Model warmup: optional `model_warmup` block runs synthetic requests at load time so the first real request never pays JIT cost.
Ensembles: server-side DAG of model calls expressed in `config.pbtxt`; zero client round-trips.
Business Logic Scripting (BLS): dynamic Python pipelines via the Python backend with conditional control flow.
MIG-aware: instance groups can target specific MIG slices for tenancy isolation on H100 / H200 / B200.
Hot model control: `POST /v2/repository/models/{name}/load|unload` to swap models without restart.
Protocols: HTTP / REST, gRPC streaming, KFServing v2; bundled OpenAI-compatible frontend for the TensorRT-LLM / vLLM backends.

For LLM workloads, always set `model_transaction_policy { decoupled: true }` on the `config.pbtxt` and rely on the TensorRT-LLM or vLLM backend's in-flight batcher. Triton's generic dynamic batcher coalesces requests at the model boundary, which is the wrong unit of work for autoregressive decoding — you want token-level batching, not request-level.
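The difference between request-level and token-level batching can be made concrete by counting decode steps. The toy model below is a sketch under simplifying assumptions (one token per sequence per step, a fixed number of batch slots, no prefill cost); both function names are hypothetical and neither reflects the actual TensorRT-LLM or vLLM scheduler.

```python
def static_batching_steps(output_lengths, slots):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(output_lengths), slots):
        steps += max(output_lengths[start:start + slots])
    return steps


def inflight_batching_steps(output_lengths, slots):
    """Decode steps when a finished sequence's slot is refilled immediately."""
    pending = list(output_lengths)
    active = []  # tokens still to generate for each running sequence
    steps = 0
    while pending or active:
        # Fill any free slots before the next decode step.
        while pending and len(active) < slots:
            active.append(pending.pop(0))
        steps += 1
        # Every active sequence emits one token; drop the ones that finished.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [8, 2, 2, 8]  # tokens each request will generate
static_steps = static_batching_steps(lengths, slots=2)
inflight_steps = inflight_batching_steps(lengths, slots=2)
print(static_steps, inflight_steps)
```

With request-level batching the short sequences hold their slot idle until the long one finishes; refilling slots per step removes that idle time, and the gap widens as output lengths become more uneven.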

Reference and specifications

Triton has two reference surfaces: the `tritonserver` CLI (process-level options like model repository, ports, model control mode, metrics) and the per-model `config.pbtxt` (model-level options like backend, schema, batching policy, instance groups). The table below is the canonical reference for the most-touched fields as of Triton 25.06 (June 2025). Fields not listed are either internal tuning knobs whose defaults are correct or specialised features documented in the upstream reference.

Surface	Field / flag	Type	Description
CLI	--model-repository	path	Required. Root of the model repository directory tree.
CLI	--model-control-mode	enum	none (load all on start) | poll (re-scan periodically) | explicit (load/unload via API).
CLI	--load-model	list	Model names to load at startup when --model-control-mode=explicit.
CLI	--repository-poll-secs	int	Polling interval when --model-control-mode=poll.
CLI	--http-port / --grpc-port	int	API ports (default 8000 / 8001).
CLI	--metrics-port	int	Prometheus scrape port (default 8002).
CLI	--strict-readiness	bool	/v2/health/ready only returns 200 once every model loads.
CLI	--allow-cuda-graph	bool	Enables CUDA-graph capture for compatible backends.
CLI	--backend-config	k=v	Per-backend configuration, e.g. tensorrt,coalesce-request-input=true.
CLI	--trace-config	k=v	OpenTelemetry / native trace configuration.
CLI	--exit-on-error	bool	Fail-fast on any model load failure (recommended in production).
config.pbtxt	name	string	Model name as referenced by API; must match the directory name.
config.pbtxt	backend	string	tensorrt | tensorrtllm | vllm | onnxruntime | pytorch | tensorflow | openvino | python | fil | dali.
config.pbtxt	max_batch_size	int	0 to disable batching (LLMs); otherwise the maximum batch the model can accept.
config.pbtxt	input / output	tensor list	Schema: name, datatype, dims; -1 marks dynamic dimensions.
config.pbtxt	dynamic_batching	block	preferred_batch_size, max_queue_delay_microseconds, priority_levels.
config.pbtxt	instance_group	list	count, kind (KIND_GPU / KIND_CPU / KIND_AUTO / KIND_MODEL), gpus, profile.
config.pbtxt	model_warmup	list	Synthetic requests run on load so first real request avoids JIT.
config.pbtxt	model_transaction_policy	block	decoupled: true required for streaming LLM responses.
config.pbtxt	ensemble_scheduling	block	Static DAG of step{ model_name, model_version, input_map, output_map }.
config.pbtxt	optimization	block	graph optimisation level (ONNX), cuda graphs, execution accelerators (TensorRT inside ONNX).
config.pbtxt	parameters	k=v map	Backend-specific options (e.g. gpt_model_path, max_beam_width for TensorRT-LLM).
config.pbtxt	version_policy	block	all | latest:N | specific [v1, v3].
config.pbtxt	rate_limiter	block	resources required by a request; enables fair queuing across models.
config.pbtxt	response_cache	block	enable: true caches identical requests when --response-cache-byte-size is set.
API	POST /v2/models/{name}/infer	HTTP	KFServing v2 inference endpoint.
API	POST /v2/models/{name}/generate	HTTP	Bundled generate endpoint for TensorRT-LLM / vLLM backends.
API	POST /v2/repository/models/{name}/load	HTTP	Hot-load a model when --model-control-mode=explicit.
API	GET /metrics	HTTP	Prometheus metrics scrape endpoint.
API	GET /v2/health/{live,ready}	HTTP	Liveness and readiness probes.

The `response_cache` block is often missed and is one of the cheapest optimisations for retrieval, embedding and tabular models: a request identical to a cached one returns in microseconds with no model execution. It takes effect only when the cache is also sized at process level with `--response-cache-byte-size`.
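The mechanism behind a response cache is a lookup keyed on the model and the exact request inputs, bounded by a byte budget. The sketch below illustrates that idea with a small LRU cache; it is not Triton's implementation, and the class `ResponseCache`, the `fake_ranker` model and the JSON-based hashing are all assumptions made for the example.

```python
import hashlib
import json
from collections import OrderedDict


class ResponseCache:
    """Byte-budgeted LRU cache keyed by model name plus request inputs."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()  # key -> serialized response bytes
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(model_name, inputs):
        canonical = json.dumps({"model": model_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def infer(self, model_name, inputs, run_model):
        key = self.key_for(model_name, inputs)
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return json.loads(self.entries[key])
        self.misses += 1
        response = run_model(inputs)
        payload = json.dumps(response).encode("utf-8")
        if len(payload) <= self.max_bytes:
            # Evict least recently used entries until the new one fits.
            while self.used_bytes + len(payload) > self.max_bytes:
                _, evicted = self.entries.popitem(last=False)
                self.used_bytes -= len(evicted)
            self.entries[key] = payload
            self.used_bytes += len(payload)
        return response


calls = []


def fake_ranker(inputs):
    """Stand-in for a model execution; records that it actually ran."""
    calls.append(inputs)
    return {"score": sum(inputs["features"])}


cache = ResponseCache(max_bytes=1024)
first = cache.infer("ranker", {"features": [1, 2, 3]}, fake_ranker)
second = cache.infer("ranker", {"features": [1, 2, 3]}, fake_ranker)
print(first, second, cache.hits, cache.misses, len(calls))
```

The second call never reaches the model, which is where the saving comes from; the hit rate on real traffic depends entirely on how often byte-identical requests recur.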

Workload patterns

Triton's design pays off most on three workload shapes. Pattern A is multi-model serving — one GPU box hosting an LLM, a vision encoder, an embedding model and a tabular ranker behind one endpoint, which is the difference between four FastAPI deployments and a single Helm release. Pattern B is ensemble pipelines for RAG — preprocess, embed, retrieve, rerank, generate executed server-side with no client round-trips between stages. Pattern C is MIG-partitioned shared GPU serving — multiple tenants pinned to isolated MIG slices on the same H100 / H200 / B200 with hard memory and SM isolation. These are the three patterns Yobibyte routes to a Triton-backed runtime — a team standing this up on raw upstream signs up to author the `config.pbtxt` files, manage backend version pins and operate the MIG slicing themselves; the Yobibyte workspace generates and operates them.

bash

# A — multi-model on a single 8-GPU H100 box (LLM + CLIP + tabular)
#   See the Quick start config.pbtxt files; launch with:
docker run --gpus all --shm-size 8g \
    -v $PWD/models:/models -p 8000:8000 -p 8002:8002 \
    nvcr.io/nvidia/tritonserver:25.06-py3 \
    tritonserver --model-repository=/models --metrics-port=8002

# B — server-side RAG ensemble (preprocess -> embed -> retrieve -> rerank -> generate)
cat > models/rag_pipeline/config.pbtxt <<'EOF'
name: "rag_pipeline"
platform: "ensemble"
max_batch_size: 16
input  [{ name: "query", data_type: TYPE_STRING, dims: [1] }]
output [{ name: "answer", data_type: TYPE_STRING, dims: [1] }]
ensemble_scheduling {
  step [
    { model_name: "tokenize_query" model_version: -1
      input_map  { key: "TEXT"   value: "query" }
      output_map { key: "TOKENS" value: "q_tok"  } },
    { model_name: "embed_query"   model_version: -1
      input_map  { key: "TOKENS" value: "q_tok" }
      output_map { key: "EMBED"  value: "q_vec" } },
    { model_name: "vector_search" model_version: -1
      input_map  { key: "VEC"    value: "q_vec" }
      output_map { key: "CTX"    value: "ctx"   } },
    { model_name: "rerank"        model_version: -1
      input_map  { key: "Q"      value: "query" }
      input_map  { key: "CTX"    value: "ctx"   }
      output_map { key: "CTX2"   value: "ctx2"  } },
    { model_name: "llama3_70b"    model_version: -1
      input_map  { key: "text_input" value: "ctx2" }
      output_map { key: "text_output" value: "answer" } }
  ]
}
EOF

# C — MIG-isolated multi-tenant on H100 (instance group pinned per MIG slice)
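# On an 80GB H100, MIG profile ID 19 is 1g.10gb (ID 9 is 3g.40gb), so
# nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C creates 7x 1g.10gb slices on each H100.
# The instance_group `profile` field selects a TensorRT optimization profile, not a
# MIG profile, so the slice is chosen by which MIG device the Triton process can see
# (for example CUDA_VISIBLE_DEVICES=MIG-<uuid> from `nvidia-smi -L`); inside that
# process the slice is addressed as GPU 0:
cat > models/tenant_a_llama8b/config.pbtxt <<'EOF'
name: "tenant_a_llama8b"
backend: "tensorrtllm"
max_batch_size: 16
instance_group [{ count: 1, kind: KIND_GPU, gpus: [0] }]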
parameters: { key: "gpt_model_path" value: { string_value: "/models/tenant_a_llama8b/1" } }
EOF

If the workload is a single LLM with no companion models and the goal is the simplest path from HF repo id to an OpenAI-compatible endpoint, you do not need Triton — run vLLM or SGLang standalone. Reach for Triton when (a) you have multiple model types, (b) you need server-side ensembles, or (c) you need MIG-isolated multi-tenant serving on one box.

Sizing and capacity planning

Sizing for Triton is mostly sizing for the underlying backend — a Llama 3.1 70B served via the TensorRT-LLM backend sizes the same as a stand-alone TensorRT-LLM deployment; a CLIP encoder sizes the same as the same ONNX Runtime engine elsewhere. Triton itself adds a small overhead (typically <2GB GPU memory per backend loaded and <0.5ms per request at the dynamic batcher) and the instance-group count multiplies the per-replica memory footprint.
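The memory arithmetic in that paragraph can be written down as a quick planning check. The sketch below takes the document's figures at face value (each instance multiplies the per-replica footprint, and each loaded backend adds up to roughly 2GB); the function names and the per-model footprints in the example are hypothetical placeholders, not measurements.

```python
def model_gpu_memory_gb(per_instance_gb, instance_count):
    """Footprint of one model: every instance carries a full replica."""
    return per_instance_gb * instance_count


def plan_fits(gpu_capacity_gb, models, backend_overhead_gb=2.0):
    """Return (total_gb, fits) for a list of models sharing one GPU.

    models: list of (name, backend, per_instance_gb, instance_count).
    backend_overhead_gb is charged once per distinct backend loaded.
    """
    backends = {backend for _, backend, _, _ in models}
    total = sum(model_gpu_memory_gb(gb, count) for _, _, gb, count in models)
    total += backend_overhead_gb * len(backends)
    return total, total <= gpu_capacity_gb


# Hypothetical footprints for a vision-only box with 48GB of GPU memory.
models = [
    ("clip_image", "onnxruntime", 1.5, 2),
    ("classifier", "tensorrt", 0.5, 4),
]
total_gb, fits = plan_fits(48.0, models)
print(total_gb, fits)
```

Treat the result as an upper-bound planning anchor: some backends share weights between instances, in which case the real footprint is lower than this estimate.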

The sizing table below is for the multi-model and ensemble patterns where Triton's footprint matters; pure single-LLM-engine sizing should be read from the TensorRT-LLM or vLLM entry directly. All throughput figures are mid-range observed values from InferenceBench v3 with the noted backend; treat as planning anchors.

Workload	Mix	Recommended SKU	Backend split	Throughput	Notes
Multi-model serving	1 LLM + 1 vision + 1 tabular	8x H100 SXM5	TRT-LLM (TP=4) + ONNX + FIL	see backends	Single box replaces 3 deployments.
RAG ensemble	tokenise+embed+retrieve+rerank+generate	4x H100 SXM5	Python + ONNX + TRT-LLM (TP=4)	1,200-2,000 q/s	Zero client round-trips.
MIG multi-tenant	7 tenants on 1 H100	1x H100 SXM5 80GB (7x 1g.10gb MIG)	TRT-LLM per slice	~600-900 tok/s per slice	Hard isolation per tenant.
Vision-only fleet	CLIP ViT-L/14 + classifier	1x L40S 48GB	ONNX Runtime + TensorRT	8,000-14,000 img/s	Dynamic batcher dominant.
Forest models (XGBoost)	Tabular ranker / fraud score	1x L4 24GB	FIL	150,000-400,000 row/s	Response cache adds 2-5x.
Voice + transcript	Whisper + LLM	1x H100 SXM5	PyTorch + TRT-LLM (TP=1)	Real-time	model_transaction_policy decoupled.
Sovereign multi-model	5 tenants on H200 MIG	1x H200 141GB (4x 2g.35gb + 1x 3g.71gb)	TRT-LLM + ONNX per slice	Mixed	UK / EU sovereign tenancies.

Limits and quotas

Triton enforces a small set of hard limits at the API and a larger set of per-model limits expressed in `config.pbtxt`. Operational ceilings come from the host OS, CUDA runtime and the underlying backend; the table below is the Triton-specific layer that operators tune most often.

Limit	Default	Hard ceiling	How to raise
max_batch_size	model-defined	Backend-dependent	Set in config.pbtxt; 0 disables batching (LLMs).
instance_group count per GPU	1	Memory-bounded	Raise to increase per-GPU concurrency; watch memory.
preferred_batch_size	model-defined	max_batch_size	Tune for the cliff in latency vs throughput.
max_queue_delay_microseconds	0	Application-defined	Trade p99 latency for throughput; 1-5ms typical.
Concurrent backends loaded	Unlimited	GPU memory	Limited by sum of per-backend footprints.
Models loaded simultaneously	Repository size	Memory	Use --model-control-mode=explicit for cold-tier.
Response cache size	0 (off)	--response-cache-byte-size	Set at process; per-model `response_cache { enable: true }`.
Request body size	Unlimited	HTTP server limit	Set --buffer-manager-thread-count and ingress timeouts.
Shared memory (Python/BLS)	/dev/shm	Container-defined	Mount >= 8GiB on multi-backend or TP>1 LLM deployments.
TP (within an LLM backend)	1	8 (NVLink island)	TensorRT-LLM / vLLM backend flag; not Triton itself.
File descriptors	1024	ulimit	ulimit -n 65536 in container.
Backend version compatibility	Pinned per Triton release	—	Pin Triton container tag; do not mix manually.

Observability

Triton exposes a Prometheus metrics endpoint at `/metrics` (default port 8002) with per-model request count, request duration, queue duration, compute duration, batch size distribution and pending requests, plus built-in per-GPU memory and utilisation gauges; a co-deployed DCGM exporter adds finer-grained GPU telemetry. The TensorRT-LLM and vLLM backends layer their own engine-level metrics on top (`nv_trt_llm_*` and `vllm:*` respectively). Tracing uses OpenTelemetry via `--trace-config` and emits one span per request, with sub-spans per ensemble step.

The metrics worth alerting on in production are queue-vs-compute time ratio (saturation), per-model success rate, GPU memory headroom, and any spike in `nv_inference_pending_request_count`. The following Prometheus rules cover the common failure modes.

nv_inference_count / nv_inference_exec_count: total inferences and total batch executions; their ratio is the average batch size.
nv_inference_request_duration_us: cumulative end-to-end request time per model (a counter; divide its rate by the request rate for mean latency).
nv_inference_queue_duration_us: cumulative time waiting in the dynamic batcher; should be a small fraction of request_duration.
nv_inference_compute_input_duration_us / compute_infer / compute_output: per-stage timing for kernel-level debugging.
nv_inference_pending_request_count: backlog depth per model.
nv_inference_request_success / nv_inference_request_failure: success and failure counters; alert on failure rate.
nv_gpu_utilization, nv_gpu_memory_used_bytes: built-in per-GPU metrics (DCGM gives more).
Pair with DCGM_FI_DEV_GPU_UTIL / MEM_COPY_UTIL to disambiguate compute vs memory vs idle.

yaml

# Prometheus rules for a Triton deployment
groups:
  - name: triton-sla
    interval: 30s
    rules:
      - alert: TritonModelHighLatency
        # nv_inference_request_duration_us is a cumulative counter, so this is mean latency.
        expr: sum by (model) (rate(nv_inference_request_duration_us[5m])) /
              sum by (model) (rate(nv_inference_request_success[5m])) > 500000
        for: 5m
        labels: { severity: warning, team: inference }
        annotations:
          summary: "Triton mean latency > 500ms on {{ $labels.model }}"

      - alert: TritonQueueSaturation
        expr: sum by (model) (rate(nv_inference_queue_duration_us[5m])) /
              sum by (model) (rate(nv_inference_request_duration_us[5m])) > 0.4
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Triton queue time > 40% of request time on {{ $labels.model }} — add replicas"

      - alert: TritonRequestFailures
        expr: rate(nv_inference_request_failure[5m]) > 0.01 *
              rate(nv_inference_request_success[5m])
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Triton failure rate above 1% on {{ $labels.model }}"

      - alert: TritonGPUMemoryNearCap
        expr: nv_gpu_memory_used_bytes / nv_gpu_memory_total_bytes > 0.95
        for: 10m
        labels: { severity: warning }

      - alert: TritonModelUnloaded
        expr: changes(nv_inference_count[10m]) == 0 and
              nv_inference_pending_request_count > 0
        for: 15m
        labels: { severity: critical }
        annotations:
          summary: "Model {{ $labels.model }} pending requests but no executions — check load state"
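The saturation ratio used by the TritonQueueSaturation rule can also be spot-checked by hand from a raw `/metrics` scrape. The sketch below parses Prometheus text exposition with the standard library and divides queue time by request time per model. The sample values are invented for illustration, and because the counters are cumulative the result is a since-startup ratio; a closer match to the alert would diff two scrapes.

```python
import re

# Invented sample in Prometheus text format, using metric names from this entry.
SAMPLE = """\
# HELP nv_inference_queue_duration_us Cumulative queue time in microseconds
nv_inference_request_duration_us{model="clip_image",version="1"} 900000
nv_inference_queue_duration_us{model="clip_image",version="1"} 450000
nv_inference_request_duration_us{model="preprocess",version="1"} 200000
nv_inference_queue_duration_us{model="preprocess",version="1"} 20000
"""

LINE = re.compile(r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples for every labelled sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP / TYPE comments and blank lines
        match = LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match["labels"]))
            samples.append((match["name"], labels, float(match["value"])))
    return samples


def queue_ratio_by_model(samples):
    """Fraction of request time spent queued, summed across versions."""
    queue, total = {}, {}
    for name, labels, value in samples:
        model = labels.get("model")
        if name == "nv_inference_queue_duration_us":
            queue[model] = queue.get(model, 0.0) + value
        elif name == "nv_inference_request_duration_us":
            total[model] = total.get(model, 0.0) + value
    return {m: queue.get(m, 0.0) / t for m, t in total.items() if t > 0}


ratios = queue_ratio_by_model(parse_metrics(SAMPLE))
print(ratios)
```

In this sample `clip_image` would trip the 40% threshold while `preprocess` would not.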

Cost and FinOps

Triton's direct cost impact comes from three levers: instance-group multiplication (each replica multiplies GPU memory), backend selection (TensorRT-LLM compiled engine vs vLLM PyTorch vs raw ONNX has materially different $/M tokens), and consolidation savings (one Triton box replaces three or four FastAPI services). For LLM workloads, $/M tokens follows the TensorRT-LLM or vLLM cost model directly; the Triton overhead is ~1-3% on throughput and effectively zero on per-token cost.

Consolidation is the under-counted FinOps win. A team running CLIP, BERT and Llama 3.1 8B as three separate FastAPI services on three single-GPU nodes typically pays for 3x the underused capacity, plus three sets of ops on-call. The same workload on a single Triton box with three instance groups runs at higher aggregate utilisation and removes two of the three services from the on-call rota.

Backend choice dominates per-model $/M tokens: pick the backend with the best $/M tokens for that model (TensorRT-LLM for stable LLMs, vLLM for fast-rotating LLMs, ONNX Runtime for small models).
Response cache turns identical requests into microsecond returns — for retrieval and tabular models this can lift effective throughput 2-5x.
Dynamic batcher tuning (`max_queue_delay_microseconds`) is a direct latency-throughput tradeoff worth ~30% throughput at the cost of 1-5ms p99 latency.
FOCUS-conformant billing exports from Yobitel tag each request with `triton_model` and `triton_backend` so $/M tokens can be sliced by model and backend.

Pattern	Before (FastAPI per model)	After (Triton multi-model)	Saving
3 small models on 3 L4 nodes	$1,800/month per node = $5,400	1x L40S node at $1,400	~74%
Vision + LLM + tabular on 3 H100	3x $2,300/month spot = $6,900	1x H100 SXM5 spot = $2,500	~64%
LLM + RAG embed/retrieve/rerank pipeline	5 services on shared GPU	1 Triton ensemble	1 deployment, 2-3x lower latency
MIG-partitioned 7 tenants	7x dedicated 8B endpoints	1x H100 with 7x MIG slices on Triton	~80% on dedicated GPU spend
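The percentage column for the two priced rows follows directly from the before and after figures. A minimal check of that arithmetic, using only the monthly prices quoted in the table (the helper name `saving_pct` is illustrative):

```python
def saving_pct(before_usd, after_usd):
    """Percentage saved when monthly spend drops from before to after."""
    return round(100.0 * (before_usd - after_usd) / before_usd, 1)


# Monthly figures taken from the table rows above.
small_models = saving_pct(3 * 1800, 1400)   # 3 L4 nodes -> 1 L40S node
mixed_h100 = saving_pct(3 * 2300, 2500)     # 3 H100 spot -> 1 H100 SXM5 spot
print(small_models, mixed_h100)
```

Both results round to the table's ~74% and ~64%; the remaining rows quote no prices, so they cannot be checked the same way.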

Security and compliance

Triton ships optional TLS and mutual TLS on its gRPC endpoint (the `--grpc-use-ssl*` flags) and can restrict groups of API endpoints behind a shared header key with `--http-restricted-api`; it does not authenticate end users itself. Production deployments therefore terminate TLS at an ingress (Envoy, NGINX, AWS ALB) and apply signed-JWT or mTLS at that layer. The model repository should be mounted read-only in production; combined with `--exit-on-error` and a CI-driven repository deploy, this makes the running set of models a deterministic CI artefact rather than a mutable directory.

Multi-tenant isolation has two patterns. MIG slices give hard memory and SM isolation on H100 / H200 / B200 — each tenant's `instance_group` pins to a dedicated MIG profile and tenants cannot observe each other's memory or compute. Soft isolation via separate Triton instances behind a router (one Triton per tenant) is simpler but uses more capacity. The `rate_limiter` block applies fair queuing across models on the same instance.

Regulatory posture follows the backends. For UK public-sector workloads, deploy Triton on Yobitel sovereign tenancies satisfying NCSC Cloud Security Principles and G-Cloud 14, with MIG-isolated tenants and read-only repositories. For EU GDPR, the server processes request and response data only in volatile memory and the on-disk scratch path; encrypt ephemeral storage. For US HIPAA, run inside a BAA-covered VPC; for FedRAMP, run the FIPS-validated CUDA build and pin NIAP-approved cipher suites at the ingress.

Migration and alternatives

Most production migrations to Triton come from one of three origins: raw FastAPI / Flask per model (the most common, biggest operational win), KServe with a custom predictor (when you need ensembles or multi-backend more than autoscaling), or hand-built C++ inference servers (a rare niche from earlier deployments). The decision matrix is straightforward: if you have one model and one engine, stay with the stand-alone engine server (vLLM, SGLang, TensorRT-LLM behind a thin frontend); if you have multiple model types, ensembles, or MIG-partitioned multi-tenancy, Triton is the right level of abstraction.

From	Migration effort	Operational change	Notes
FastAPI per model (3-5 services)	Medium — write config.pbtxt per model	1 deployment instead of N; one set of metrics	Biggest operational simplification.
KServe + custom predictor	Low — KServe can use Triton as runtime	Keep KServe autoscaling, gain ensembles and multi-backend	InferenceService runtime: triton.
Hand-built C++ server	High — port to Triton backend API	Lose custom code; gain hardening and metrics	Worth it for any non-research deployment.
Stand-alone vLLM / SGLang	Low — wrap with vllm_backend	Gain multi-model surface; lose engine-native API quirks	Only do this if you also need other model types.
NVIDIA TensorRT-LLM stand-alone	Low — wrap with tensorrtllm_backend	Gain hardened HTTP/gRPC + metrics + model versioning	Recommended production pattern for TRT-LLM.
Seldon Core / BentoML	Medium — re-express as config.pbtxt	Lose framework-specific deploy ergonomics; gain throughput	Worth it at scale; less so for one model.

yaml

# KServe InferenceService using Triton as runtime (the recommended production pattern)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: llama3-70b-trtllm }
spec:
  predictor:
    triton:
      storageUri: "s3://ml-platform/models/llama3-70b-trtllm/"
      runtimeVersion: "25.06-py3"
      resources:
        limits:
          nvidia.com/gpu: "4"
          memory: "120Gi"
          cpu: "16"
      env:
        - { name: TRITON_MODEL_CONTROL_MODE, value: "explicit" }
      args:
        - --strict-readiness=true
        - --exit-on-error=true
        - --metrics-port=8002
      ports:
        - { containerPort: 8000, name: http }
        - { containerPort: 8001, name: grpc }
        - { containerPort: 8002, name: metrics }

KServe's `runtime: triton` is the production-grade default for multi-backend model serving on Kubernetes. You gain Knative-style scale-to-zero, transformer/predictor split, and canary routing on top of Triton's multi-model surface.

Troubleshooting

The error table below covers the failure modes that account for roughly 80% of production Triton incidents observed on Yobitel-operated fleets. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix.

Symptom / Error	Cause	Fix
Model fails to load: 'failed to find backend'	Backend shared object missing from container.	Use the matching `-py3` tag; mount additional backends with --backend-directory.
NCCL hang during LLM model load	/dev/shm too small or NVLink P2P disabled.	Mount /dev/shm >= 8GiB; verify nvidia-smi nvlink --status.
High queue time, low compute time	Instance-group count too low for incoming RPS.	Raise instance_group count or add a second Triton replica.
LLM responses arrive only after final token	model_transaction_policy not set to decoupled.	Add `model_transaction_policy { decoupled: true }` to config.pbtxt.
Ensemble step times out	An interior step has lower max_batch_size than the ensemble batch.	Align max_batch_size across the chain or lower the ensemble batch.
GPU memory creeps after hot-load/unload cycles	Backend leaking memory across versions.	Upgrade to a Triton release whose backend fixes the leak; until then, restart Triton on a rolling schedule.
Triton readiness fails with --strict-readiness	One model failed to load.	Check the failing model in `tritonserver --log-verbose=1` startup logs.
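MIG instance group never matches	GPU not switched to MIG mode, or no MIG device exposed to the Triton process; the instance-group `profile:` field selects TensorRT optimization profiles, not MIG slices.	List slices with nvidia-smi mig -lgi and nvidia-smi -L; expose one MIG device UUID to each Triton process or container.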
Response cache shows zero hits	Per-model `response_cache { enable: true }` not set, or `--response-cache-byte-size` is 0.	Set both at process and per-model.
High p99 spikes correlated with GC	Python backend running heavy per-request allocations.	Use shared-memory tensors; switch Python BLS to C++ where possible.
Backend version mismatch error	Manually swapping a backend .so under a non-matching Triton.	Use the official NGC container; do not mix backend versions.
KFServing v2 client sends wrong dtype	Schema in config.pbtxt expects FP16, client sends FP32.	Align client to schema; or add a Python preprocess model that casts.
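
The dtype mismatch in the last row can be caught on the client before a request is sent. The following is a minimal sketch, assuming the model's input schema has already been fetched as the `inputs` list of the v2 model-metadata response (`GET /v2/models/{name}`). The helper name `find_schema_mismatches` and the sample values are hypothetical, and optional inputs are not modelled.

```python
from typing import Dict, List


def find_schema_mismatches(schema_inputs: List[Dict], request_inputs: List[Dict]) -> List[str]:
    """Compare request tensors with the inputs a model declares.

    Both arguments use the v2 protocol field names: name, datatype, shape.
    A schema dimension of -1 accepts any size.
    """
    declared = {spec["name"]: spec for spec in schema_inputs}
    problems = []
    for tensor in request_inputs:
        spec = declared.get(tensor["name"])
        if spec is None:
            problems.append(f"{tensor['name']}: not declared by the model")
            continue
        if tensor["datatype"] != spec["datatype"]:
            problems.append(
                f"{tensor['name']}: datatype {tensor['datatype']} != {spec['datatype']}"
            )
        same_rank = len(tensor["shape"]) == len(spec["shape"])
        dims_ok = same_rank and all(
            want in (-1, got) for want, got in zip(spec["shape"], tensor["shape"])
        )
        if not dims_ok:
            problems.append(f"{tensor['name']}: shape {tensor['shape']} != {spec['shape']}")
    sent = {tensor["name"] for tensor in request_inputs}
    problems.extend(
        f"{name}: declared input missing from request" for name in sorted(set(declared) - sent)
    )
    return problems


# Schema as the metadata endpoint would report it for a batched FP16 model.
schema = [{"name": "pixel_values", "datatype": "FP16", "shape": [-1, 3, 224, 224]}]
request = [{"name": "pixel_values", "datatype": "FP32", "shape": [1, 3, 224, 224]}]
issues = find_schema_mismatches(schema, request)
print(issues)
```

An empty list means the request matches the declared schema; anything else is worth fixing client-side rather than discovering as a server error.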

Where this fits in the Yobitel stack#

Triton Inference Server is the production-grade serving layer that sits between Yobitel's GPU Cloud and the runtime engines (vLLM, TensorRT-LLM, ONNX Runtime, FIL). When Yobibyte customers deploy a multi-model workload — an LLM plus a vision encoder plus a tabular ranker — or a server-side RAG ensemble, Triton is the runtime that hosts it. Customers do not normally interact with the `config.pbtxt` directly; the Yobibyte console generates the repository from a higher-level workspace definition, but the engine underneath is Triton with the appropriate backend.

For Yobitel's MIG-isolated multi-tenant offerings, Triton is the layer that serves each tenant's models from a dedicated MIG slice on shared H100 / H200 / B200 hardware. MIG provides hard memory and SM isolation at the silicon level. Because a CUDA process can address only one MIG compute instance, that isolation reaches tenants through the device exposed to each Triton process rather than through a `config.pbtxt` field (the instance-group `profile:` field selects TensorRT optimization profiles, not MIG profiles). Each tenant's models can therefore execute only on their own MIG slice, and Triton's rate limiter keeps queuing fair between the models that share a Triton instance.

For UK and EU sovereign workloads, Triton runs on the Yobitel London-1 and Frankfurt-1 regions inside tenancies that satisfy NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat. The combination of a hardened open-source server (BSD 3-Clause), sovereign hardware, MIG-isolated tenancy and transparent benchmark scoring on InferenceBench is what lets Yobitel customers run multi-model production workloads in regulated environments behind a single API endpoint.

References

Triton Inference Server on GitHub · GitHub (NVIDIA)
Triton Inference Server Documentation · NVIDIA
TensorRT-LLM Backend for Triton · GitHub
vLLM Backend for Triton · GitHub
Python Backend for Triton · GitHub
KServe — Triton Runtime · KServe
NVIDIA Multi-Instance GPU (MIG) User Guide · NVIDIA


Quick start#

bash

# 0. Model repository layout
mkdir -p models/llama3_70b/1 models/clip_image/1 models/preprocess/1

# 1. Llama 3.1 70B via the TensorRT-LLM backend
cat > models/llama3_70b/config.pbtxt <<'EOF'
name: "llama3_70b"
backend: "tensorrtllm"
max_batch_size: 64

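model_transaction_policy { decoupled: true }

# Input/output tensor declarations are omitted for brevity; copy them from the backend's template config.pbtxt.
parameters: { key: "gpt_model_type"  value: { string_value: "inflight_fused_batching" } }
parameters: { key: "gpt_model_path"  value: { string_value: "/models/llama3_70b/1" } }
parameters: { key: "max_beam_width"  value: { string_value: "1" } }
parameters: { key: "gpu_device_ids"  value: { string_value: "0,1,2,3" } }

# Tensor parallelism is fixed when the engine is built and the backend's MPI ranks claim the GPUs.
# A KIND_GPU group with count: 1 and four GPUs would instead create one instance per GPU.
instance_group [{ count: 1, kind: KIND_CPU }]
EOF
# (then drop the pre-built TensorRT-LLM engine files into models/llama3_70b/1/)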

# 2. CLIP image encoder via ONNX Runtime
cat > models/clip_image/config.pbtxt <<'EOF'
name: "clip_image"
backend: "onnxruntime"
max_batch_size: 32

input  [{ name: "pixel_values", data_type: TYPE_FP16, dims: [3, 224, 224] }]
output [{ name: "image_embeds", data_type: TYPE_FP16, dims: [768] }]

dynamic_batching {
  preferred_batch_size: [ 4, 8, 16, 32 ]
  max_queue_delay_microseconds: 2000
}

instance_group [{ count: 2, kind: KIND_GPU, gpus: [4] }]
EOF

# 3. Python pre-processor (resize + normalise)
cat > models/preprocess/config.pbtxt <<'EOF'
name: "preprocess"
backend: "python"
max_batch_size: 32

input  [{ name: "raw_image",   data_type: TYPE_UINT8, dims: [-1, -1, 3] }]
output [{ name: "pixel_values", data_type: TYPE_FP16, dims: [3, 224, 224] }]

instance_group [{ count: 4, kind: KIND_CPU }]
EOF

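# 4. Launch Triton
# The stock -py3 image does not bundle the TensorRT-LLM backend (NVIDIA publishes it in the
# -trtllm-python-py3 image), so serving all three models from one process assumes an image
# that contains the tensorrtllm, onnxruntime and python backends; set TRITON_IMAGE to it.
docker run --gpus all --rm --shm-size 8g \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v "$PWD/models:/models" \
    "${TRITON_IMAGE:-nvcr.io/nvidia/tritonserver:25.06-py3}" \
    tritonserver --model-repository=/models \
                 --model-control-mode=explicit \
                 --load-model=llama3_70b --load-model=clip_image --load-model=preprocess \
                 --metrics-port=8002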

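# 5. Inference (KFServing v2 / Triton REST)
# Triton rejects FP16 tensors encoded as JSON numbers, so clip_image uses the binary tensor
# data extension: request.bin holds the JSON header below followed by the raw little-endian
# FP16 bytes, and HEADER_LEN is the byte length of the JSON part.
#   {"inputs": [{"name": "pixel_values", "shape": [1,3,224,224], "datatype": "FP16",
#                "parameters": {"binary_data_size": 301056}}],
#    "outputs": [{"name": "image_embeds", "parameters": {"binary_data": true}}]}
curl http://localhost:8000/v2/models/clip_image/infer \
    -H "Content-Type: application/octet-stream" \
    -H "Inference-Header-Content-Length: ${HEADER_LEN}" \
    --data-binary @request.bin --output response.bin

# Assumes the served model exposes text_input / max_tokens inputs, as the backend's
# ensemble template does; the raw engine model takes token IDs instead.
curl http://localhost:8000/v2/models/llama3_70b/generate \
    -H "Content-Type: application/json" \
    -d '{ "text_input": "Summarise Triton in 2 lines.", "max_tokens": 128 }'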
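
FP16 tensors such as `pixel_values` cannot travel as JSON numbers in Triton's HTTP API; they go through the v2 binary tensor data extension, where a JSON header is followed by raw tensor bytes. The client libraries handle this automatically. The following is a minimal standard-library sketch of the framing, assuming a single FP16 input and a single output. The helper name `build_fp16_infer_request` is hypothetical.

```python
import json
import struct
from math import prod
from typing import Dict, List, Tuple


def build_fp16_infer_request(
    input_name: str, shape: List[int], values: List[float], output_name: str
) -> Tuple[Dict[str, str], bytes]:
    """Return (HTTP headers, body) for one FP16 input sent as binary tensor data."""
    if len(values) != prod(shape):
        raise ValueError(f"shape {shape} needs {prod(shape)} values, got {len(values)}")
    # Raw tensor bytes: little-endian IEEE 754 half precision, row-major order.
    raw = struct.pack(f"<{len(values)}e", *values)
    header = json.dumps({
        "inputs": [{
            "name": input_name,
            "shape": shape,
            "datatype": "FP16",
            "parameters": {"binary_data_size": len(raw)},
        }],
        # Request the FP16 output as binary data as well.
        "outputs": [{"name": output_name, "parameters": {"binary_data": True}}],
    }).encode("utf-8")
    headers = {
        "Content-Type": "application/octet-stream",
        # Tells the server where the JSON header ends and the tensor bytes begin.
        "Inference-Header-Content-Length": str(len(header)),
    }
    return headers, header + raw


headers, body = build_fp16_infer_request(
    "pixel_values", [1, 3, 224, 224], [0.0] * (3 * 224 * 224), "image_embeds"
)
print(headers["Inference-Header-Content-Length"], len(body))
```

Writing `body` to a file and posting it with the two headers to `/v2/models/clip_image/infer` yields a response framed the same way: a JSON header, then the raw output bytes.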

How it works#

Model repository: versioned directory of `config.pbtxt` plus per-version artefacts; watched for hot-load / hot-unload / hot-update.
Backends: TensorRT, TensorRT-LLM, vLLM, ONNX Runtime, PyTorch (TorchScript and `torch.compile`), TensorFlow, OpenVINO, FIL, Python, custom C++.
Dynamic batcher: per-model `preferred_batch_size`, `max_queue_delay_microseconds`, priority queues, request timeouts.
Instance groups: place model instances on specific GPU IDs or on the CPU; `count` sets the number of instances per listed GPU, and therefore per-GPU concurrency.
Model warmup: optional `model_warmup` block runs synthetic requests at load time so the first real request never pays JIT cost.
Ensembles: server-side DAG of model calls expressed in `config.pbtxt`; zero client round-trips.
Business Logic Scripting (BLS): dynamic Python pipelines via the Python backend with conditional control flow.
MIG: a Triton process runs on the MIG slice exposed to it (a CUDA process sees one slice), which gives tenancy isolation on H100 / H200 / B200 when each tenant gets its own process.
Hot model control: `POST /v2/repository/models/{name}/load|unload` to swap models without restart.
Protocols: HTTP / REST, gRPC streaming, KFServing v2; bundled OpenAI-compatible frontend for the TensorRT-LLM / vLLM backends.
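
The interaction between `preferred_batch_size` and `max_queue_delay_microseconds` is easier to see in a toy model. The sketch below is a deliberately simplified simulation, not Triton's scheduler: it assumes an idle model instance is always available, dispatches as soon as the queue length equals a preferred size, and otherwise dispatches when the oldest queued request has waited the full delay. The function name `simulate_dynamic_batcher` is hypothetical.

```python
from typing import List, Sequence, Tuple


def simulate_dynamic_batcher(
    arrivals_us: Sequence[int],
    preferred_batch_sizes: Sequence[int],
    max_queue_delay_us: int,
) -> List[Tuple[int, int]]:
    """Return (dispatch_time_us, batch_size) pairs for the toy batcher."""
    preferred = set(preferred_batch_sizes)
    batches: List[Tuple[int, int]] = []
    queue: List[int] = []  # arrival times of the requests currently queued
    for t in sorted(arrivals_us):
        # The oldest request's deadline expired before this arrival: flush first.
        if queue and t > queue[0] + max_queue_delay_us:
            batches.append((queue[0] + max_queue_delay_us, len(queue)))
            queue = []
        queue.append(t)
        # A preferred size was reached: dispatch immediately.
        if len(queue) in preferred:
            batches.append((t, len(queue)))
            queue = []
    if queue:  # the tail waits out the full delay
        batches.append((queue[0] + max_queue_delay_us, len(queue)))
    return batches


# Four requests arrive close together, then two stragglers.
batches = simulate_dynamic_batcher(
    [0, 100, 200, 300, 5000, 5100], preferred_batch_sizes=[4, 8], max_queue_delay_us=2000
)
print(batches)
```

In this model the burst leaves as a full preferred batch with no added wait, while the stragglers pay the whole queue delay before leaving as a partial batch, which is the latency cost the delay setting trades for throughput.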

Reference and specifications#

Surface	Field / flag	Type	Description
CLI	--model-repository	path	Required. Root of the model repository directory tree.
CLI	--model-control-mode	enum	none (load all on start) | poll (re-scan periodically) | explicit (load/unload via API).
CLI	--load-model	list	Model names to load at startup when --model-control-mode=explicit.
CLI	--repository-poll-secs	int	Polling interval when --model-control-mode=poll.
CLI	--http-port / --grpc-port	int	API ports (default 8000 / 8001).
CLI	--metrics-port	int	Prometheus scrape port (default 8002).
CLI	--strict-readiness	bool	/v2/health/ready only returns 200 once every model loads.
CLI	--allow-cuda-graph	bool	Enables CUDA-graph capture for compatible backends.
CLI	--backend-config	k=v	Per-backend configuration, e.g. tensorrt,coalesce-request-input=true.
CLI	--trace-config	k=v	OpenTelemetry / native trace configuration.
CLI	--exit-on-error	bool	Fail-fast on any model load failure (recommended in production).
config.pbtxt	name	string	Model name as referenced by API; must match the directory name.
config.pbtxt	backend	string	tensorrt | tensorrtllm | vllm | onnxruntime | pytorch | tensorflow | openvino | python | fil | dali.
config.pbtxt	max_batch_size	int	0 to disable batching (LLMs); otherwise the maximum batch the model can accept.
config.pbtxt	input / output	tensor list	Schema: name, data_type, dims; -1 marks dynamic dimensions.
config.pbtxt	dynamic_batching	block	preferred_batch_size, max_queue_delay_microseconds, priority_levels.
config.pbtxt	instance_group	list	count, kind (KIND_GPU / KIND_CPU / KIND_AUTO / KIND_MODEL), gpus, profile (TensorRT optimization profiles).
config.pbtxt	model_warmup	list	Synthetic requests run on load so first real request avoids JIT.
config.pbtxt	model_transaction_policy	block	decoupled: true required for streaming LLM responses.
config.pbtxt	ensemble_scheduling	block	Static DAG of step{ model_name, model_version, input_map, output_map }.
config.pbtxt	optimization	block	graph optimisation level (ONNX), cuda graphs, execution accelerators (TensorRT inside ONNX).
config.pbtxt	parameters	k=v map	Backend-specific options (e.g. gpt_model_path, max_beam_width for TensorRT-LLM).
config.pbtxt	version_policy	block	all | latest:N | specific [v1, v3].
config.pbtxt	rate_limiter	block	resources required by a request; enables fair queuing across models.
config.pbtxt	response_cache	block	enable: true caches identical requests when --response-cache-byte-size is set.
API	POST /v2/models/{name}/infer	HTTP	KFServing v2 inference endpoint.
API	POST /v2/models/{name}/generate	HTTP	Bundled generate endpoint for TensorRT-LLM / vLLM backends.
API	POST /v2/repository/models/{name}/load	HTTP	Hot-load a model when --model-control-mode=explicit.
API	GET /metrics	HTTP	Prometheus metrics scrape endpoint.
API	GET /v2/health/{live,ready}	HTTP	Liveness and readiness probes.

Workload patterns#

bash

# A — multi-model on a single 8-GPU H100 box (LLM + CLIP + tabular)
#   See the Quick start config.pbtxt files; launch with:
docker run --gpus all --shm-size 8g \
    -v $PWD/models:/models -p 8000:8000 -p 8002:8002 \
    nvcr.io/nvidia/tritonserver:25.06-py3 \
    tritonserver --model-repository=/models --metrics-port=8002

# B — server-side RAG ensemble (preprocess -> embed -> retrieve -> rerank -> generate)
cat > models/rag_pipeline/config.pbtxt <<'EOF'
name: "rag_pipeline"
platform: "ensemble"
max_batch_size: 16
input  [{ name: "query", data_type: TYPE_STRING, dims: [1] }]
output [{ name: "answer", data_type: TYPE_STRING, dims: [1] }]
ensemble_scheduling {
  step [
    { model_name: "tokenize_query" model_version: -1
      input_map  { key: "TEXT"   value: "query" }
      output_map { key: "TOKENS" value: "q_tok"  } },
    { model_name: "embed_query"   model_version: -1
      input_map  { key: "TOKENS" value: "q_tok" }
      output_map { key: "EMBED"  value: "q_vec" } },
    { model_name: "vector_search" model_version: -1
      input_map  { key: "VEC"    value: "q_vec" }
      output_map { key: "CTX"    value: "ctx"   } },
    { model_name: "rerank"        model_version: -1
      input_map  { key: "Q"      value: "query" }
      input_map  { key: "CTX"    value: "ctx"   }
      output_map { key: "CTX2"   value: "ctx2"  } },
    { model_name: "llama3_70b"    model_version: -1
      input_map  { key: "text_input" value: "ctx2" }
      output_map { key: "text_output" value: "answer" } }
  ]
}
EOF

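# C: MIG-isolated multi-tenant on H100 (one Triton process per MIG slice)
# nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C creates 7x 1g.10gb slices on an 80GB H100
# (profile ID 19 is 1g.10gb; ID 9 would create 3g.40gb slices instead).
# A CUDA process sees a single MIG compute instance, so the slice is chosen by the device
# handed to each tenant's container, not by config.pbtxt: the instance_group `profile`
# field selects TensorRT optimization profiles.
cat > models/tenant_a_llama8b/config.pbtxt <<'EOF'
name: "tenant_a_llama8b"
backend: "tensorrtllm"
max_batch_size: 16
instance_group [{ count: 1, kind: KIND_GPU, gpus: [0] }]
parameters: { key: "gpt_model_path" value: { string_value: "/models/tenant_a_llama8b/1" } }
EOF
# List slice UUIDs with `nvidia-smi -L`, then start tenant_a's Triton on one of them
# (<MIG-UUID> is a placeholder):
#   docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=<MIG-UUID> ... tritonserver --model-repository=/models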
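
Ensemble wiring mistakes (a step reading a tensor that nothing writes, or two steps writing the same tensor) surface only at load time, so it can help to check the maps before writing `config.pbtxt`. The sketch below mirrors the `rag_pipeline` steps from pattern B as plain dictionaries. The function `check_ensemble_wiring` and the dictionary layout are hypothetical, and the check does not detect cycles.

```python
from collections import Counter
from typing import Dict, List, Sequence


def check_ensemble_wiring(
    ensemble_inputs: Sequence[str], ensemble_outputs: Sequence[str], steps: List[Dict]
) -> List[str]:
    """Report tensors that are read but never written, or written more than once."""
    produced = Counter(ensemble_inputs)
    for step in steps:
        produced.update(step["output_map"].values())
    problems = [
        f'tensor "{name}" has {count} producers'
        for name, count in sorted(produced.items())
        if count > 1
    ]
    for step in steps:
        for model_input, tensor in step["input_map"].items():
            if tensor not in produced:
                problems.append(
                    f'{step["model_name"]}.{model_input} reads unknown tensor "{tensor}"'
                )
    problems += [
        f'ensemble output "{name}" is never produced'
        for name in ensemble_outputs
        if name not in produced
    ]
    return problems


# input_map / output_map: model tensor name -> ensemble tensor name.
rag_steps = [
    {"model_name": "tokenize_query", "input_map": {"TEXT": "query"}, "output_map": {"TOKENS": "q_tok"}},
    {"model_name": "embed_query", "input_map": {"TOKENS": "q_tok"}, "output_map": {"EMBED": "q_vec"}},
    {"model_name": "vector_search", "input_map": {"VEC": "q_vec"}, "output_map": {"CTX": "ctx"}},
    {"model_name": "rerank", "input_map": {"Q": "query", "CTX": "ctx"}, "output_map": {"CTX2": "ctx2"}},
    {"model_name": "llama3_70b", "input_map": {"text_input": "ctx2"}, "output_map": {"text_output": "answer"}},
]
wiring_problems = check_ensemble_wiring(["query"], ["answer"], rag_steps)
print(wiring_problems)
```

The check ignores step order on purpose: it only asks whether every tensor has exactly one producer somewhere in the pipeline.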

Sizing and capacity planning#

Workload	Mix	Recommended SKU	Backend split	Throughput	Notes
Multi-model serving	1 LLM + 1 vision + 1 tabular	8x H100 SXM5	TRT-LLM (TP=4) + ONNX + FIL	see backends	Single box replaces 3 deployments.
RAG ensemble	tokenise+embed+retrieve+rerank+generate	4x H100 SXM5	Python + ONNX + TRT-LLM (TP=4)	1,200-2,000 q/s	Zero client round-trips.
MIG multi-tenant	7 tenants on 1 H100	1x H100 SXM5 80GB (7x 1g.10gb MIG)	TRT-LLM per slice	~600-900 tok/s per slice	Hard isolation per tenant.
Vision-only fleet	CLIP ViT-L/14 + classifier	1x L40S 48GB	ONNX Runtime + TensorRT	8,000-14,000 img/s	Dynamic batcher dominant.
Forest models (XGBoost)	Tabular ranker / fraud score	1x L4 24GB	FIL	150,000-400,000 row/s	Response cache adds 2-5x.
Voice + transcript	Whisper + LLM	1x H100 SXM5	PyTorch + TRT-LLM (TP=1)	Real-time	model_transaction_policy decoupled.
Sovereign multi-model	5 tenants on H200 MIG	1x H200 141GB (2x 2g.35gb + 3x 1g.18gb)	TRT-LLM + ONNX per slice	Mixed	UK / EU sovereign tenancies.
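
When planning a MIG mix, a quick arithmetic check is that the compute slices named by the profiles (the number before `g`) sum to at most the seven compute slices a MIG-capable data-centre GPU exposes. The sketch below performs only that check. It is a necessary condition rather than a sufficient one, since memory slices and placement rules also apply, and the function names are hypothetical.

```python
import re
from typing import Dict

# Profile names look like "1g.10gb" or "3g.71gb"; the leading digit is the compute-slice count.
PROFILE_RE = re.compile(r"^(\d)g\.\d+gb$")


def compute_slices_used(layout: Dict[str, int]) -> int:
    """Sum the compute slices requested by a {profile_name: instance_count} layout."""
    total = 0
    for profile, count in layout.items():
        match = PROFILE_RE.match(profile)
        if match is None:
            raise ValueError(f"unrecognised MIG profile name: {profile}")
        total += int(match.group(1)) * count
    return total


def fits_one_gpu(layout: Dict[str, int], compute_slices: int = 7) -> bool:
    """True when the layout does not exceed the GPU's compute slices."""
    return compute_slices_used(layout) <= compute_slices


layouts = {
    "7 x 1g.10gb": {"1g.10gb": 7},
    "2 x 2g.35gb + 3 x 1g.18gb": {"2g.35gb": 2, "1g.18gb": 3},
    "4 x 2g.35gb + 1 x 3g.71gb": {"2g.35gb": 4, "3g.71gb": 1},
}
for label, layout in layouts.items():
    print(label, compute_slices_used(layout), fits_one_gpu(layout))
```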

Limits and quotas#

Limit	Default	Hard ceiling	How to raise
max_batch_size	model-defined	Backend-dependent	Set in config.pbtxt; 0 disables batching (LLMs).
instance_group count per GPU	1	Memory-bounded	Raise to increase per-GPU concurrency; watch memory.
preferred_batch_size	model-defined	max_batch_size	Tune for the cliff in latency vs throughput.
max_queue_delay_microseconds	0	Application-defined	Trade p99 latency for throughput; 1-5ms typical.
Concurrent backends loaded	Unlimited	GPU memory	Limited by sum of per-backend footprints.
Models loaded simultaneously	Repository size	Memory	Use --model-control-mode=explicit for cold-tier.
Response cache size	0 (off)	--response-cache-byte-size	Set at process; per-model `response_cache { enable: true }`.
Request body size	Unlimited	HTTP server limit	Set --buffer-manager-thread-count and ingress timeouts.
Shared memory (Python/BLS)	/dev/shm	Container-defined	Mount >= 8GiB on multi-backend or TP>1 LLM deployments.
TP (within an LLM backend)	1	8 (NVLink island)	TensorRT-LLM / vLLM backend flag; not Triton itself.
File descriptors	1024	ulimit	ulimit -n 65536 in container.
Backend version compatibility	Pinned per Triton release	—	Pin Triton container tag; do not mix manually.

Observability#

nv_inference_count / nv_inference_exec_count: inferences performed (a batch of n counts n) and batch executions; their ratio is the average batch size.
nv_inference_request_duration_us: cumulative end-to-end request time per model. It is a counter, not a histogram, so divide its rate by the request rate for mean latency; quantiles need Triton's optional latency summaries.
nv_inference_queue_duration_us: cumulative time spent waiting in the scheduler queue; should be a small fraction of request_duration.
nv_inference_compute_input_duration_us / compute_infer / compute_output: per-stage timing for kernel-level debugging.
nv_inference_pending_request_count: backlog depth per model.
nv_inference_request_success / nv_inference_request_failure: success and failure counters; alert on the failure rate.
nv_gpu_utilization, nv_gpu_memory_used_bytes: built-in per-GPU metrics (DCGM gives more).
Pair with DCGM_FI_DEV_GPU_UTIL / DCGM_FI_DEV_MEM_COPY_UTIL to disambiguate compute vs memory vs idle.
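
Because these inference metrics are cumulative counters, the useful numbers are ratios between them. The sketch below parses a scrape of `/metrics` in the Prometheus text format and derives mean latency, queue share and average batch size per model since process start. The sample values are invented for illustration, the helper names are hypothetical, and a model with no completed requests would need a zero-division guard.

```python
import re
from collections import defaultdict
from typing import Dict

SAMPLE_RE = re.compile(r"^(\w+)\{([^}]*)\}\s+(\S+)$")
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def per_model_totals(metrics_text: str) -> Dict[str, Dict[str, float]]:
    """Sum every labelled sample per model (all versions together)."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # comment lines, blank lines and unlabelled samples
        metric, labels, value = match.groups()
        model = dict(LABEL_RE.findall(labels)).get("model")
        if model is not None:
            totals[model][metric] += float(value)
    return totals


def summarise(counters: Dict[str, float]) -> Dict[str, float]:
    """Derive lifetime ratios from one model's cumulative counters."""
    requests = counters["nv_inference_request_success"]
    duration = counters["nv_inference_request_duration_us"]
    return {
        "mean_latency_us": duration / requests,
        "queue_share": counters["nv_inference_queue_duration_us"] / duration,
        "avg_batch_size": counters["nv_inference_count"] / counters["nv_inference_exec_count"],
    }


SCRAPE = """\
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="clip_image",version="1"} 1000
nv_inference_count{model="clip_image",version="1"} 1000
nv_inference_exec_count{model="clip_image",version="1"} 125
nv_inference_request_duration_us{model="clip_image",version="1"} 8000000
nv_inference_queue_duration_us{model="clip_image",version="1"} 2000000
"""
summary = summarise(per_model_totals(SCRAPE)["clip_image"])
print(summary)
```

Prometheus `rate()` expressions compute the same ratios over a sliding window instead of over the whole process lifetime.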

yaml

# Prometheus rules for a Triton deployment
groups:
  - name: triton-sla
    interval: 30s
    rules:
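      - alert: TritonModelHighLatency
        # nv_inference_request_duration_us is a cumulative counter without _bucket series,
        # so this rule tracks mean latency rather than a quantile.
        expr: sum by (model) (rate(nv_inference_request_duration_us[5m])) /
              sum by (model) (rate(nv_inference_request_success[5m])) > 500000
        for: 5m
        labels: { severity: warning, team: inference }
        annotations:
          summary: "Triton mean latency > 500ms on {{ $labels.model }}"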

      - alert: TritonQueueSaturation
        expr: sum by (model) (rate(nv_inference_queue_duration_us[5m])) /
              sum by (model) (rate(nv_inference_request_duration_us[5m])) > 0.4
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Triton queue time > 40% of request time on {{ $labels.model }} — add replicas"

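      - alert: TritonRequestFailures
        # Aggregate both sides by model so extra labels on the failure series
        # do not break the comparison.
        expr: sum by (model) (rate(nv_inference_request_failure[5m])) > 0.01 *
              sum by (model) (rate(nv_inference_request_success[5m]))
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Triton failure rate above 1% on {{ $labels.model }}"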

      - alert: TritonGPUMemoryNearCap
        expr: nv_gpu_memory_used_bytes / nv_gpu_memory_total_bytes > 0.95
        for: 10m
        labels: { severity: warning }

      - alert: TritonModelUnloaded
        expr: changes(nv_inference_count[10m]) == 0 and
              nv_inference_pending_request_count > 0
        for: 15m
        labels: { severity: critical }
        annotations:
          summary: "Model {{ $labels.model }} pending requests but no executions — check load state"

Cost and FinOps#

Backend choice dominates per-model $/M tokens: benchmark each model on its candidate backends and keep the cheapest (typically TensorRT-LLM for stable LLMs, vLLM for fast-rotating LLMs, ONNX Runtime for small models).
Response cache turns identical requests into microsecond returns — for retrieval and tabular models this can lift effective throughput 2-5x.
Dynamic batcher tuning (`max_queue_delay_microseconds`) is a direct latency-throughput tradeoff worth ~30% throughput at the cost of 1-5ms p99 latency.
FOCUS-conformant billing exports from Yobitel tag each request with `triton_model` and `triton_backend` so $/M tokens can be sliced by model and backend.

Pattern	Before (FastAPI per model)	After (Triton multi-model)	Saving
3 small models on 3 L4 nodes	$1,800/month per node = $5,400	1x L40S node at $1,400	~74%
Vision + LLM + tabular on 3 H100	3x $2,300/month spot = $6,900	1x H100 SXM5 spot = $2,500	~64%
LLM + RAG embed/retrieve/rerank pipeline	5 services on shared GPU	1 Triton ensemble	1 deployment, 2-3x lower latency
MIG-partitioned 7 tenants	7x dedicated 8B endpoints	1x H100 with 7x MIG slices on Triton	~80% on dedicated GPU spend


