TL;DR
- General-purpose inference server from NVIDIA, originally TensorRT Inference Server, open-sourced in 2018 under the BSD 3-Clause licence. Serves models from TensorRT, TensorRT-LLM, vLLM, ONNX Runtime, PyTorch, TensorFlow, OpenVINO, FIL (forest models), Python and custom C++ backends through one endpoint that speaks the KServe v2 (formerly KFServing v2) inference protocol over HTTP and gRPC.
- Provides dynamic batching, model ensembles, Business Logic Scripting, concurrent model execution, hot model load/unload, A/B and shadow routing, MIG-aware instance groups, Prometheus metrics and OpenTelemetry tracing — the production-grade glue around your engines.
- Distinct from TensorRT-LLM: Triton is the SERVER, TensorRT-LLM is one possible BACKEND. The `tensorrtllm_backend` makes Triton the recommended production deployment for NVIDIA-native LLM workloads; the `vllm_backend` does the same for vLLM engines.
- Distributed via the NVIDIA NGC container (`nvcr.io/nvidia/tritonserver:25.06-py3`), a community Helm chart, and as a supported serving runtime for the KServe `InferenceService` CRD on Kubernetes.
- Standard production server in Yobitel's GPU Cloud and Yobibyte multi-model serving paths; the layer that lets a single GPU host vision, LLM, embedding and tabular models behind one endpoint with MIG-isolated tenancy.
Overview
Triton Inference Server is NVIDIA's open-source serving runtime for production model inference. It began life in 2018 as TensorRT Inference Server, a thin serving wrapper around TensorRT engines, and has grown into a polyglot multi-model server: a single Triton process can today host a TensorRT engine, a TensorRT-LLM engine, a vLLM engine, a PyTorch TorchScript model, an ONNX classifier, a Python pre/post-processor and an XGBoost tabular model side by side, all behind one inference API that speaks the KServe v2 (formerly KFServing v2) protocol over HTTP and gRPC.
Its design centres on the model repository: a directory layout where each subdirectory is a model with a `config.pbtxt` describing its backend, inputs, outputs, instance groups and scheduling policy, plus numbered version directories. Depending on `--model-control-mode`, Triton loads everything at startup, polls the directory and hot-loads, hot-unloads and hot-updates models as files change, or waits for explicit load and unload calls on the `/v2/repository` API. Per-model `instance_group` declarations pin replicas to specific GPU IDs or to the CPU; per-model `dynamic_batching` declarations control how Triton coalesces requests; the `ensemble_scheduling` block expresses pipelines server-side as a static DAG of model calls.
The critical mental model is that Triton is a server, not an engine. It does not implement attention kernels, weight loading or autoregressive decoding directly — it delegates that work to backends. For LLM serving in production, the recommended split is to compile your model with TensorRT-LLM, deploy it behind Triton with the `tensorrtllm_backend`, and let Triton handle the HTTP / gRPC surface, request queuing, model versioning, MIG-aware concurrency and metrics. The same pattern applies for vLLM engines via the `vllm_backend`, and for ONNX / PyTorch / TensorFlow models via their respective backends.
By mid-2026 Triton ships from NVIDIA on a roughly monthly release cadence aligned with the NGC container index, whose tags follow a YY.MM scheme (24.10 was the October 2024 cut; 25.06, the tag used throughout this entry, is the June 2025 cut in the same image namespace). Backends move at their own cadence (TensorRT-LLM is monthly, vLLM is two- to three-weekly, ONNX Runtime is quarterly) and Triton pins compatible backend versions per release. Yobibyte exposes Triton as the opt-in serving runtime for multi-model and ensemble workloads: when a Yobitel customer needs to host an LLM, a vision encoder and a tabular ranker behind one endpoint, or wants MIG-isolated tenancy on shared H100 / H200 / B200 hardware, Triton is the layer the platform reaches for. The model repository and `config.pbtxt` are generated from the workspace definition rather than authored by hand.
This entry documents the production surface: the `tritonserver` CLI, the model-repository layout and `config.pbtxt` fields, the request lifecycle and dynamic batcher mechanics, the workload patterns where Triton wins, deployment, sizing, limits, observability hooks, costs and the migration story. The aim is to help you stand up Triton for production multi-model serving with the right flags, sizing and operational practices, whether you are operating raw upstream on your own NVIDIA fleet or consuming Triton as the Yobibyte opt-in for multi-model and MIG-tenant workloads.
Quick start
The example below builds a multi-model repository with a Llama 3.1 70B TensorRT-LLM engine, a CLIP ViT-L/14 image encoder served via ONNX Runtime, and a small Python pre-processor that resizes images, then launches Triton serving all three behind the same HTTP endpoint. The final step (5) queries the CLIP and Llama models with `curl` to show the unified surface.
# 0. Model repository layout
mkdir -p models/llama3_70b/1 models/clip_image/1 models/preprocess/1
# 1. Llama 3.1 70B via the TensorRT-LLM backend
cat > models/llama3_70b/config.pbtxt <<'EOF'
name: "llama3_70b"
backend: "tensorrtllm"
max_batch_size: 64
model_transaction_policy { decoupled: true }
parameters: { key: "gpt_model_type" value: { string_value: "inflight_fused_batching" } }
parameters: { key: "gpt_model_path" value: { string_value: "/models/llama3_70b/1" } }
parameters: { key: "max_beam_width" value: { string_value: "1" } }
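# The TensorRT-LLM backend places the engine on GPUs itself, so the instance group is
# KIND_CPU and the devices are named in gpu_device_ids. (Listing gpus: [0,1,2,3] under
# KIND_GPU would create one instance per GPU rather than one TP=4 instance.)
# Abbreviated: the input/output tensor declarations from the backend's template config are omitted.
parameters: { key: "gpu_device_ids" value: { string_value: "0,1,2,3" } }
instance_group [{ count: 1, kind: KIND_CPU }]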
EOF
# (then drop the pre-built TensorRT-LLM engine files into models/llama3_70b/1/)
# 2. CLIP image encoder via ONNX Runtime
cat > models/clip_image/config.pbtxt <<'EOF'
name: "clip_image"
backend: "onnxruntime"
max_batch_size: 32
input [{ name: "pixel_values", data_type: TYPE_FP16, dims: [3, 224, 224] }]
output [{ name: "image_embeds", data_type: TYPE_FP16, dims: [768] }]
dynamic_batching {
preferred_batch_size: [ 4, 8, 16, 32 ]
max_queue_delay_microseconds: 2000
}
instance_group [{ count: 2, kind: KIND_GPU, gpus: [4] }]
EOF
# 3. Python pre-processor (resize + normalise)
cat > models/preprocess/config.pbtxt <<'EOF'
name: "preprocess"
backend: "python"
max_batch_size: 32
input [{ name: "raw_image", data_type: TYPE_UINT8, dims: [-1, -1, 3] }]
output [{ name: "pixel_values", data_type: TYPE_FP16, dims: [3, 224, 224] }]
instance_group [{ count: 4, kind: KIND_CPU }]
EOF
# 4. Launch Triton
docker run --gpus all --rm --shm-size 8g \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v $PWD/models:/models \
nvcr.io/nvidia/tritonserver:25.06-py3 \
tritonserver --model-repository=/models \
--model-control-mode=explicit \
--load-model=llama3_70b --load-model=clip_image --load-model=preprocess \
--metrics-port=8002
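# 5. Inference (KServe v2 / Triton REST)
# Note: FP16 has no JSON representation in this protocol, so a real FP16 tensor must be sent
# with the binary tensor data extension; the "data" array below is only a placeholder.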
curl http://localhost:8000/v2/models/clip_image/infer \
-H "Content-Type: application/json" \
-d '{
"inputs": [{ "name": "pixel_values", "shape": [1,3,224,224],
"datatype": "FP16", "data": [/* tensor floats */] }],
"outputs": [{ "name": "image_embeds" }]
}'
curl http://localhost:8000/v2/models/llama3_70b/generate \
-H "Content-Type: application/json" \
-d '{ "text_input": "Summarise Triton in 2 lines.", "max_tokens": 128 }'

Mount `/dev/shm >= 8GiB` (`--shm-size 8g` on docker or an `emptyDir { medium: Memory }` of equivalent size on Kubernetes). Triton uses shared memory aggressively for the Python and BLS backends and for NCCL collectives in TP>1 LLM engines; the default 64MiB will OOM on the first big request.
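The `clip_image` request above can also be built from Python. The sketch below assumes Triton's binary tensor data extension, in which the JSON part of the request is followed by raw little-endian tensor bytes and its length is announced in the `Inference-Header-Content-Length` header. The helper name `build_fp16_infer_request` is illustrative, and the request is only constructed, not sent.

```python
import json
import struct
import urllib.request


def build_fp16_infer_request(url, input_name, shape, values, output_name):
    """Build (without sending) a KServe v2 request carrying FP16 data as raw bytes."""
    count = 1
    for dim in shape:
        count *= dim
    if len(values) != count:
        raise ValueError(f"expected {count} values, got {len(values)}")
    # '<' = little-endian, 'e' = IEEE 754 half precision (2 bytes per element).
    raw = struct.pack(f"<{count}e", *values)
    header = json.dumps({
        "inputs": [{
            "name": input_name,
            "shape": list(shape),
            "datatype": "FP16",
            # No "data" field: the tensor bytes follow the JSON header.
            "parameters": {"binary_data_size": len(raw)},
        }],
        # Ask for the FP16 output as binary as well.
        "outputs": [{"name": output_name, "parameters": {"binary_data": True}}],
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=header + raw,
        method="POST",
        headers={
            "Content-Type": "application/octet-stream",
            "Inference-Header-Content-Length": str(len(header)),
        },
    )


request = build_fp16_infer_request(
    "http://localhost:8000/v2/models/clip_image/infer",
    "pixel_values", (1, 3, 224, 224), [0.0] * (3 * 224 * 224), "image_embeds",
)
# urllib.request.urlopen(request) would send it to a running server.
```

The response to such a request arrives in the same two-part form, so a real client would read the response's `Inference-Header-Content-Length` header before decoding the embedding bytes.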
How it works
A Triton deployment has three concentric concepts: the model repository, the backend, and the request lifecycle. The model repository is a versioned directory; each model subdirectory contains a `config.pbtxt` (protobuf describing schema and scheduling) plus numbered version directories (`1/`, `2/`, …) containing the actual model artefacts (a `.plan` for TensorRT, an engine directory for TensorRT-LLM, a `.onnx` for ONNX Runtime, a `model.py` for the Python backend, and so on). Triton reads the config on load, validates the artefacts against it, and routes inference requests to the appropriate backend.
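The layout described above is easy to check before a deploy. The following is a minimal sketch, assuming the convention stated here (a `config.pbtxt` plus at least one numbered version directory per model); `check_repository` is a hypothetical helper, not part of Triton, and it does not parse the config itself.

```python
import tempfile
from pathlib import Path


def check_repository(root):
    """Return {model_name: [problem, ...]} for models that break the basic layout."""
    problems = {}
    for model_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        issues = []
        if not (model_dir / "config.pbtxt").is_file():
            issues.append("missing config.pbtxt")
        versions = [d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()]
        if not versions:
            issues.append("no numbered version directory")
        if issues:
            problems[model_dir.name] = issues
    return problems


# Demo on a throwaway repository: one well-formed model, one empty directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "clip_image" / "1").mkdir(parents=True)
    (root / "clip_image" / "config.pbtxt").write_text('name: "clip_image"\n')
    (root / "broken").mkdir()
    report = check_repository(root)

print(report)
```

Running a check like this in CI catches the most common repository mistakes before Triton ever tries to load the model.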
Backends are dynamically loaded shared objects implementing the Triton backend API. The TensorRT backend executes serialised TensorRT engines; the TensorRT-LLM backend wraps the TensorRT-LLM C++ runtime and bypasses the generic dynamic batcher in favour of in-flight (continuous) batching managed inside the backend itself; the vLLM backend wraps the vLLM Python engine; the ONNX Runtime backend executes ONNX graphs; the Python backend runs arbitrary user Python code (PyTriton is a separate NVIDIA project that wraps Triton in a Python-first interface); the FIL backend executes XGBoost, LightGBM and scikit-learn forest models on GPU; the OpenVINO backend runs Intel-optimised CPU and integrated-GPU graphs.
The request lifecycle for an ordinary (non-LLM) model goes: client sends an HTTP / gRPC inference request; Triton validates the request against the model's input schema (authentication, if any, happens upstream at the ingress); the dynamic batcher queues the request until either the configured queue delay elapses or the queue reaches a preferred batch size; the batcher hands the assembled batch to a free model instance from the `instance_group` (`count` instances per listed GPU); the backend executes the forward pass; Triton splits the batched output back into per-request responses and returns them. For LLMs the path is different: TensorRT-LLM and vLLM backends operate in decoupled mode (`model_transaction_policy { decoupled: true }`), where Triton streams partial responses back as tokens are produced and the in-flight batching loop inside the backend manages concurrency directly.
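The queue-delay versus preferred-size trade-off can be illustrated with a toy model. This is a deliberately simplified sketch, not Triton's scheduler: it assumes a model instance is always free, dispatches as soon as the pending count hits any preferred size, and flushes a partial batch once the oldest request has waited `max_delay_us`. The function name and arguments are illustrative.

```python
def simulate_batches(arrivals_us, preferred_sizes, max_delay_us, max_batch_size):
    """Group request indices into batches under a toy dynamic-batching policy."""
    batches = []
    pending = []  # (request index, arrival time in microseconds)

    def flush():
        if pending:
            batches.append([index for index, _ in pending])
            pending.clear()

    for index, now in enumerate(arrivals_us):
        # The oldest queued request has waited long enough: send what we have.
        if pending and now - pending[0][1] >= max_delay_us:
            flush()
        pending.append((index, now))
        # A preferred (or maximum) batch size has been reached: dispatch immediately.
        if len(pending) in preferred_sizes or len(pending) >= max_batch_size:
            flush()
    flush()  # Whatever is left goes out when its delay expires.
    return batches


burst = simulate_batches([0, 100, 200, 300, 5000], [4, 8], 2000, 32)
trickle = simulate_batches([0, 500, 3000], [4], 2000, 32)
print(burst, trickle)
```

In the burst case four requests arriving within 300 microseconds form one batch, while in the trickle case the 2 ms delay expires first and a batch of two is sent; this is the latency cost that `max_queue_delay_microseconds` trades for throughput.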
Ensembles compose models server-side. The `ensemble_scheduling` block in `config.pbtxt` describes a static DAG of `step`s, each step calling another model in the repository with input tensors mapped from prior steps or from the original request. A common RAG ensemble is preprocess -> embed -> retrieve -> rerank -> generate, all executed inside Triton with no client round-trips between steps. Business Logic Scripting (BLS) extends this with dynamic Python control flow via the Python backend — useful when you need conditional branches, loops or calls to external services in the pipeline.
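Because an ensemble is pure data flow, its wiring can be checked offline. The sketch below assumes steps are listed in execution order (a simplification made here for clarity) and verifies that every tensor a step reads is either an ensemble input or the output of an earlier step. The step and tensor names are hypothetical.

```python
def check_ensemble(ensemble_inputs, ensemble_outputs, steps):
    """Return a list of wiring errors; an empty list means every tensor has a producer."""
    available = set(ensemble_inputs)
    errors = []
    for step in steps:
        for model_input, tensor in step["input_map"].items():
            if tensor not in available:
                errors.append(
                    f'{step["model_name"]}.{model_input} reads unknown tensor "{tensor}"'
                )
        # Outputs of this step become readable by every later step.
        available.update(step["output_map"].values())
    for tensor in ensemble_outputs:
        if tensor not in available:
            errors.append(f'ensemble output "{tensor}" is never produced')
    return errors


rag_steps = [
    {"model_name": "embed_query", "input_map": {"TEXT": "query"}, "output_map": {"EMBED": "q_vec"}},
    {"model_name": "vector_search", "input_map": {"VEC": "q_vec"}, "output_map": {"CTX": "ctx"}},
    {"model_name": "generate", "input_map": {"PROMPT": "ctx"}, "output_map": {"TEXT_OUT": "answer"}},
]
ok_errors = check_ensemble(["query"], ["answer"], rag_steps)
bad_errors = check_ensemble(["query"], ["answer"], rag_steps[1:])  # embed step removed
print(ok_errors, bad_errors)
```

Dropping the embedding step leaves `vector_search` reading a tensor nobody produces, which is the kind of mistake that otherwise surfaces only when Triton rejects the ensemble at load time.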
- Model repository: versioned directory of `config.pbtxt` plus per-version artefacts; watched for hot-load / hot-unload / hot-update.
- Backends: TensorRT, TensorRT-LLM, vLLM, ONNX Runtime, PyTorch (TorchScript and `torch.compile`), TensorFlow, OpenVINO, FIL, Python, custom C++.
- Dynamic batcher: per-model `preferred_batch_size`, `max_queue_delay_microseconds`, priority queues, request timeouts.
- Instance groups: pin model replicas to GPU IDs or the CPU; `count` controls per-GPU concurrency.
- Model warmup: optional `model_warmup` block runs synthetic requests at load time so the first real request never pays JIT cost.
- Ensembles: server-side DAG of model calls expressed in `config.pbtxt`; zero client round-trips.
- Business Logic Scripting (BLS): dynamic Python pipelines via the Python backend with conditional control flow.
- MIG deployment: a MIG slice exposed to a Triton process appears as an ordinary GPU, so running one Triton process per slice gives tenancy isolation on H100 / H200 / B200.
- Hot model control: `POST /v2/repository/models/{name}/load|unload` to swap models without restart.
- Protocols: KServe v2 (formerly KFServing v2) over HTTP / REST and gRPC, including gRPC streaming; bundled OpenAI-compatible frontend for the TensorRT-LLM / vLLM backends.
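The hot model control endpoints listed above take a plain HTTP POST. A minimal sketch of building those calls follows; `repository_request` is an illustrative helper, the base URL is an assumption, and the server must be running with `--model-control-mode=explicit` for the calls to be accepted.

```python
import urllib.request


def repository_request(base_url, model_name, action):
    """Build (without sending) a load or unload call for one model."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action}")
    url = f"{base_url.rstrip('/')}/v2/repository/models/{model_name}/{action}"
    return urllib.request.Request(
        url,
        data=b"{}",  # empty JSON body
        method="POST",
        headers={"Content-Type": "application/json"},
    )


load_call = repository_request("http://localhost:8000/", "clip_image", "load")
unload_call = repository_request("http://localhost:8000", "clip_image", "unload")
# urllib.request.urlopen(load_call) would trigger the load on a running server.
print(load_call.full_url)
```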
For LLM workloads, always set `model_transaction_policy { decoupled: true }` on the `config.pbtxt` and rely on the TensorRT-LLM or vLLM backend's in-flight batcher. Triton's generic dynamic batcher coalesces requests at the model boundary, which is the wrong unit of work for autoregressive decoding — you want token-level batching, not request-level.
Reference and specifications
Triton has two reference surfaces: the `tritonserver` CLI (process-level options like model repository, ports, model control mode, metrics) and the per-model `config.pbtxt` (model-level options like backend, schema, batching policy, instance groups). The table below is the canonical reference for the most-touched fields as of Triton 25.06 (the June 2025 container tag). Fields not listed are either internal tuning knobs whose defaults are correct or specialised features documented in the upstream reference.
| Surface | Field / flag | Type | Description |
|---|---|---|---|
| CLI | --model-repository | path | Required. Root of the model repository directory tree. |
| CLI | --model-control-mode | enum | One of none (load all on start), poll (re-scan periodically) or explicit (load/unload via API). |
| CLI | --load-model | list | Model names to load at startup when --model-control-mode=explicit. |
| CLI | --repository-poll-secs | int | Polling interval when --model-control-mode=poll. |
| CLI | --http-port / --grpc-port | int | API ports (default 8000 / 8001). |
| CLI | --metrics-port | int | Prometheus scrape port (default 8002). |
| CLI | --strict-readiness | bool | /v2/health/ready only returns 200 once every model loads. |
| CLI | --allow-cuda-graph | bool | Enables CUDA-graph capture for compatible backends. |
| CLI | --backend-config | k=v | Per-backend configuration, e.g. tensorrt,coalesce-request-input=true. |
| CLI | --trace-config | k=v | OpenTelemetry / native trace configuration. |
| CLI | --exit-on-error | bool | Fail-fast on any model load failure (recommended in production). |
| config.pbtxt | name | string | Model name as referenced by API; must match the directory name. |
| config.pbtxt | backend | string | One of tensorrt, tensorrtllm, vllm, onnxruntime, pytorch, tensorflow, openvino, python, fil, dali. |
| config.pbtxt | max_batch_size | int | 0 means no Triton-level batching (for example the vLLM backend, which batches internally); otherwise the maximum batch the model can accept. |
| config.pbtxt | input / output | tensor list | Schema: name, datatype, dims; -1 marks dynamic dimensions. |
| config.pbtxt | dynamic_batching | block | preferred_batch_size, max_queue_delay_microseconds, priority_levels. |
| config.pbtxt | instance_group | list | count, kind (KIND_GPU / KIND_CPU / KIND_AUTO / KIND_MODEL), gpus, profile. |
| config.pbtxt | model_warmup | list | Synthetic requests run on load so first real request avoids JIT. |
| config.pbtxt | model_transaction_policy | block | decoupled: true required for streaming LLM responses. |
| config.pbtxt | ensemble_scheduling | block | Static DAG of step{ model_name, model_version, input_map, output_map }. |
| config.pbtxt | optimization | block | graph optimisation level (ONNX), cuda graphs, execution accelerators (TensorRT inside ONNX). |
| config.pbtxt | parameters | k=v map | Backend-specific options (e.g. gpt_model_path, max_beam_width for TensorRT-LLM). |
| config.pbtxt | version_policy | block | One of all, latest (the N most recent versions) or specific (an explicit list of versions). |
| config.pbtxt | rate_limiter | block | resources required by a request; enables fair queuing across models. |
| config.pbtxt | response_cache | block | enable: true caches identical requests when --response-cache-byte-size is set. |
| API | POST /v2/models/{name}/infer | HTTP | KServe v2 (formerly KFServing v2) inference endpoint. |
| API | POST /v2/models/{name}/generate | HTTP | Bundled generate endpoint for TensorRT-LLM / vLLM backends. |
| API | POST /v2/repository/models/{name}/load | HTTP | Hot-load a model when --model-control-mode=explicit. |
| API | GET /metrics | HTTP | Prometheus metrics scrape endpoint. |
| API | GET /v2/health/{live,ready} | HTTP | Liveness and readiness probes. |
The `response_cache` block is often missed and is the single most cost-effective optimisation for retrieval, embedding and tabular models — identical requests return in microseconds with no model execution. Pair with `--response-cache-byte-size` at process level.
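The effect of the response cache can be illustrated with a toy stand-in. This is a sketch of the idea only, not Triton's implementation: it assumes a cache keyed on a hash of the model name, version and inputs, and the class and method names are hypothetical.

```python
import hashlib
import json


class ToyResponseCache:
    """Return a stored response when the same model sees identical inputs again."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def infer(self, model_name, version, inputs, run_model):
        payload = json.dumps([model_name, version, inputs], sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]  # no model execution on a hit
        self.misses += 1
        result = run_model(inputs)
        self._store[key] = result
        return result


executions = []


def fake_ranker(inputs):
    """Stand-in for a model call; records that it actually ran."""
    executions.append(inputs)
    return {"score": sum(inputs["features"])}


cache = ToyResponseCache()
first = cache.infer("ranker", 1, {"features": [1, 2, 3]}, fake_ranker)
second = cache.infer("ranker", 1, {"features": [1, 2, 3]}, fake_ranker)
third = cache.infer("ranker", 1, {"features": [4, 5, 6]}, fake_ranker)
print(cache.hits, cache.misses, len(executions))
```

Only two of the three requests reach the model; the benefit in production depends entirely on how often real traffic repeats identical inputs.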
Workload patterns
Triton's design pays off most on three workload shapes. Pattern A is multi-model serving — one GPU box hosting an LLM, a vision encoder, an embedding model and a tabular ranker behind one endpoint, which is the difference between four FastAPI deployments and a single Helm release. Pattern B is ensemble pipelines for RAG — preprocess, embed, retrieve, rerank, generate executed server-side with no client round-trips between stages. Pattern C is MIG-partitioned shared GPU serving — multiple tenants pinned to isolated MIG slices on the same H100 / H200 / B200 with hard memory and SM isolation. These are the three patterns Yobibyte routes to a Triton-backed runtime — a team standing this up on raw upstream signs up to author the `config.pbtxt` files, manage backend version pins and operate the MIG slicing themselves; the Yobibyte workspace generates and operates them.
# A — multi-model on a single 8-GPU H100 box (LLM + CLIP + tabular)
# See the Quick start config.pbtxt files; launch with:
docker run --gpus all --shm-size 8g \
-v $PWD/models:/models -p 8000:8000 -p 8002:8002 \
nvcr.io/nvidia/tritonserver:25.06-py3 \
tritonserver --model-repository=/models --metrics-port=8002
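# B: server-side RAG ensemble (preprocess -> embed -> retrieve -> rerank -> generate)
# An ensemble still needs its own model directory with a (possibly empty) version folder.
mkdir -p models/rag_pipeline/1
cat > models/rag_pipeline/config.pbtxt <<'EOF'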
name: "rag_pipeline"
platform: "ensemble"
max_batch_size: 16
input [{ name: "query", data_type: TYPE_STRING, dims: [1] }]
output [{ name: "answer", data_type: TYPE_STRING, dims: [1] }]
ensemble_scheduling {
step [
{ model_name: "tokenize_query" model_version: -1
input_map { key: "TEXT" value: "query" }
output_map { key: "TOKENS" value: "q_tok" } },
{ model_name: "embed_query" model_version: -1
input_map { key: "TOKENS" value: "q_tok" }
output_map { key: "EMBED" value: "q_vec" } },
{ model_name: "vector_search" model_version: -1
input_map { key: "VEC" value: "q_vec" }
output_map { key: "CTX" value: "ctx" } },
{ model_name: "rerank" model_version: -1
input_map { key: "Q" value: "query" }
input_map { key: "CTX" value: "ctx" }
output_map { key: "CTX2" value: "ctx2" } },
{ model_name: "llama3_70b" model_version: -1
input_map { key: "text_input" value: "ctx2" }
output_map { key: "text_output" value: "answer" } }
]
}
EOF
# C: MIG-isolated multi-tenant on H100 (one Triton process per MIG slice)
# nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C creates 7x 1g.10gb slices on an H100
# (profile ID 19 is 1g.10gb; ID 9 is the 3g.40gb profile, of which only two fit).
# A CUDA process can enumerate only one MIG device, so expose a single slice to each
# tenant's Triton container (for example NVIDIA_VISIBLE_DEVICES=MIG-<uuid>, with UUIDs
# taken from nvidia-smi -L). Inside that container the slice appears as GPU 0.
# The instance_group `profile` field selects TensorRT optimisation profiles, not MIG profiles.
mkdir -p models/tenant_a_llama8b/1
cat > models/tenant_a_llama8b/config.pbtxt <<'EOF'
name: "tenant_a_llama8b"
backend: "tensorrtllm"
max_batch_size: 16
instance_group [{ count: 1, kind: KIND_GPU, gpus: [0] }]
parameters: { key: "gpt_model_path" value: { string_value: "/models/tenant_a_llama8b/1" } }
EOF

If the workload is a single LLM with no companion models and the goal is the simplest path from HF repo id to an OpenAI-compatible endpoint, you do not need Triton: run vLLM or SGLang standalone. Reach for Triton when (a) you have multiple model types, (b) you need server-side ensembles, or (c) you need MIG-isolated multi-tenant serving on one box.
Sizing and capacity planning
Sizing for Triton is mostly sizing for the underlying backend — a Llama 3.1 70B served via the TensorRT-LLM backend sizes the same as a stand-alone TensorRT-LLM deployment; a CLIP encoder sizes the same as the same ONNX Runtime engine elsewhere. Triton itself adds a small overhead (typically <2GB GPU memory per backend loaded and <0.5ms per request at the dynamic batcher) and the instance-group count multiplies the per-replica memory footprint.
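The multiplication described above is simple enough to sketch. The helper below assumes a per-backend overhead of 2 GB (the upper bound quoted in this section) and uses made-up per-instance footprints purely for illustration; real footprints must be measured for each model.

```python
def estimate_gpu_memory_gb(models, backend_overhead_gb=2.0):
    """Sum instance footprints plus one overhead allowance per distinct backend.

    models: list of dicts with 'backend', 'count' and 'per_instance_gb'.
    """
    backends = {m["backend"] for m in models}
    instances = sum(m["count"] * m["per_instance_gb"] for m in models)
    return instances + backend_overhead_gb * len(backends)


# Hypothetical footprints for models sharing one GPU.
shared_gpu = [
    {"name": "clip_image", "backend": "onnxruntime", "count": 2, "per_instance_gb": 1.5},
    {"name": "ranker", "backend": "fil", "count": 1, "per_instance_gb": 0.5},
]
total_gb = estimate_gpu_memory_gb(shared_gpu)
print(total_gb)
```

Raising `count` on `clip_image` from 2 to 4 would add another 3 GB in this example, which is why instance-group tuning and memory headroom have to be planned together.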
The sizing table below is for the multi-model and ensemble patterns where Triton's footprint matters; pure single-LLM-engine sizing should be read from the TensorRT-LLM or vLLM entry directly. All throughput figures are mid-range observed values from InferenceBench v3 with the noted backend; treat as planning anchors.
| Workload | Mix | Recommended SKU | Backend split | Throughput | Notes |
|---|---|---|---|---|---|
| Multi-model serving | 1 LLM + 1 vision + 1 tabular | 8x H100 SXM5 | TRT-LLM (TP=4) + ONNX + FIL | see backends | Single box replaces 3 deployments. |
| RAG ensemble | tokenise+embed+retrieve+rerank+generate | 4x H100 SXM5 | Python + ONNX + TRT-LLM (TP=4) | 1,200-2,000 q/s | Zero client round-trips. |
| MIG multi-tenant | 7 tenants on 1 H100 | 1x H100 SXM5 80GB (7x 1g.10gb MIG) | TRT-LLM per slice | ~600-900 tok/s per slice | Hard isolation per tenant. |
| Vision-only fleet | CLIP ViT-L/14 + classifier | 1x L40S 48GB | ONNX Runtime + TensorRT | 8,000-14,000 img/s | Dynamic batcher dominant. |
| Forest models (XGBoost) | Tabular ranker / fraud score | 1x L4 24GB | FIL | 150,000-400,000 row/s | Response cache adds 2-5x. |
| Voice + transcript | Whisper + LLM | 1x H100 SXM5 | PyTorch + TRT-LLM (TP=1) | Real-time | model_transaction_policy decoupled. |
| Sovereign multi-model | 5 tenants on H200 MIG | 1x H200 141GB (for example 2x 2g.35gb + 3x 1g.18gb, which fits the 7 available compute slices) | TRT-LLM + ONNX per slice | Mixed | UK / EU sovereign tenancies. |
Limits and quotas
Triton enforces a small set of hard limits at the API and a larger set of per-model limits expressed in `config.pbtxt`. Operational ceilings come from the host OS, CUDA runtime and the underlying backend; the table below is the Triton-specific layer that operators tune most often.
| Limit | Default | Hard ceiling | How to raise |
|---|---|---|---|
| max_batch_size | model-defined | Backend-dependent | Set in config.pbtxt; 0 disables batching (LLMs). |
| instance_group count per GPU | 1 | Memory-bounded | Raise to increase per-GPU concurrency; watch memory. |
| preferred_batch_size | model-defined | max_batch_size | Tune for the cliff in latency vs throughput. |
| max_queue_delay_microseconds | 0 | Application-defined | Trade p99 latency for throughput; 1-5ms typical. |
| Concurrent backends loaded | Unlimited | GPU memory | Limited by sum of per-backend footprints. |
| Models loaded simultaneously | Repository size | Memory | Use --model-control-mode=explicit for cold-tier. |
| Response cache size | 0 (off) | --response-cache-byte-size | Set at process; per-model `response_cache { enable: true }`. |
| Request body size | Unlimited | HTTP server limit | Set --buffer-manager-thread-count and ingress timeouts. |
| Shared memory (Python/BLS) | /dev/shm | Container-defined | Mount >= 8GiB on multi-backend or TP>1 LLM deployments. |
| TP (within an LLM backend) | 1 | 8 (NVLink island) | TensorRT-LLM / vLLM backend flag; not Triton itself. |
| File descriptors | 1024 | ulimit | ulimit -n 65536 in container. |
| Backend version compatibility | Pinned per Triton release | — | Pin Triton container tag; do not mix manually. |
Observability
Triton exposes a Prometheus metrics endpoint at `/metrics` (default port 8002) with per-model request count, request duration, queue duration, compute duration, batch size distribution and pending requests, plus per-GPU DCGM-style memory and utilisation metrics when DCGM exporter is co-deployed. The TensorRT-LLM and vLLM backends layer their own engine-level metrics on top (`nv_trt_llm_*` and `vllm:*` respectively), alongside Triton's core `nv_inference_*` counters. Tracing uses OpenTelemetry via `--trace-config` and emits one span per request, with sub-spans per ensemble step.
The metrics worth alerting on in production are queue-vs-compute time ratio (saturation), per-model success rate, GPU memory headroom, and any spike in `nv_inference_pending_request_count`. The following Prometheus rules cover the common failure modes.
- nv_inference_count / nv_inference_exec_count — total and per-batch request counters.
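- nv_inference_request_duration_us: cumulative end-to-end request time per model; divide its rate by the request-count rate for mean latency (percentiles need Triton's optional summary latency metrics).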
- nv_inference_queue_duration_us — time waiting in the dynamic batcher; should be small fraction of request_duration.
- nv_inference_compute_input_duration_us / compute_infer / compute_output — per-stage timing for kernel-level debugging.
- nv_inference_pending_request_count — backlog depth per model.
- nv_inference_request_success / nv_inference_request_failure — success counters; alert on failure rate.
- nv_gpu_utilization, nv_gpu_memory_used_bytes — built-in per-GPU metrics (DCGM gives more).
- Pair with DCGM_FI_DEV_GPU_UTIL / MEM_COPY_UTIL to disambiguate compute vs memory vs idle.
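Because the duration metrics listed above are cumulative counters, latency figures come from the difference between two scrapes. The sketch below parses two hypothetical `/metrics` snapshots (the numbers are invented) and derives mean latency and the queue share for one model; the function names are illustrative.

```python
def parse_counters(text, model):
    """Extract {metric_name: value} for one model from Prometheus text output."""
    values = {}
    for line in text.splitlines():
        if line.startswith("#") or f'model="{model}"' not in line:
            continue
        name = line.split("{", 1)[0]
        values[name] = float(line.rsplit(" ", 1)[1])
    return values


def window_stats(before, after):
    """Mean latency (ms) and queue share over the interval between two scrapes."""
    delta = {name: after[name] - before[name] for name in after}
    requests = delta["nv_inference_request_success"]
    if requests <= 0:
        return None  # no completed requests in the window
    duration = delta["nv_inference_request_duration_us"]
    return {
        "mean_latency_ms": duration / requests / 1000,
        "queue_share": delta["nv_inference_queue_duration_us"] / duration,
    }


BEFORE = """\
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="clip_image",version="1"} 1000
nv_inference_request_duration_us{model="clip_image",version="1"} 20000000
nv_inference_queue_duration_us{model="clip_image",version="1"} 4000000
"""
AFTER = """\
nv_inference_request_success{model="clip_image",version="1"} 1500
nv_inference_request_duration_us{model="clip_image",version="1"} 32000000
nv_inference_queue_duration_us{model="clip_image",version="1"} 8800000
"""
stats = window_stats(parse_counters(BEFORE, "clip_image"), parse_counters(AFTER, "clip_image"))
print(stats)
```

A queue share of 0.4, as in this invented window, is the saturation threshold used for alerting in this section.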
# Prometheus rules for a Triton deployment
groups:
  - name: triton-sla
    interval: 30s
    rules:
      # nv_inference_request_duration_us is a cumulative counter, not a histogram,
      # so this rule alerts on the 5-minute mean latency rather than a percentile.
      - alert: TritonModelHighLatency
        expr: sum by (model) (rate(nv_inference_request_duration_us[5m])) /
          sum by (model) (rate(nv_inference_request_success[5m])) > 500000
        for: 5m
        labels: { severity: warning, team: inference }
        annotations:
          summary: "Triton mean latency > 500ms on {{ $labels.model }}"
      - alert: TritonQueueSaturation
        expr: sum by (model) (rate(nv_inference_queue_duration_us[5m])) /
          sum by (model) (rate(nv_inference_request_duration_us[5m])) > 0.4
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Triton queue time > 40% of request time on {{ $labels.model }}; add replicas"
      - alert: TritonRequestFailures
        expr: sum by (model) (rate(nv_inference_request_failure[5m])) > 0.01 *
          sum by (model) (rate(nv_inference_request_success[5m]))
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Triton failure rate above 1% on {{ $labels.model }}"
      - alert: TritonGPUMemoryNearCap
        expr: nv_gpu_memory_used_bytes / nv_gpu_memory_total_bytes > 0.95
        for: 10m
        labels: { severity: warning }
      - alert: TritonModelUnloaded
        expr: changes(nv_inference_count[10m]) == 0 and
          nv_inference_pending_request_count > 0
        for: 15m
        labels: { severity: critical }
        annotations:
          summary: "Model {{ $labels.model }} has pending requests but no executions; check load state"

Cost and FinOps
Triton's direct cost impact comes from three levers: instance-group multiplication (each replica multiplies GPU memory), backend selection (TensorRT-LLM compiled engine vs vLLM PyTorch vs raw ONNX has materially different $/M tokens), and consolidation savings (one Triton box replaces three or four FastAPI services). For LLM workloads, $/M tokens follows the TensorRT-LLM or vLLM cost model directly; the Triton overhead is ~1-3% on throughput and effectively zero on per-token cost.
Consolidation is the under-counted FinOps win. A team running CLIP, BERT and Llama 3.1 8B as three separate FastAPI services on three single-GPU nodes typically pays for 3x the underused capacity, plus three sets of ops on-call. The same workload on a single Triton box with three instance groups runs at higher aggregate utilisation and removes two of the three services from the on-call rota.
- Backend choice dominates per-model $/M tokens: pick the backend with the best $/M tokens for that model (TensorRT-LLM for stable LLMs, vLLM for fast-rotating LLMs, ONNX Runtime for small models).
- Response cache turns identical requests into microsecond returns — for retrieval and tabular models this can lift effective throughput 2-5x.
- Dynamic batcher tuning (`max_queue_delay_microseconds`) is a direct latency-throughput tradeoff worth ~30% throughput at the cost of 1-5ms p99 latency.
- FOCUS-conformant billing exports from Yobitel tag each request with `triton_model` and `triton_backend` so $/M tokens can be sliced by model and backend.
| Pattern | Before (FastAPI per model) | After (Triton multi-model) | Saving |
|---|---|---|---|
| 3 small models on 3 L4 nodes | $1,800/month per node = $5,400 | 1x L40S node at $1,400 | ~74% |
| Vision + LLM + tabular on 3 H100 | 3x $2,300/month spot = $6,900 | 1x H100 SXM5 spot = $2,500 | ~64% |
| LLM + RAG embed/retrieve/rerank pipeline | 5 services on shared GPU | 1 Triton ensemble | 1 deployment, 2-3x lower latency |
| MIG-partitioned 7 tenants | 7x dedicated 8B endpoints | 1x H100 with 7x MIG slices on Triton | ~80% on dedicated GPU spend |
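The percentage savings in the first two rows of the table follow directly from the monthly figures. A short check, using only the numbers quoted above:

```python
def saving_pct(before_usd, after_usd):
    """Percentage saved when monthly spend drops from before_usd to after_usd."""
    return round(100 * (before_usd - after_usd) / before_usd)


# Row 1: three L4 nodes at $1,800 each versus one L40S node at $1,400.
small_models = saving_pct(3 * 1800, 1400)
# Row 2: three H100 spot nodes at $2,300 each versus one H100 SXM5 spot node at $2,500.
mixed_h100 = saving_pct(3 * 2300, 2500)
print(small_models, mixed_h100)
```

The same helper can be applied to your own node prices, since the ratios matter more than the specific list prices used here.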
Security and compliance
Triton's built-in transport security is limited: the gRPC endpoint supports TLS and mutual TLS via the `--grpc-use-ssl*` and `--grpc-server-cert` family of flags, and `--http-restricted-api` / `--grpc-restricted-protocol` can gate admin API groups behind a shared header key, but the HTTP endpoint has no native TLS. Production deployments therefore terminate TLS at an ingress (Envoy, NGINX, AWS ALB) and apply signed-JWT or mTLS at that layer. The model repository should be mounted read-only in production; combined with `--exit-on-error` and a CI-driven repository deploy, this makes the running set of models a deterministic CI artefact rather than a mutable directory.
Multi-tenant isolation has two patterns. MIG slices give hard memory and SM isolation on H100 / H200 / B200: each tenant's Triton process is bound to a dedicated MIG device (a CUDA process can see only one), so tenants cannot observe each other's memory or compute. Soft isolation via separate Triton instances on whole GPUs behind a router (one Triton per tenant) is simpler but uses more capacity. The `rate_limiter` block applies fair queuing across models on the same instance.
Regulatory posture follows the deployment environment rather than Triton itself. For UK public-sector workloads, deploy Triton on Yobitel sovereign tenancies satisfying NCSC Cloud Security Principles and G-Cloud 14, with MIG-isolated tenants and read-only repositories. For EU GDPR, the server processes request and response data only in volatile memory and the on-disk scratch path; encrypt ephemeral storage. For US HIPAA, run inside a BAA-covered VPC; for FedRAMP, use FIPS 140-validated cryptographic modules and approved cipher suites at the ingress and host layer.
Migration and alternatives
Most production migrations to Triton come from one of three origins: raw FastAPI / Flask per model (the most common, biggest operational win), KServe with a custom predictor (when you need ensembles or multi-backend more than autoscaling), or hand-built C++ inference servers (a rare niche from earlier deployments). The decision matrix is straightforward: if you have one model and one engine, stay with the stand-alone engine server (vLLM, SGLang, TensorRT-LLM behind a thin frontend); if you have multiple model types, ensembles, or MIG-partitioned multi-tenancy, Triton is the right level of abstraction.
| From | Migration effort | Operational change | Notes |
|---|---|---|---|
| FastAPI per model (3-5 services) | Medium — write config.pbtxt per model | 1 deployment instead of N; one set of metrics | Biggest operational simplification. |
| KServe + custom predictor | Low — KServe can use Triton as runtime | Keep KServe autoscaling, gain ensembles and multi-backend | InferenceService runtime: triton. |
| Hand-built C++ server | High — port to Triton backend API | Lose custom code; gain hardening and metrics | Worth it for any non-research deployment. |
| Stand-alone vLLM / SGLang | Low — wrap with vllm_backend | Gain multi-model surface; lose engine-native API quirks | Only do this if you also need other model types. |
| NVIDIA TensorRT-LLM stand-alone | Low — wrap with tensorrtllm_backend | Gain hardened HTTP/gRPC + metrics + model versioning | Recommended production pattern for TRT-LLM. |
| Seldon Core / BentoML | Medium — re-express as config.pbtxt | Lose framework-specific deploy ergonomics; gain throughput | Worth it at scale; less so for one model. |
# KServe InferenceService using Triton as runtime (the recommended production pattern)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: llama3-70b-trtllm }
spec:
  predictor:
    triton:
      storageUri: "s3://ml-platform/models/llama3-70b-trtllm/"
      runtimeVersion: "25.06-py3"
      resources:
        limits:
          nvidia.com/gpu: "4"
          memory: "120Gi"
          cpu: "16"
      env:
        - { name: TRITON_MODEL_CONTROL_MODE, value: "explicit" }
      args:
        - --strict-readiness=true
        - --exit-on-error=true
        - --metrics-port=8002
      ports:
        - { containerPort: 8000, name: http }
        - { containerPort: 8001, name: grpc }
        - { containerPort: 8002, name: metrics }

KServe's `runtime: triton` is the production-grade default for multi-backend model serving on Kubernetes. You gain Knative-style scale-to-zero, transformer/predictor split, and canary routing on top of Triton's multi-model surface.
Troubleshooting
The error table below covers the failure modes that account for roughly 80% of production Triton incidents observed on Yobitel-operated fleets. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix.
| Symptom / Error | Cause | Fix |
|---|---|---|
| Model fails to load: 'failed to find backend' | Backend shared object missing from container. | Use the matching `-py3` tag; mount additional backends with --backend-directory. |
| NCCL hang during LLM model load | /dev/shm too small or NVLink P2P disabled. | Mount /dev/shm >= 8GiB; verify nvidia-smi nvlink --status. |
| High queue time, low compute time | Instance-group count too low for incoming RPS. | Raise instance_group count or add a second Triton replica. |
| LLM responses arrive only after final token | model_transaction_policy not set to decoupled. | Add `model_transaction_policy { decoupled: true }` to config.pbtxt. |
| Ensemble step times out | An interior step has lower max_batch_size than the ensemble batch. | Align max_batch_size across the chain or lower the ensemble batch. |
| GPU memory creeps after hot-load/unload cycles | Backend leaking memory across versions. | Pin to the latest backend; restart Triton on a rolling schedule until fixed. |
| Triton readiness fails with --strict-readiness | One model failed to load. | Check the failing model in `tritonserver --log-verbose=1` startup logs. |
| MIG slice not visible to Triton | GPU mode not switched to MIG, or the wrong MIG device exposed to the container. | nvidia-smi mig -lgi to list slices and nvidia-smi -L for device UUIDs; expose exactly one slice per Triton process. |
| Response cache shows zero hits | Per-model `response_cache { enable: true }` not set, or `--response-cache-byte-size` is 0. | Set both at process and per-model. |
| High p99 spikes correlated with GC | Python backend running heavy per-request allocations. | Use shared-memory tensors; switch Python BLS to C++ where possible. |
| Backend version mismatch error | Manually swapping a backend .so under a non-matching Triton. | Use the official NGC container; do not mix backend versions. |
| KFServing v2 client sends wrong dtype | Schema in config.pbtxt expects FP16, client sends FP32. | Align client to schema; or add a Python preprocess model that casts. |
Where this fits in the Yobitel stack
Triton Inference Server is the production-grade serving layer that sits between Yobitel's GPU Cloud and the runtime engines (vLLM, TensorRT-LLM, ONNX Runtime, FIL). When Yobibyte customers deploy a multi-model workload — an LLM plus a vision encoder plus a tabular ranker — or a server-side RAG ensemble, Triton is the runtime that hosts it. Customers do not normally interact with the `config.pbtxt` directly; the Yobibyte console generates the repository from a higher-level workspace definition, but the engine underneath is Triton with the appropriate backend.
For Yobitel's MIG-isolated multi-tenant offerings, Triton is the layer that serves each tenant's models from a dedicated MIG slice on shared H100 / H200 / B200 hardware. The hard memory and SM isolation that MIG provides at the silicon level is exposed to tenants by binding each tenant's Triton process to a single MIG device, so a tenant's models can only execute on their own slice, and Triton's rate limiter ensures fair queuing across the models within each instance.
For UK and EU sovereign workloads, Triton runs on the Yobitel London-1 and Frankfurt-1 regions inside tenancies that satisfy NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat. The combination of a hardened open-source server (BSD 3-Clause), sovereign hardware, MIG-isolated tenancy and transparent benchmark scoring on InferenceBench is what lets Yobitel customers run multi-model production workloads in regulated environments behind a single API endpoint.
References
- Triton Inference Server on GitHub · GitHub (NVIDIA)
- Triton Inference Server Documentation · NVIDIA
- TensorRT-LLM Backend for Triton · GitHub
- vLLM Backend for Triton · GitHub
- Python Backend (PyTriton) · GitHub
- KServe — Triton Runtime · KServe
- NVIDIA Multi-Instance GPU (MIG) User Guide · NVIDIA