TL;DR
- Standard release patterns adapted to LLM serving — route a percentage of live traffic to a new model or runtime version while keeping the rest on the stable one.
- Canary: a small share of traffic (1-10%); the goal is to detect regressions before wide release.
- A/B: traffic split into roughly equal cohorts; the goal is to measure a difference between two variants.
- Implemented at the serving layer (Triton model versions, KServe canary traffic, Istio header-based routing) and observed via per-version metrics.
Overview
Updating an LLM in production is risky. A new model checkpoint, a new quantisation, a new runtime build or a tweaked system prompt can all subtly change response distributions in ways that unit tests miss. Canary and A/B deployment patterns let teams put new versions in front of real traffic before committing to them fleet-wide.
The two patterns share the same plumbing but differ in intent. A canary deployment seeks to catch regressions — small slice, watch for problems, roll back fast. An A/B test seeks to measure a difference — equal-ish slices, run long enough to gather statistical signal, decide which variant to keep.
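To make "statistical signal" concrete, one common choice is a two-proportion z-test on a binary per-response outcome such as a thumbs-up. That choice is an assumption here, not something this article prescribes, and the function name and sample counts below are purely illustrative. A minimal standard-library sketch:

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        # All outcomes identical in both cohorts: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: thumbs-up responses out of rated responses per variant.
z, p = two_proportion_z_test(400, 1000, 460, 1000)
print(f"z={z:.2f}, p={p:.4f}")
```

A small p-value suggests the observed gap is unlikely to be noise. How long the test must run depends on the traffic volume and on the smallest difference worth detecting.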
Routing Mechanisms
- Triton model versions: the `version_policy` setting in `config.pbtxt` accepts `latest`, `specific` or `all`, which controls the versions Triton loads and serves side by side. Triton does not split traffic by percentage on its own; a client or upstream router chooses the version for each request.
- KServe canary traffic split: the built-in `canaryTrafficPercent` field on the `InferenceService` sets the percentage of traffic sent to the newest revision.
- Istio / Envoy header-based routing: matches on request headers to send defined cohorts (e.g. internal users, beta opt-ins) to the new version.
- Application-layer routing: the API gateway hashes a stable user identifier and routes deterministically per user.
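The application-layer option can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names `bucket_for` and `pick_version`, the version labels and the 10,000-bucket granularity are all assumptions made for the example.

```python
import hashlib


def bucket_for(identifier: str, buckets: int = 10_000) -> int:
    """Map an identifier to a stable bucket in [0, buckets)."""
    # SHA-256 is stable across processes and restarts; Python's built-in
    # hash() is salted per process, so it would not give stable routing.
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def pick_version(user_id: str, canary_percent: float) -> str:
    """Return "canary" for roughly canary_percent of users, else "stable"."""
    # 10,000 buckets give a resolution of 0.01 percentage points.
    threshold = round(canary_percent * 100)
    return "canary" if bucket_for(user_id) < threshold else "stable"


# The same user always lands on the same version for a given percentage.
print(pick_version("user-1234", canary_percent=5.0))
```

Because the threshold only moves upward when the percentage is raised, users already on the canary stay there during a ramp-up; users between the old and new thresholds switch over.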
Metrics That Matter
LLM canaries cannot rely on HTTP status codes alone: a regressed model still returns 200 OK. Useful signals include per-version latency percentiles, token throughput, refusal rate, response length distribution, downstream tool-call frequency, user feedback signals (thumbs up/down, retries, escalations) and qualitative LLM-as-judge evaluations that compare canary outputs to the baseline.
For chat applications, a watchdog evaluator running a small reference set every few minutes against both versions catches large regressions within the canary window.
Treat 'response length grew 30 percent overnight' as a regression signal. It usually indicates the canary model is more verbose or less able to follow brevity instructions; users notice.
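A response-length check along these lines might look like the sketch below. It assumes lengths are measured in tokens per response and compares medians, which are less sensitive to a few very long outputs than means. The function names are hypothetical, and the 0.30 default simply mirrors the 30 percent figure above.

```python
from statistics import median


def length_growth(baseline_lengths, canary_lengths):
    """Relative change in median response length, canary vs. baseline."""
    base = median(baseline_lengths)
    if base == 0:
        raise ValueError("baseline median length is zero")
    return (median(canary_lengths) - base) / base


def is_length_regression(baseline_lengths, canary_lengths, threshold=0.30):
    """Flag the canary when its median length grew by at least `threshold`."""
    return length_growth(baseline_lengths, canary_lengths) >= threshold


# Hypothetical token counts sampled from each version over the same window.
baseline = [100, 120, 110]
canary = [150, 160, 140]
print(is_length_regression(baseline, canary))
```

Comparing both versions over the same time window keeps shifts in the traffic mix from being mistaken for a model change.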
Stateful Considerations
Multi-turn conversations complicate canary routing. If a user's first message is served by the canary and the follow-up by the stable version, the resulting context mismatch can cause incoherent outputs. The fix is sticky routing: hash on a conversation identifier and keep every turn on the same version.
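Sticky routing can be sketched as follows. One caveat is worth stating: hashing alone keeps a conversation on one version only while the canary percentage is unchanged, so this sketch also records the first assignment. The `StickyRouter` class, the in-memory dictionary (a shared store would be needed across gateway replicas) and the `rollback` behaviour are all assumptions made for illustration.

```python
import hashlib


class StickyRouter:
    """Pins each conversation to the version chosen on its first turn."""

    def __init__(self, canary_percent: int):
        self.canary_percent = canary_percent
        # conversation_id -> version. In-memory for the sketch only.
        self._pinned = {}

    def route(self, conversation_id: str) -> str:
        version = self._pinned.get(conversation_id)
        if version is None:
            # First turn: assign by hashing the conversation identifier.
            digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:8], "big") % 100
            version = "canary" if bucket < self.canary_percent else "stable"
            self._pinned[conversation_id] = version
        return version

    def rollback(self) -> None:
        """Send everything to stable, including conversations already pinned."""
        self.canary_percent = 0
        self._pinned.clear()


router = StickyRouter(canary_percent=10)
first_turn = router.route("conv-42")
router.canary_percent = 50           # ramping up mid-conversation
assert router.route("conv-42") == first_turn
```

Clearing the pins on rollback accepts one context switch per conversation in exchange for getting every user off a bad version immediately. Whether that trade is right depends on the application.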
Rollback
A canary deployment is only as good as its rollback path. A working canary pipeline can revert traffic to the stable version in under a minute via a routing rule change, with no need to scale down the canary infrastructure (which holds capacity for the next attempt).
References
- KServe Canary Rollout Documentation · KServe
- Triton Model Repository Documentation · NVIDIA
- Istio Traffic Management · Istio