A/B and Canary Deployment

TL;DR

Standard release patterns adapted to LLM serving — route a percentage of live traffic to a new model or runtime version while keeping the rest on the stable one.
Canary: a small share of traffic (typically 1-10%); the goal is to detect regressions before wide release.
A/B: traffic split into roughly equal cohorts; goal is to measure a difference between two variants.
Implemented at the serving layer (Triton model versions, KServe canary traffic, Istio header-based routing) and observed via per-version metrics.

Overview

Updating an LLM in production is risky. A new model checkpoint, a new quantisation, a new runtime build or a tweaked system prompt can all subtly change response distributions in ways that unit tests miss. Canary and A/B deployment patterns let teams put new versions in front of real traffic before committing to them fleet-wide.

The two patterns share the same plumbing but differ in intent. A canary deployment seeks to catch regressions — small slice, watch for problems, roll back fast. An A/B test seeks to measure a difference — equal-ish slices, run long enough to gather statistical signal, decide which variant to keep.
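To make "statistical signal" concrete, here is a minimal sketch of a two-proportion z-test on a binary per-response metric such as thumbs-up rate. The function name, the sample counts and the choice of a pooled z-test are illustrative assumptions rather than a method the text prescribes; a real experiment may need a different test or a correction for repeated peeking.

```python
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two rates.

    Variant A is the stable version, variant B the candidate (assumed naming).
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both variants behave the same.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so nothing to distinguish
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 400 of 5000 thumbs-up on A, 460 of 5000 on B.
z, p = two_proportion_z(400, 5000, 460, 5000)
```

With small cohorts the same observed gap would not reach significance, which is why an A/B split usually runs longer and wider than a canary.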

Routing Mechanisms

Triton model versions: the `version_policy` field in `config.pbtxt` accepts `latest`, `specific` and `all`. Loading the stable and candidate versions side by side lets a client or gateway address each one explicitly; Triton does not split traffic by percentage itself, so the split is decided upstream.
KServe canary traffic split: the `canaryTrafficPercent` field on the `InferenceService` sets percent-based traffic between the previous and latest revisions.
Istio / Envoy header-based routing: for header-driven cohorts (e.g. internal users, beta opt-ins).
Application-layer routing: the API gateway hashes a stable user identifier and routes deterministically per user.
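The application-layer option can be sketched as a hash-and-threshold function. This is a minimal illustration, not a specific gateway's API: the names `bucket` and `choose_version`, the salt value and the basis-point granularity are all assumptions.

```python
import hashlib

BUCKETS = 10_000  # basis points, so 0.01% granularity (assumed)

def bucket(user_id: str, salt: str) -> int:
    """Map a stable identifier to a bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS

def choose_version(user_id: str, canary_percent: float,
                   salt: str = "rollout-1") -> str:
    """Deterministically route a user to 'canary' or 'stable'."""
    threshold = round(canary_percent * 100)  # percent to basis points
    return "canary" if bucket(user_id, salt) < threshold else "stable"

# The same user always gets the same answer for a given salt and percentage.
version = choose_version("user-42", canary_percent=5.0)
```

Changing the salt per rollout reshuffles which users land in the canary cohort, so the same people are not exposed to every experiment.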

Metrics That Matter

LLM canaries cannot rely on HTTP status codes alone, because a regressed model still returns 200 OK. Useful signals include per-version latency percentiles, token throughput, refusal rate, response length distribution, downstream tool-call frequency, user feedback signals (thumbs, retries, escalations) and qualitative LLM-as-judge evaluations comparing canary outputs to baseline.

For chat applications, a watchdog evaluator running a small reference set every few minutes against both versions catches large regressions within the canary window.
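One pass of such a watchdog might look like the sketch below. Everything here is hypothetical scaffolding: a "model" is any callable from prompt to text, the scorer is a toy substring match standing in for a real evaluator, and the 0.10 tolerance is an arbitrary assumption.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]           # prompt -> response text (assumed shape)
Scorer = Callable[[str, str], float]   # (response, expected) -> score in [0, 1]

def watchdog_pass(reference_set: Sequence[Tuple[str, str]],
                  stable: Model, canary: Model, score: Scorer,
                  max_drop: float = 0.10) -> dict:
    """Run the reference set against both versions and compare mean scores."""
    stable_score = mean(score(stable(p), exp) for p, exp in reference_set)
    canary_score = mean(score(canary(p), exp) for p, exp in reference_set)
    return {
        "stable": stable_score,
        "canary": canary_score,
        # Flag only when the canary trails the baseline by more than max_drop.
        "regressed": stable_score - canary_score > max_drop,
    }

def contains_expected(response: str, expected: str) -> float:
    """Toy scorer: 1.0 if the expected answer appears in the response."""
    return 1.0 if expected.lower() in response.lower() else 0.0

# Stub models standing in for calls to the two serving endpoints.
reference = [("Capital of France?", "Paris"), ("2 + 2 =", "4")]
good = lambda p: {"Capital of France?": "Paris.", "2 + 2 =": "4"}[p]
refusing = lambda p: "I cannot help with that."

report = watchdog_pass(reference, stable=good, canary=refusing,
                       score=contains_expected)
```

In a scheduler this function would run every few minutes, and a `regressed` result would page someone or trigger the rollback path.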

Treat 'response length grew 30 percent overnight' as a regression signal. It usually indicates the canary model is more verbose or less able to follow brevity instructions; users notice.
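A length check of this kind can be as simple as comparing medians between the two versions. The sketch below assumes lengths are measured in tokens and reuses the 30 percent figure from the text as the alert threshold; the function names and sample values are illustrative.

```python
from statistics import median
from typing import Sequence

def length_growth(baseline: Sequence[int], canary: Sequence[int]) -> float:
    """Relative change in median response length (0.30 means +30%)."""
    return median(canary) / median(baseline) - 1.0

def is_length_regression(baseline: Sequence[int], canary: Sequence[int],
                         threshold: float = 0.30) -> bool:
    """Flag the canary when its median length grew by at least the threshold."""
    return length_growth(baseline, canary) >= threshold

# Hypothetical token counts sampled from each version's responses.
baseline_tokens = [100, 120, 110]
canary_tokens = [150, 160, 140]
flagged = is_length_regression(baseline_tokens, canary_tokens)
```

The median is used rather than the mean so that a handful of very long responses does not trip the alert on its own.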

Stateful Considerations

Multi-turn conversations complicate canary routing. If a user's first message is served by the canary and the follow-up by the stable version, the resulting context mismatch can cause incoherent outputs. The fix is sticky routing: hash on a conversation identifier and keep every turn on the same version.
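A sketch of sticky routing follows. It hashes the conversation identifier on the first turn and then records the decision, so that later changes to the canary percentage do not move a conversation that is already in flight. The class name and the in-memory dictionary are assumptions for illustration; a real deployment would likely keep the pins in a shared store.

```python
import hashlib
from typing import Dict

class StickyRouter:
    """Pin each conversation to the version chosen on its first turn."""

    def __init__(self, canary_percent: int) -> None:
        self.canary_percent = canary_percent
        self._pins: Dict[str, str] = {}  # conversation_id -> version

    def _assign(self, conversation_id: str) -> str:
        digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % 100
        return "canary" if bucket < self.canary_percent else "stable"

    def route(self, conversation_id: str) -> str:
        version = self._pins.get(conversation_id)
        if version is None:
            # First turn: decide once, then remember the decision.
            version = self._assign(conversation_id)
            self._pins[conversation_id] = version
        return version

router = StickyRouter(canary_percent=50)
first_turn = router.route("conv-123")
router.canary_percent = 0           # new conversations now all go to stable
second_turn = router.route("conv-123")  # existing conversation stays put
```

Whether an in-flight conversation should stay on the canary after a rollback is a policy choice; if it should not, the pin table can be cleared as part of the rollback.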

Rollback

A canary deployment is only as good as its rollback path. A working canary pipeline can revert traffic to the stable version in under a minute through a routing rule change alone. The canary infrastructure does not need to be scaled down, since it holds capacity for the next attempt.
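The separation between routing and capacity can be illustrated with a toy model in which rollback touches only the traffic weights. `TrafficSplit` and the replica counts are hypothetical and do not correspond to any particular serving platform's API.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TrafficSplit:
    """Hypothetical routing rule: percent of traffic per version."""
    weights: Dict[str, int] = field(
        default_factory=lambda: {"stable": 100, "canary": 0})

    def set_canary(self, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.weights = {"stable": 100 - percent, "canary": percent}

    def rollback(self) -> None:
        """Send all traffic back to stable; capacity is left untouched."""
        self.set_canary(0)

split = TrafficSplit()
replicas = {"stable": 8, "canary": 2}  # hypothetical replica counts

split.set_canary(5)   # start the canary at 5%
split.rollback()      # regression detected: one rule change, no scaling
```

After the rollback the canary replicas are still running, so the next attempt only needs another call to `set_canary`.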

References

KServe Canary Rollout Documentation · KServe
Triton Model Repository Documentation · NVIDIA
Istio Traffic Management · Istio
