TL;DR
- Release pattern where a new model receives a copy of production traffic but its responses are discarded.
- Used to compare latency, throughput and qualitative output of a candidate against the live model under realistic load.
- Carries no direct user risk (responses are never returned) as long as side effects are sandboxed, but roughly doubles inference cost during the shadow window when all traffic is mirrored.
- Commonly paired with LLM-as-judge or human evaluation of shadow outputs versus live responses.
Overview
Shadow deployment, sometimes called mirror traffic or dark launching, duplicates inbound requests so that the live model produces the user-visible response while a candidate model runs inference on the same request in parallel. The candidate's response is logged, evaluated, then discarded. From the user's perspective nothing changes.
The pattern is older than LLMs and is borrowed from microservice deployment practice. For LLM serving it has particular value: response quality is hard to evaluate offline, and shadow runs provide realistic traffic to compare candidates against.
When to Use
- Validating a new model checkpoint before any user-facing exposure.
- Comparing two prompt or system-prompt variants under live load.
- Benchmarking a new runtime (e.g. moving from vLLM to TensorRT-LLM) for latency and cost under the real query distribution.
- Producing pairwise samples for LLM-as-judge evaluation.
Implementation
At the service-mesh layer, Envoy supports request mirroring through the `request_mirror_policies` field of a route. The mirrored request is fire-and-forget: Envoy does not wait for the mirror cluster's response and never returns it to the client. Requests can also be forked at the application layer, in the gateway itself, which is more flexible (the gateway can log paired responses) at the cost of higher complexity.
Care is needed with side effects. If the candidate model triggers tool calls or writes state, those side effects must be sandboxed or stubbed; otherwise the shadow run will mutate production data twice.
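One way to stub side effects is to route every tool call through a dispatcher that knows whether it is serving the shadow path. The sketch below is a minimal illustration under that assumption; `ToolRegistry`, `create_order` and the `{"status": "stubbed"}` return value are hypothetical names, not part of any framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToolRegistry:
    """Dispatches tool calls; in shadow mode it records them instead of running them."""
    tools: Dict[str, Callable[..., Any]]
    shadow: bool = False
    recorded: List[dict] = field(default_factory=list)

    def call(self, name: str, **kwargs: Any) -> Any:
        if self.shadow:
            # Record what the candidate wanted to do, but leave real state untouched.
            self.recorded.append({"tool": name, "args": kwargs})
            return {"status": "stubbed"}
        return self.tools[name](**kwargs)


# Usage: the same tool table, one live registry and one shadow registry.
orders: List[str] = []


def create_order(item: str) -> dict:
    orders.append(item)  # stands in for a write to production state
    return {"status": "created"}


live = ToolRegistry({"create_order": create_order})
shadow = ToolRegistry({"create_order": create_order}, shadow=True)

live.call("create_order", item="book")
shadow.call("create_order", item="book")
print(orders)           # ['book']: only the live call wrote state
print(shadow.recorded)  # the candidate's intended call, kept for evaluation
```

A stubbed result can steer the candidate's later turns differently from the live model's, so multi-step tool traces from the shadow path are best compared with that caveat in mind.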

    # Envoy VirtualHost with mirrored traffic
    routes:
    - match:
        prefix: "/v1/chat/completions"
      route:
        cluster: llm-stable
        request_mirror_policies:
        - cluster: llm-candidate
          runtime_fraction:
            default_value:
              numerator: 100
              denominator: HUNDRED

Evaluation
Shadow outputs are most useful when paired with the corresponding live output. A simple offline pipeline writes paired records to a log, then runs an LLM-as-judge comparison (or human review for sensitive workloads) to assess relative quality.
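The offline comparison step could be sketched as follows. It assumes each paired record is a dict with `prompt`, `live` and `shadow` keys, and that the judge is a callable returning `"A"`, `"B"` or `"tie"`; both are assumptions made for illustration, and the judge itself (an LLM call or a human review queue) is left abstract. Answer order is randomised per record, a common precaution against a judge that favours whichever answer it sees first.

```python
import random
from typing import Callable, Iterable

# Hypothetical judge signature: (prompt, answer_a, answer_b) -> "A", "B" or "tie".
Judge = Callable[[str, str, str], str]


def judge_pairs(records: Iterable[dict], judge: Judge, seed: int = 0) -> dict:
    """Tally judge verdicts over paired records, randomising answer order."""
    rng = random.Random(seed)
    tally = {"live": 0, "shadow": 0, "tie": 0, "skipped": 0}
    for rec in records:
        if rec.get("shadow") is None:  # candidate failed or timed out
            tally["skipped"] += 1
            continue
        # Decide at random which model is shown to the judge as answer A.
        shadow_first = rng.random() < 0.5
        if shadow_first:
            a, b = rec["shadow"], rec["live"]
        else:
            a, b = rec["live"], rec["shadow"]
        verdict = judge(rec["prompt"], a, b)
        if verdict not in ("A", "B"):
            tally["tie"] += 1  # explicit ties and unparseable verdicts
        elif (verdict == "A") == shadow_first:
            tally["shadow"] += 1
        else:
            tally["live"] += 1
    return tally
```

The returned tally gives a win/loss/tie count for the candidate against the live model, plus the number of records that could not be compared.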
Latency and throughput comparisons are direct: measure tail latencies, token throughput and resource cost on the candidate fleet under the same load as the stable fleet.
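As a rough illustration of the tail-latency comparison, the sketch below computes nearest-rank percentiles over two lists of per-request latencies. The function names and the choice of p50/p95/p99 are assumptions; a production setup would more likely read these figures from its metrics system.

```python
import math
from typing import List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compare_latency(stable: List[float], candidate: List[float]) -> dict:
    """Report p50/p95/p99 for both fleets and the candidate-to-stable ratio."""
    report = {}
    for pct in (50, 95, 99):
        s = percentile(stable, pct)
        c = percentile(candidate, pct)
        report[f"p{pct}"] = {
            "stable": s,
            "candidate": c,
            "ratio": c / s if s > 0 else None,
        }
    return report
```

A ratio above 1 at p99 with a ratio near 1 at p50 would indicate that the candidate regresses mainly in the tail, which a mean-only comparison would hide.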
Shadow deployment doubles inference cost for the duration of the window. Budget for it and time-box the experiment.
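A back-of-the-envelope budget can be sketched as below. It assumes the candidate costs the same per request as the stable model and that traffic is steady over the window; `shadow_cost` and its parameters are hypothetical names, and any figures passed in are the caller's own estimates.

```python
def shadow_cost(requests_per_hour: float, cost_per_request: float,
                mirror_fraction: float, hours: float) -> float:
    """Extra inference spend for the shadow window, on top of normal serving cost."""
    if not 0.0 <= mirror_fraction <= 1.0:
        raise ValueError("mirror_fraction must be between 0 and 1")
    if min(requests_per_hour, cost_per_request, hours) < 0:
        raise ValueError("rates, costs and durations must be non-negative")
    return requests_per_hour * cost_per_request * mirror_fraction * hours
```

Mirroring a fraction of traffic instead of all of it scales the extra spend linearly, which is one lever for keeping a longer shadow window within budget.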
Limitations
Shadow deployment only catches problems visible from observed traffic. Bugs that surface in user feedback loops (frustration, abandonment, retry behaviour) require canary deployment to detect. The two patterns are complementary: shadow first to validate, then canary for a small live-user sample, then full rollout.
References
- Envoy Request Mirroring Documentation · Envoy Proxy
- Istio Traffic Mirroring · Istio
- Google SRE Workbook: Canarying Releases · Google SRE