TL;DR
- Real-time (streaming) ASR transcribes audio as it arrives, emitting partial hypotheses with latency typically targeted at 200-500 ms end-to-end.
- It is fundamentally a different problem from offline ASR: the model must commit to outputs before seeing the rest of the utterance, trading accuracy for latency.
- Three architectural choices dominate: streaming Conformer with RNN-T or CTC loss; chunked Whisper with overlapping windows; and hybrid attention-CTC models with limited right-context.
- Serving choices matter as much as model choices — Triton Inference Server, TensorRT-LLM, and dedicated streaming runtimes (NeMo Stream, sherpa, faster-whisper-server) shape the latency floor.
What 'Real-Time' Means
Real-time ASR is not just 'fast'. It is a specific operating mode where audio frames arrive incrementally and the system emits transcript hypotheses as they become available, before the speaker has finished. Three numbers define the regime: end-to-end latency (microphone to first token, typically 200-500 ms), emission delay (audio time to text time per word, typically 100-300 ms), and stability (how often partial hypotheses change before being finalised).
Voice agents, live captioning, call-centre coaching, and broadcast subtitling all live in this regime. Batch transcription of meeting recordings does not, and running a streaming model on batch workloads gives up accuracy for a latency benefit nobody needs.
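To make these numbers concrete, here is a minimal sketch that computes two of them from a log of partial hypotheses. The `Partial` record, its fields, and both function names are hypothetical and exist only for illustration. Stability is approximated as the share of updates that rewrote words already shown, which is one of several reasonable definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partial:
    wall_ms: int  # wall-clock time at which this partial hypothesis was emitted
    text: str     # full partial transcript at that moment


def first_token_latency_ms(speech_start_ms: int, partials: List[Partial]) -> Optional[int]:
    """Time from start of speech to the first non-empty partial, or None."""
    for p in partials:
        if p.text.strip():
            return p.wall_ms - speech_start_ms
    return None


def unstable_partial_ratio(partials: List[Partial]) -> float:
    """Fraction of updates that changed words already emitted by the previous partial."""
    if len(partials) < 2:
        return 0.0
    rewrites = 0
    for prev, cur in zip(partials, partials[1:]):
        prev_words = prev.text.split()
        cur_words = cur.text.split()
        # A stable update only appends: the previous words remain a prefix.
        if cur_words[: len(prev_words)] != prev_words:
            rewrites += 1
    return rewrites / (len(partials) - 1)
```

A lower ratio means fewer visible corrections for the user, usually at the cost of holding words back a little longer.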
Architectural Choices
Four model families dominate production real-time ASR in 2026:
| Family | Loss | Examples | Strengths and trade-offs |
|---|---|---|---|
| Streaming Conformer + RNN-T | Transducer | NVIDIA Parakeet RNN-T, Google USM | Natural streaming, low emission delay |
| Streaming Conformer + CTC | CTC | NeMo Conformer-CTC, Parakeet CTC | Simpler, very fast, slightly higher WER |
| Chunked Whisper | Cross-entropy | faster-whisper streaming, WhisperLive | Reuses excellent offline model, higher latency |
| Hybrid attention-CTC | Joint CTC + AED | WeNet U2++, ESPnet | Single model for stream + offline |
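The CTC rows in the table owe much of their simplicity to the decoding rule: take the best label per frame, merge consecutive repeats, and drop blanks. The sketch below shows that greedy rule on integer label IDs. The function name and the choice of `0` as the blank ID are assumptions made for illustration, not any particular toolkit's API.

```python
from typing import List, Sequence


def ctc_greedy_collapse(frame_labels: Sequence[int], blank: int = 0) -> List[int]:
    """Collapse per-frame argmax labels into an output label sequence."""
    out: List[int] = []
    prev = None
    for label in frame_labels:
        # Emit only when the label changes and is not the blank symbol.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


# A blank between two identical labels keeps both; adjacent repeats merge.
print(ctc_greedy_collapse([0, 3, 3, 0, 3, 5, 5, 0]))  # [3, 3, 5]
```

Because each frame is decided independently of previously emitted labels, this rule streams trivially, which is consistent with the "simpler, very fast" entry in the table.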
Streaming Conformer with RNN-T
The Recurrent Neural Network Transducer (RNN-T), introduced by Alex Graves in 2012, is built for streaming. The encoder consumes audio frame by frame; a prediction network acts as an autoregressive language model over previously emitted labels; and a joint network combines the two and decides, at each step, whether to emit a label or a blank. Decoding therefore never has to wait for the end of the utterance.
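The decoding loop described above can be sketched as a greedy search. This is a toy under stated assumptions: `joint` stands in for the prediction and joint networks combined and simply returns the best label ID, `BLANK = 0` and `max_symbols_per_frame` are illustrative choices, and `toy_joint` is a made-up rule rather than a trained model.

```python
from typing import Callable, List, Sequence, Tuple

BLANK = 0  # assumed blank label ID


def rnnt_greedy_decode(
    encoder_frames: Sequence[int],
    joint: Callable[[int, Tuple[int, ...]], int],
    max_symbols_per_frame: int = 5,
) -> List[int]:
    """Greedy transducer decoding: emit labels on a frame until blank, then advance."""
    hyp: List[int] = []
    for frame in encoder_frames:
        # Several labels may be emitted on one frame; blank moves to the next frame.
        for _ in range(max_symbols_per_frame):
            label = joint(frame, tuple(hyp))
            if label == BLANK:
                break
            hyp.append(label)
    return hyp


def toy_joint(frame: int, history: Tuple[int, ...]) -> int:
    """Toy rule: emit the frame's label unless it was the last label emitted."""
    if frame != BLANK and (not history or history[-1] != frame):
        return frame
    return BLANK


print(rnnt_greedy_decode([0, 7, 0, 8, 8, 0], toy_joint))  # [7, 8]
```

The property that matters for streaming is visible in the outer loop: each frame is consumed once, in arrival order, and the hypothesis so far is usable at any point.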
Conformer + RNN-T with a limited right-context attention mask (e.g. 80 ms of look-ahead) is the dominant production design. NVIDIA's Parakeet RNN-T 1.1B is a representative open example, used inside NeMo and exported to Triton with TensorRT optimisations.
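A limited right-context mask can be illustrated in a few lines. The sketch assumes a 40 ms encoder frame stride after subsampling, so 80 ms of look-ahead corresponds to two frames. That stride, the left-context value, and both function names are assumptions for illustration and are not taken from any specific model.

```python
from typing import List


def frames_for_ms(duration_ms: int, frame_stride_ms: int) -> int:
    """Whole encoder frames that fit in duration_ms (rounded down)."""
    return duration_ms // frame_stride_ms


def lookahead_mask(num_frames: int, left_context: int, right_context: int) -> List[List[bool]]:
    """mask[i][j] is True when frame i may attend to frame j."""
    return [
        [-left_context <= j - i <= right_context for j in range(num_frames)]
        for i in range(num_frames)
    ]


right = frames_for_ms(80, 40)  # 2 frames of look-ahead under the assumed stride
mask = lookahead_mask(num_frames=6, left_context=3, right_context=right)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Note that when the same per-layer mask is applied in every layer, the effective look-ahead grows with depth, which is one reason implementations often use chunk-based masks or put look-ahead in only some layers.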
Streaming Whisper
Whisper was trained on 30-second windows and is not natively a streaming model. Practitioners run it streaming-style by chunking incoming audio into overlapping windows (e.g. 5-10 s with 1-2 s overlap), running Whisper on each, and stitching hypotheses with simple heuristics or with a forced-alignment-based reconciliation.
End-to-end latency in this mode is typically 1-2 s, significantly worse than a native streaming Conformer, but accuracy benefits from Whisper's strong offline model. For use cases where 1-2 s is acceptable (live captioning, asynchronous voice agents), this is often the easiest path. faster-whisper-server, WhisperLive, and Whisper-Streaming (from ÚFAL) are common open implementations.
Naive chunked Whisper hallucinates at chunk boundaries because the model expects full sentences and a fixed window can cut speech mid-word. Always pair it with a VAD that aligns chunk boundaries to silences, and treat partial transcripts as unstable until the VAD detects a pause.
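The chunk-and-stitch approach described in this section can be sketched without any ASR library. The two helpers below are hypothetical: `window_bounds` produces overlapping window boundaries, and `stitch` merges two word lists by their longest exact suffix/prefix overlap. This is the "simple heuristic" end of the spectrum; it does not handle the case where the two windows transcribe the overlap differently.

```python
from typing import List, Tuple


def window_bounds(total: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Start/end offsets of overlapping windows covering [0, total)."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    hop = window - overlap
    bounds = []
    start = 0
    while True:
        end = min(start + window, total)
        bounds.append((start, end))
        if end >= total:
            break
        start += hop
    return bounds


def stitch(prev_words: List[str], new_words: List[str]) -> List[str]:
    """Append new_words to prev_words, dropping the longest exactly repeated overlap."""
    max_k = min(len(prev_words), len(new_words))
    for k in range(max_k, 0, -1):
        if prev_words[-k:] == new_words[:k]:
            return prev_words + new_words[k:]
    return prev_words + new_words  # no overlap found: plain concatenation


print(window_bounds(total=20, window=8, overlap=2))  # [(0, 8), (6, 14), (12, 20)]
print(" ".join(stitch("the quick brown fox".split(), "brown fox jumps over".split())))
```

Offsets are unit-agnostic here (seconds or samples). In practice the window boundaries would come from a VAD rather than a fixed hop, for the reason given above.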
Serving Stack
Model choice sets the accuracy ceiling; the serving stack sets the latency floor. Production streaming ASR systems run on:
- Triton Inference Server: gRPC streaming endpoints, dynamic batching across concurrent streams, GPU memory pooling. A common deployment target for NeMo and ESPnet models.
- TensorRT and TensorRT-LLM: kernel-level optimisations for Conformer and Whisper variants, often combined with Triton.
- sherpa / sherpa-onnx: streaming runtimes from the next-gen Kaldi (k2-fsa) project that run Conformer and Zipformer models on CPU or GPU with a low memory footprint.
- Custom WebSocket gateways: accept Opus or PCM frames from clients, run VAD, route audio to GPU inference, and stream back text tokens.
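The gateway flow in the last item (frames in, VAD, inference, text out) can be sketched with the standard library alone. Everything here is a stand-in: the transport is an async iterator rather than a real WebSocket, the VAD is a naive RMS energy threshold on 16-bit little-endian PCM, and `recognise` is a hypothetical async callable in place of a call to a GPU inference backend. A real gateway would buffer speech segments and stream them, rather than send single frames.

```python
import asyncio
import math
import struct
from typing import AsyncIterator, Awaitable, Callable


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: 2 * n])
    return math.sqrt(sum(s * s for s in samples) / n)


async def gateway(
    frames: AsyncIterator[bytes],
    recognise: Callable[[bytes], Awaitable[str]],
    threshold: float = 500.0,  # illustrative energy threshold, not a tuned value
) -> AsyncIterator[str]:
    """Drop low-energy frames, send the rest to the recogniser, yield its text."""
    async for frame in frames:
        if rms(frame) < threshold:
            continue  # treated as silence by the naive VAD
        text = await recognise(frame)
        if text:
            yield text


async def _demo() -> None:
    async def mic() -> AsyncIterator[bytes]:
        yield struct.pack("<4h", 0, 0, 0, 0)              # silence
        yield struct.pack("<4h", 1000, -1000, 1000, -1000)  # "speech"

    async def fake_asr(frame: bytes) -> str:
        return f"{len(frame)} bytes recognised"

    async for token in gateway(mic(), fake_asr):
        print(token)


if __name__ == "__main__":
    asyncio.run(_demo())
```

Keeping the VAD in front of the recogniser means silent streams consume no GPU time, which matters when one GPU is shared by many concurrent streams.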
Hardware Choices
Real-time ASR is memory-bandwidth bound at the per-stream level, but because the models are small, aggregate throughput on a single GPU scales very well across many streams. A single NVIDIA L4 comfortably hosts dozens of concurrent streams of Conformer-CTC or Parakeet RNN-T at sub-300 ms latency; an L40S extends that into the low hundreds. For very large-scale contact-centre workloads, L4 farms are typically more cost-effective than H100s, since the H100's bandwidth is overkill for the model size.
On Yobibyte, real-time ASR is delivered through Triton-backed endpoints on L4 / L40S, with autoscaling tied to concurrent stream count rather than QPS — a streaming-native metric that matches the actual cost driver.
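Scaling on concurrent stream count reduces to simple arithmetic. The sketch below is a generic illustration, not Yobibyte's actual policy: the function name, the headroom parameter, and the example capacity of 40 streams per GPU are all assumptions, and a real per-GPU capacity would have to be measured for the chosen model and latency target.

```python
import math


def desired_replicas(
    active_streams: int,
    streams_per_gpu: int,
    target_utilisation: float = 0.75,  # assumed headroom for bursts
    min_replicas: int = 1,
) -> int:
    """Replica count needed to serve active_streams with some headroom."""
    if streams_per_gpu <= 0 or not 0 < target_utilisation <= 1:
        raise ValueError("invalid capacity or utilisation")
    usable = streams_per_gpu * target_utilisation
    return max(min_replicas, math.ceil(active_streams / usable))


# 100 live streams, 40 streams per GPU (placeholder), 75% target utilisation
print(desired_replicas(100, 40))  # 4
```

QPS is a poor signal here because a single stream is one long-lived request whose cost accrues for its whole duration.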
Latency Budget Example
A typical voice agent latency budget from end-of-user-speech to start-of-agent-speech looks like:
- VAD end-of-turn detection: 200-400 ms (depends on minimum silence threshold).
- Final ASR transcript stabilisation: 100-200 ms after end-of-speech.
- LLM time-to-first-token: 150-400 ms depending on model and prompt cache hit.
- TTS time-to-first-byte: 100-300 ms (ElevenLabs Flash / Kokoro).
- Network and playback buffering: 50-150 ms.
Total round-trip targets of 600-900 ms are achievable with careful pipelining (streaming ASR, prompt-cached LLM, streaming TTS). Sub-500 ms requires very aggressive end-of-turn detection and accepting more interruptions.
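Summing the stage ranges listed above shows where the target sits within the overall budget. The dictionary keys are shorthand labels invented for this sketch; the numbers are the ranges from the list, and the stages are assumed to run strictly in sequence, which pipelining can improve on.

```python
from typing import Dict, Tuple

# (low_ms, high_ms) per stage, taken from the budget list above
BUDGET_MS: Dict[str, Tuple[int, int]] = {
    "vad_end_of_turn": (200, 400),
    "asr_stabilisation": (100, 200),
    "llm_time_to_first_token": (150, 400),
    "tts_time_to_first_byte": (100, 300),
    "network_and_playback": (50, 150),
}


def total_range(budget: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum of best-case and worst-case stage latencies, assuming sequential stages."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


print(total_range(BUDGET_MS))  # (600, 1450)
```

With a sequential sum of 600-1450 ms, a 600-900 ms target requires most stages to run near their lower bound or to overlap, and end-of-turn detection is the largest single term to attack.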