Real-Time ASR

TL;DR

Real-time (streaming) ASR transcribes audio as it arrives, emitting partial hypotheses with latency typically targeted at 200-500 ms end-to-end.
It is fundamentally a different problem from offline ASR: the model must commit to outputs before seeing the rest of the utterance, trading accuracy for latency.
Three architectural choices dominate: streaming Conformer with RNN-T or CTC loss; chunked Whisper with overlapping windows; and hybrid attention-CTC models with limited right-context.
Serving choices matter as much as model choices — Triton Inference Server, TensorRT-LLM, and dedicated streaming runtimes (NeMo Stream, sherpa, faster-whisper-server) shape the latency floor.

What 'Real-Time' Means#

Real-time ASR is not just 'fast'. It is a specific operating mode where audio frames arrive incrementally and the system emits transcript hypotheses as they become available, before the speaker has finished. Three numbers define the regime: end-to-end latency (microphone to first token, typically 200-500 ms), emission delay (audio time to text time per word, typically 100-300 ms), and stability (how often partial hypotheses change before being finalised).

Voice agents, live captioning, call-centre coaching, and broadcast subtitling all live in this regime. Batch transcription of meeting recordings does not — and using a streaming model for batch is needlessly accuracy-limited.

Architectural Choices#

Three model families dominate production real-time ASR in 2026:

Family	Loss	Examples	Strengths
Streaming Conformer + RNN-T	Transducer	NVIDIA Parakeet RNN-T, Google USM	Natural streaming, low emission delay
Streaming Conformer + CTC	CTC	NeMo Conformer-CTC, Parakeet CTC	Simpler, very fast, slightly higher WER
Chunked Whisper	Cross-entropy	faster-whisper streaming, WhisperLive	Reuses excellent offline model, higher latency
Hybrid attention-CTC	Joint CTC + AED	WeNet U2++, ESPnet	Single model for stream + offline

Streaming Conformer with RNN-T#

The Recurrent Neural Network Transducer (RNN-T) loss, introduced by Alex Graves in 2012, is built for streaming. The encoder consumes audio frame by frame; a prediction network is an autoregressive language model over previously emitted labels; a joint network combines them and decides whether to emit a label or a blank at each step. Decoding proceeds without ever needing to wait for the end of the utterance.

Conformer + RNN-T with a limited right-context attention mask (e.g. 80 ms of look-ahead) is the dominant production design. NVIDIA's Parakeet RNN-T 1.1B is a representative open example, used inside NeMo and exported to Triton with TensorRT optimisations.

Streaming Whisper#

Whisper was trained on 30-second windows and is not natively a streaming model. Practitioners run it streaming-style by chunking incoming audio into overlapping windows (e.g. 5-10 s with 1-2 s overlap), running Whisper on each, and stitching hypotheses with simple heuristics or with a forced-alignment-based reconciliation.

End-to-end latency in this mode is typically 1-2 s — significantly worse than a native streaming Conformer — but accuracy benefits from Whisper's strong offline model. For use cases where 1-2 s is acceptable (live captioning, asynchronous voice agents), this is often the easiest path. faster-whisper-server, WhisperLive, and Whisper Streaming (Ufal) are common open implementations.

Naive chunked Whisper hallucinates at chunk boundaries because the model expects full sentences. Always pair it with a VAD that aligns chunk boundaries to silences, and treat partial transcripts as unstable until a downstream VAD-detected pause.

Serving Stack#

Model choice sets the accuracy ceiling; the serving stack sets the latency floor. Production streaming ASR systems run on:

Triton Inference Server — gRPC streaming endpoints, dynamic batching across concurrent streams, GPU memory pooling. The most common deployment target for NeMo and ESPnet models.
TensorRT and TensorRT-LLM — kernel-level optimisations for Conformer and Whisper variants, often combined with Triton.
sherpa / sherpa-onnx — Kaldi project's streaming runtime, runs Conformer and Zipformer models on CPU or GPU with low memory footprint.
Custom WebSocket gateways — accept Opus or PCM frames from clients, run VAD, route audio to GPU inference, stream back text tokens.

Hardware Choices#

Real-time ASR is memory-bandwidth bound at the per-stream level but throughput-scaling on a single GPU is excellent because models are small. A single NVIDIA L4 comfortably hosts dozens of concurrent streams of Conformer-CTC or Parakeet RNN-T at sub-300 ms latency; an L40S extends that into the low hundreds. For very large-scale call-centre or contact-centre workloads, L4 farms are typically more cost-effective than H100, since the H100's bandwidth is overkill for the model size.

On Yobibyte, real-time ASR is delivered through Triton-backed endpoints on L4 / L40S, with autoscaling tied to concurrent stream count rather than QPS — a streaming-native metric that matches the actual cost driver.

Latency Budget Example#

A typical voice agent latency budget from end-of-user-speech to start-of-agent-speech looks like:

VAD end-of-turn detection: 200-400 ms (depends on minimum silence threshold).
Final ASR transcript stabilisation: 100-200 ms after end-of-speech.
LLM time-to-first-token: 150-400 ms depending on model and prompt cache hit.
TTS time-to-first-byte: 100-300 ms (ElevenLabs Flash / Kokoro).
Network and playback buffering: 50-150 ms.

Total round-trip targets of 600-900 ms are achievable with careful pipelining (streaming ASR, prompt-cached LLM, streaming TTS). Sub-500 ms requires very aggressive end-of-turn detection and accepting more interruptions.

References

Sequence Transduction with Recurrent Neural Networks (Graves, 2012) · arXiv
NVIDIA NeMo streaming ASR documentation · NVIDIA
Whisper Streaming (Ufal) · GitHub
Triton Inference Server · GitHub

TL;DR

Real-time (streaming) ASR transcribes audio as it arrives, emitting partial hypotheses with latency typically targeted at 200-500 ms end-to-end.
It is fundamentally a different problem from offline ASR: the model must commit to outputs before seeing the rest of the utterance, trading accuracy for latency.
Three architectural choices dominate: streaming Conformer with RNN-T or CTC loss; chunked Whisper with overlapping windows; and hybrid attention-CTC models with limited right-context.
Serving choices matter as much as model choices — Triton Inference Server, TensorRT-LLM, and dedicated streaming runtimes (NeMo Stream, sherpa, faster-whisper-server) shape the latency floor.

What 'Real-Time' Means#

Architectural Choices#

Three model families dominate production real-time ASR in 2026:

Family	Loss	Examples	Strengths
Streaming Conformer + RNN-T	Transducer	NVIDIA Parakeet RNN-T, Google USM	Natural streaming, low emission delay
Streaming Conformer + CTC	CTC	NeMo Conformer-CTC, Parakeet CTC	Simpler, very fast, slightly higher WER
Chunked Whisper	Cross-entropy	faster-whisper streaming, WhisperLive	Reuses excellent offline model, higher latency
Hybrid attention-CTC	Joint CTC + AED	WeNet U2++, ESPnet	Single model for stream + offline

Triton Inference Server — gRPC streaming endpoints, dynamic batching across concurrent streams, GPU memory pooling. The most common deployment target for NeMo and ESPnet models.
TensorRT and TensorRT-LLM — kernel-level optimisations for Conformer and Whisper variants, often combined with Triton.
sherpa / sherpa-onnx — Kaldi project's streaming runtime, runs Conformer and Zipformer models on CPU or GPU with low memory footprint.
Custom WebSocket gateways — accept Opus or PCM frames from clients, run VAD, route audio to GPU inference, stream back text tokens.

Hardware Choices#

Latency Budget Example#

A typical voice agent latency budget from end-of-user-speech to start-of-agent-speech looks like:

VAD end-of-turn detection: 200-400 ms (depends on minimum silence threshold).
Final ASR transcript stabilisation: 100-200 ms after end-of-speech.
LLM time-to-first-token: 150-400 ms depending on model and prompt cache hit.
TTS time-to-first-byte: 100-300 ms (ElevenLabs Flash / Kokoro).
Network and playback buffering: 50-150 ms.

References

Sequence Transduction with Recurrent Neural Networks (Graves, 2012) · arXiv
NVIDIA NeMo streaming ASR documentation · NVIDIA
Whisper Streaming (Ufal) · GitHub
Triton Inference Server · GitHub

Real-Time ASR

What 'Real-Time' Means#

Architectural Choices#

Streaming Conformer with RNN-T#

Streaming Whisper#

Serving Stack#

Hardware Choices#

Latency Budget Example#

References

Browse all entries

Deploy on Yobitel

Real-Time ASR

What 'Real-Time' Means#

Architectural Choices#

Streaming Conformer with RNN-T#

Streaming Whisper#

Serving Stack#

Hardware Choices#

Latency Budget Example#

References

Browse all entries

Deploy on Yobitel