TL;DR
- Voice Activity Detection (VAD) is the task of classifying short audio frames as speech or non-speech. It is the first stage of essentially every production ASR, diarisation, and voice-agent pipeline.
- Three families dominate: energy / spectral threshold methods (cheap, brittle), classical statistical methods (WebRTC VAD), and small neural models (Silero VAD).
- Silero VAD is the de facto open default in 2026 — a small (~1-2 MB) neural model that runs in real time on CPU, with strong noise robustness.
- VAD's job is not perfect speech / non-speech classification but trading off two errors with known costs: false positives waste downstream compute and invite hallucinated transcripts from the ASR model; false negatives clip the start or end of utterances.
Why VAD Matters
Modern ASR models like Whisper are expensive to run and have a documented tendency to hallucinate plausible-but-invented transcripts when fed silence or long stretches of non-speech audio. Diarisation systems need clean speech segments to cluster correctly. Voice agents need to detect end-of-turn to know when to start responding. Every one of these depends on a reliable VAD.
A good VAD is invisible: it gates the right audio through and discards the rest, so downstream models only see what they were trained on. A bad VAD shows up as either lost words at the start of utterances, or hallucinated transcripts during silence.
The Three Families
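VAD techniques fall along a quality-vs-cost spectrum. The three families from the summary come first, followed by a heavier neural option used mainly inside diarisation pipelines: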
- Energy / spectral thresholds — measure short-term energy, zero-crossing rate, or spectral flatness against a threshold. Effectively free. Works in clean studio audio. Falls apart with background noise, reverberation, or non-stationary noise like keyboard typing.
- WebRTC VAD — Google's classic 2011 GMM-based detector shipped in Chromium. Uses six spectral sub-bands and Gaussian Mixture Models to score each 10/20/30 ms frame. Tiny, deterministic, robust enough for VoIP, available everywhere — still a sensible default for low-power edge devices.
- Silero VAD — a compact neural network (roughly 1-2 MB on disk) released open-source by the Silero team. Strong noise robustness, supports multilingual speech, and runs much faster than real time on a single CPU core. It ships as the built-in VAD filter in faster-whisper, is supported by WhisperX, and appears in many open voice agent stacks.
- Pyannote segmentation / VAD — heavier neural models from the pyannote.audio toolkit. Higher accuracy than Silero on hard conditions, GPU-friendly, used as a joint segmentation + speaker-change detector inside diarisation pipelines.
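To make the cheapest end of that spectrum concrete, here is a minimal sketch of an energy-threshold detector in plain Python. The function names, the 30 ms frame size, and the -35 dBFS threshold are illustrative assumptions, not values taken from any of the toolkits above; samples are assumed to be floats in [-1, 1].

```python
import math

def frame_rms_db(frame):
    """RMS level of one frame in dB relative to full scale (samples in [-1, 1])."""
    if not frame:
        return -120.0
    mean_square = sum(s * s for s in frame) / len(frame)
    # Floor the value so digital silence does not produce log10(0).
    return 10.0 * math.log10(max(mean_square, 1e-12))

def energy_vad(samples, sample_rate=16000, frame_ms=30, threshold_db=-35.0):
    """Label each non-overlapping frame True (speech) or False (non-speech) by energy alone."""
    frame_len = sample_rate * frame_ms // 1000
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_rms_db(frame) > threshold_db)
    return flags

# Synthetic input: 4800 samples (0.3 s at 16 kHz) of a very quiet tone,
# then 4800 samples of a loud tone standing in for speech.
sr = 16000
quiet = [0.001 * math.sin(2 * math.pi * 220 * n / sr) for n in range(4800)]
loud = [0.3 * math.sin(2 * math.pi * 220 * n / sr) for n in range(4800)]
flags = energy_vad(quiet + loud, sample_rate=sr)
```

On this synthetic input the first ten frames fall below the threshold and the last ten rise above it. The same fixed threshold is what makes the approach brittle: a loud keyboard click clears it just as easily as a spoken word.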
Tuning Trade-offs
Every VAD exposes a small number of knobs: a speech-probability threshold, a minimum speech duration to count as a segment, a minimum silence duration to end a segment, and pre-roll / post-roll padding around detected speech.
- Higher threshold → fewer false positives, more chance of clipping low-energy speech (e.g. soft fricatives, quiet endings).
- Lower threshold → catches everything, but lets noise through and increases downstream cost.
- Longer minimum silence → fewer fragmented segments, but slower turn-detection for voice agents.
- Pre-roll padding (typically 100-300 ms) prevents missing the leading consonant of a word.
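The knobs above can be sketched as a post-processing step over per-frame speech probabilities. This is an illustrative sketch rather than any toolkit's actual implementation: the function name `probs_to_segments`, its parameter names, and its defaults are assumptions, and for simplicity it applies the same padding before and after each segment.

```python
def probs_to_segments(probs, frame_ms=30, threshold=0.5,
                      min_speech_ms=250, min_silence_ms=300, pad_ms=100):
    """Turn per-frame speech probabilities into padded (start_ms, end_ms) segments."""
    min_speech = max(1, min_speech_ms // frame_ms)    # frames needed to keep a segment
    min_silence = max(1, min_silence_ms // frame_ms)  # silent frames needed to end one
    raw = []          # (start_frame, end_frame_exclusive)
    start = None
    silence_run = 0
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence:
                end = i - silence_run + 1             # first frame of the silent run
                if end - start >= min_speech:
                    raw.append((start, end))
                start, silence_run = None, 0
    if start is not None:                             # close a segment still open at the end
        end = len(probs) - silence_run
        if end - start >= min_speech:
            raw.append((start, end))

    total_ms = len(probs) * frame_ms
    segments = []
    for s, e in raw:
        s_ms = max(0, s * frame_ms - pad_ms)          # pre-roll
        e_ms = min(total_ms, e * frame_ms + pad_ms)   # post-roll
        if segments and s_ms <= segments[-1][1]:      # merge segments that padding made overlap
            segments[-1] = (segments[-1][0], e_ms)
        else:
            segments.append((s_ms, e_ms))
    return segments

# 10 ms frames: silence, speech, a 50 ms pause, more speech, long silence, a 20 ms blip, silence.
probs = [0.1] * 10 + [0.9] * 30 + [0.1] * 5 + [0.9] * 20 + [0.1] * 40 + [0.9] * 2 + [0.1] * 20
segments = probs_to_segments(probs, frame_ms=10, threshold=0.5,
                             min_speech_ms=100, min_silence_ms=200, pad_ms=100)
```

With these settings the 50 ms pause is bridged because it is shorter than the minimum silence, the 20 ms blip is discarded because it is shorter than the minimum speech duration, and the single surviving segment is widened by 100 ms on each side.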
Tune VAD on representative audio, not on clean LibriSpeech. The defaults shipped with toolkits are typically calibrated for studio conditions and will under-trigger in real telephony or open-office scenarios.
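One way to do that tuning is a threshold sweep over frames with hand-labelled ground truth. The sketch below assumes you already have per-frame probabilities and 0/1 speech labels from your own recordings; the helper names, the toy data, and the cost weights are hypothetical and chosen only to show the mechanics.

```python
def sweep_thresholds(probs, labels, thresholds):
    """Return (threshold, false_positive_rate, false_negative_rate) per candidate threshold."""
    n_speech = sum(labels)
    n_nonspeech = len(labels) - n_speech
    rows = []
    for t in thresholds:
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and not y)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y)
        fpr = fp / n_nonspeech if n_nonspeech else 0.0
        fnr = fn / n_speech if n_speech else 0.0
        rows.append((t, fpr, fnr))
    return rows

def pick_threshold(rows, cost_fp=1.0, cost_fn=1.0):
    """Pick the threshold that minimises the weighted sum of the two error rates."""
    return min(rows, key=lambda r: cost_fp * r[1] + cost_fn * r[2])[0]

# Toy data: four non-speech frames followed by four speech frames.
probs = [0.05, 0.2, 0.4, 0.6, 0.35, 0.7, 0.9, 0.95]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
rows = sweep_thresholds(probs, labels, [0.3, 0.5, 0.8])

# Assumed costs: a clipped word is treated as three times worse than a wasted ASR call.
best = pick_threshold(rows, cost_fp=1.0, cost_fn=3.0)
```

Raising the threshold moves errors from the false-positive column to the false-negative column, and the cost weights decide where along that curve to sit. A batch transcription job might weight false positives more heavily, while a voice agent that cannot afford to clip words would do the opposite.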
Where VAD Sits in the Stack
In a typical real-time voice agent, VAD runs immediately after microphone capture (or after the codec decoder for telephony), before ASR, before diarisation, and before any LLM call. It is the cheapest stage in the pipeline by orders of magnitude, so most systems run it on every frame and only invoke ASR on the segments it marks as speech.
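The gating pattern described here can be sketched as a small streaming loop. Both callables are placeholders: `is_speech` stands in for whichever frame-level detector you use, and `on_segment` stands in for the ASR call. The function name and the pre-roll and hangover defaults are assumptions made for illustration.

```python
from collections import deque

def gate_stream(frames, is_speech, on_segment, pre_roll_frames=3, hangover_frames=5):
    """Run VAD on every frame and hand only speech segments to the downstream stage."""
    pre_roll = deque(maxlen=pre_roll_frames)  # most recent non-speech frames
    active = []                               # frames of the segment being built
    silence_run = 0
    for frame in frames:
        if is_speech(frame):
            if not active:
                active.extend(pre_roll)       # prepend pre-roll to protect leading consonants
                pre_roll.clear()
            active.append(frame)
            silence_run = 0
        elif active:
            active.append(frame)              # keep trailing frames as post-roll
            silence_run += 1
            if silence_run >= hangover_frames:
                on_segment(active)            # end of turn: invoke the expensive stage
                active = []
                silence_run = 0
        else:
            pre_roll.append(frame)
    if active:                                # flush a segment cut off by end of stream
        on_segment(active)

# Toy stream: frame names starting with "S" count as speech.
stream = ["n0", "n1", "n2", "n3", "S4", "S5", "n6", "n7", "S8",
          "n9", "n10", "n11", "n12", "n13", "n14"]
asr_inputs = []
gate_stream(stream, is_speech=lambda f: f.startswith("S"),
            on_segment=asr_inputs.append, pre_roll_frames=2, hangover_frames=3)
```

In this toy run the downstream stage is called once, with 10 of the 15 frames. The hangover count plays the role of the minimum silence duration: a larger value gives fewer, longer segments, and a voice agent has to wait longer before it can start responding.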
Yobibyte's voice agent reference architecture uses Silero VAD as the default frame-level detector and pyannote segmentation as a secondary check for diarisation use cases, with both stages running on a CPU sidecar to keep GPU capacity reserved for ASR and the LLM.
References
- snakers4/silero-vad · GitHub
- WebRTC VAD (Chromium source) · Chromium
- pyannote.audio segmentation model · Hugging Face