Speaker Diarisation

TL;DR

Speaker diarisation is the task of partitioning an audio recording into homogeneous speaker segments — typically without knowing speaker identities in advance.
A diarisation system answers 'who spoke when'; combined with an ASR system it produces speaker-attributed transcripts (Speaker A: 'Good morning'; Speaker B: 'Hi').
pyannote.audio is the dominant open toolkit and ships pretrained neural diarisation pipelines; sherpa-onnx provides ONNX-runnable counterparts for low-latency and edge deployment.
Diarisation Error Rate (DER) is the standard metric — the sum of speaker confusion, missed speech, and false alarm time divided by total reference speech time. Production systems target DER in the high single digits on clean meetings.
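
As a worked illustration of the DER definition, the sketch below computes the ratio from per-error durations. The function name and the durations are hypothetical, chosen only to make the arithmetic concrete.

```python
def diarisation_error_rate(confusion: float, missed: float,
                           false_alarm: float, reference_speech: float) -> float:
    """Return DER as a fraction of total reference speech time (inputs in seconds)."""
    if reference_speech <= 0:
        raise ValueError("reference_speech must be positive")
    return (confusion + missed + false_alarm) / reference_speech


# Hypothetical meeting containing 3000 s of reference speech.
der = diarisation_error_rate(confusion=120.0, missed=90.0,
                             false_alarm=60.0, reference_speech=3000.0)
print(f"DER = {der:.1%}")  # DER = 9.0%
```

Because false-alarm time is counted in the numerator but not the denominator, DER can exceed 100% on recordings with little speech.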

The Problem

An hour of meeting audio with no metadata is nearly useless. The same hour, segmented by speaker, becomes searchable, summarisable, and analysable. Diarisation is the bridge, and unlike speaker identification, it does not need to know who the speakers are by name. It only needs to be consistent: every segment from speaker A should be labelled 'speaker_0', every segment from speaker B 'speaker_1', and so on.

When external speaker identification is available (e.g. enrolled embeddings from a CRM, or per-channel mic capture in a conferencing platform), diarisation output is typically post-processed to map anonymous labels to known identities.
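
One way this post-processing might look, assuming each anonymous label has a representative embedding and each known identity has an enrolled embedding: compare them by cosine similarity and rename a label only when the best match clears a threshold. The function names, the toy three-dimensional vectors, and the 0.6 threshold are all illustrative assumptions, not values from any particular system.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def map_labels(cluster_embeddings, enrolled, threshold=0.6):
    """Map each anonymous label to its best enrolled identity, or keep the label."""
    mapping = {}
    for label, emb in cluster_embeddings.items():
        best_name, best_score = None, threshold
        for name, ref in enrolled.items():
            score = cosine(emb, ref)
            if score >= best_score:
                best_name, best_score = name, score
        # No enrolled identity was similar enough: keep the anonymous label.
        mapping[label] = best_name if best_name is not None else label
    return mapping


# Toy embeddings; real speaker embeddings have hundreds of dimensions.
clusters = {
    "speaker_0": [0.9, 0.1, 0.0],
    "speaker_1": [0.0, 0.2, 0.9],
    "speaker_2": [0.5, 0.5, 0.5],
}
enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 0.0, 1.0]}

mapping = map_labels(clusters, enrolled)
print(mapping)
```

This greedy version can assign the same identity to two labels; a stricter variant would solve a one-to-one assignment instead.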

The Classical Pipeline

The traditional pipeline, still common in production, runs five stages:

Voice activity detection — remove non-speech regions.
Speaker change detection — split speech into segments unlikely to contain a speaker change.
Speaker embedding — extract a fixed-size vector per segment (x-vector, ECAPA-TDNN, or wav2vec 2.0-based embeddings).
Clustering — group segments by speaker (agglomerative hierarchical, spectral, or VBx clustering with a known or estimated number of speakers).
Resegmentation — refine boundaries using a hidden Markov model or end-to-end neural model.
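
The clustering stage can be sketched in a few lines. The example below is a minimal average-linkage agglomerative clustering over cosine distance, assuming one non-zero embedding per segment is already available; the two-dimensional vectors and the 0.5 stopping threshold are illustrative assumptions, and production systems typically rely on a library implementation or VBx instead.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def agglomerative_cluster(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering; returns one label per segment."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break  # remaining clusters are treated as different speakers
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # Number speakers by the order in which they first appear.
    labels = [None] * len(embeddings)
    for k, members in enumerate(sorted(clusters, key=min)):
        for i in members:
            labels[i] = f"speaker_{k}"
    return labels


# Four toy segment embeddings from two alternating speakers.
segments = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
labels = agglomerative_cluster(segments)
print(labels)
```

Stopping on a distance threshold corresponds to the "estimated number of speakers" case; when the count is known, the loop would instead run until that many clusters remain.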

End-to-End Neural Diarisation

The pipeline approach has well-understood failure modes: segmentation errors propagate into clustering, and clustering fails on overlapping speech (two speakers at once). End-to-end neural diarisation (EEND) models, introduced by Fujita et al. and refined heavily by Hervé Bredin and collaborators in pyannote, predict speaker activity directly from audio, naturally handle overlap, and have substantially reduced DER on hard conditions like meetings and call-centre audio.

pyannote 3.x ships a 'pipeline' that combines a powerset-classification segmentation model with embedding-based speaker assignment — the current default for open-source diarisation. sherpa-onnx packages similar models for ONNX Runtime, enabling real-time CPU-friendly diarisation.
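
To make "powerset classification" concrete: instead of predicting one independent on/off output per speaker, the model picks exactly one class per frame, where each class is a set of simultaneously active speakers. The sketch below enumerates those classes and decodes a class index back to per-speaker activity. The helper names and the class ordering are assumptions for illustration, and the 3-speaker, 2-simultaneous setting is used only as an example configuration.

```python
from itertools import combinations


def powerset_classes(max_speakers: int, max_simultaneous: int):
    """List every speaker subset of size 0..max_simultaneous, smallest first."""
    classes = []
    for size in range(max_simultaneous + 1):
        classes.extend(combinations(range(max_speakers), size))
    return classes


def to_multilabel(class_index: int, classes, max_speakers: int):
    """Convert one predicted powerset class into per-speaker 0/1 activity."""
    active = classes[class_index]
    return [1 if s in active else 0 for s in range(max_speakers)]


classes = powerset_classes(max_speakers=3, max_simultaneous=2)
print(len(classes))  # 7
print(classes)       # [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]

# Hypothetical per-frame class predictions: silence, speaker 0, overlap 0+1, speaker 1.
frame_predictions = [0, 1, 4, 2]
activity = [to_multilabel(c, classes, 3) for c in frame_predictions]
print(activity)      # [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]]
```

Overlap becomes an ordinary class rather than a thresholding decision, which is why this formulation handles simultaneous speech without a separate overlap detector.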

Diarisation is much harder than ASR on overlapping speech. Two people talking at the same time is the single largest source of diarisation error in real meetings. Microphone array beamforming, when available, helps substantially.

Combining ASR and Diarisation

There are two common integration patterns. Sequential: run ASR with word timestamps, run diarisation separately, then assign each word to the speaker whose segment covers its timestamp. Joint: train or run a single model that emits speaker-labelled transcripts directly (Whisper-Diarisation, NeMo Sortformer).

Sequential is simpler and the dominant pattern. WhisperX is a popular open implementation: faster-whisper + wav2vec 2.0 forced alignment for word timestamps + pyannote diarisation for speaker labels. Joint approaches reduce latency and avoid alignment errors at speaker boundaries but require co-training or careful model selection.

python

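import torch
from pyannote.audio import Pipeline

# Load the pretrained pipeline; the model is gated, so a Hugging Face token is required.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",
)
# pyannote.audio 3.x expects a torch.device here, not a bare string.
pipeline.to(torch.device("cuda"))

# num_speakers is optional; pass it only when the speaker count is known.
diarisation = pipeline("meeting.wav", num_speakers=4)

# Print one line per speaker turn: start time, end time, anonymous label.
for turn, _, speaker in diarisation.itertracks(yield_label=True):
    print(f"{turn.start:7.2f} - {turn.end:7.2f}  {speaker}")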
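
The sequential pattern's final step, assigning ASR words to speaker turns, might look like the sketch below. It assumes turns are available as (start, end, speaker) tuples and words as (text, start, end) tuples; both shapes, the function name, and the sample data are hypothetical. It picks the turn with the largest temporal overlap, a slightly more forgiving variant of "the segment that covers the timestamp".

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most."""
    labelled = []
    for word, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            # Length of the intersection of the two intervals (negative if disjoint).
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        # Words outside every turn keep speaker None.
        labelled.append((word, best_speaker))
    return labelled


# Hypothetical diarisation turns and ASR word timestamps, in seconds.
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.0, "SPEAKER_01")]
words = [("Good", 0.2, 0.5), ("morning", 0.6, 1.1),
         ("Hi", 2.1, 2.4), ("um", 4.5, 4.7)]

labelled = assign_speakers(words, turns)
for word, speaker in labelled:
    print(f"{speaker}: {word}")
```

Words that straddle a speaker boundary are where this step goes wrong most often, which is the alignment error the joint approaches aim to avoid.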

Deployment Notes

Diarisation models are far smaller than ASR models — pyannote segmentation runs comfortably on a single L4 or even on CPU for non-realtime use. The expensive parts of a 'speaker-labelled transcript' pipeline are ASR and the LLM that follows, not diarisation itself. On Yobibyte, diarisation is shipped as a thin sidecar service that complements the Whisper / Conformer ASR endpoints.

References

pyannote/pyannote-audio · GitHub
End-to-End Neural Speaker Diarization with Self-Attention · arXiv
k2-fsa/sherpa-onnx · GitHub
pyannote/speaker-diarization-3.1 model card · Hugging Face
