Conformer

TL;DR

Conformer was introduced by Gulati et al. at Google in the 2020 paper 'Conformer: Convolution-augmented Transformer for Speech Recognition' (arXiv:2005.08100).
It interleaves self-attention (good at global context) with depthwise separable convolutions (good at local feature extraction) inside each block — both modules see the same residual stream.
On LibriSpeech the original paper reported WER of 2.1%/4.3% on test-clean/test-other without a language model and 1.9%/3.9% with an external language model, state-of-the-art at the time.
Conformer, or its FastConformer variant, is the encoder backbone for NVIDIA NeMo's flagship ASR models (Parakeet, Canary) and Google's USM, and is widely used in production streaming ASR systems shipping in 2026.

Motivation

Self-attention captures long-range dependencies across an utterance, which helps resolve ambiguities that depend on context several seconds away. Convolutions capture local patterns (phoneme boundaries, formant transitions, fricative bursts) that live in tens of milliseconds. Prior to Conformer, ASR encoders typically picked one: Transformer-based models (Speech-Transformer, Transformer Transducer) relied on attention alone, while ContextNet and QuartzNet relied on convolutions alone.

Gulati et al. argued, and demonstrated, that attention and convolution are complementary and that combining them in every block beats either alone at a matched parameter count.

The Conformer Block

A Conformer block is a 'sandwich' of four sub-modules, each with residual connections and layer normalisation:

Feed-forward module (half-step residual, factor 0.5) — a position-wise FFN with Swish activation.
Multi-head self-attention with relative positional encoding (Transformer-XL style) — captures global context.
Convolution module — pointwise conv, gated linear unit, 1D depthwise conv with batch norm and Swish, then a second pointwise conv. Captures local context.
Feed-forward module (half-step residual, factor 0.5) — same as the first.
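The wiring of the four sub-modules can be sketched in a few lines of plain Python. This is a minimal sketch of the residual structure only, not a reference implementation: the sub-modules are passed in as callables (the names `ffn1`, `mhsa`, `conv`, `ffn2`, and `final_norm` are hypothetical), each is assumed to apply its own layer normalisation internally, and the closing layer norm after the second feed-forward module follows the original paper.

```python
from typing import Callable, List

Frames = List[List[float]]             # time steps x feature dimension
Module = Callable[[Frames], Frames]    # any sub-module that preserves shape


def add_residual(x: Frames, update: Frames, scale: float = 1.0) -> Frames:
    """Return x + scale * update, element-wise over frames and features."""
    return [
        [xi + scale * ui for xi, ui in zip(x_row, u_row)]
        for x_row, u_row in zip(x, update)
    ]


def conformer_block(x: Frames, ffn1: Module, mhsa: Module, conv: Module,
                    ffn2: Module, final_norm: Module) -> Frames:
    """Apply the four sub-modules in order, each on the shared residual stream."""
    x = add_residual(x, ffn1(x), scale=0.5)   # first half-step feed-forward
    x = add_residual(x, mhsa(x))              # self-attention: global context
    x = add_residual(x, conv(x))              # convolution module: local context
    x = add_residual(x, ffn2(x), scale=0.5)   # second half-step feed-forward
    return final_norm(x)                      # closing layer norm (assumed from the paper)
```

Plugging in identity functions for every sub-module makes the scaling visible: an input of 1.0 becomes 1.5, then 3.0, then 6.0, then 9.0. The two feed-forward modules each contribute half a step, while attention and convolution each contribute a full step.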

Training Recipes

The original paper trained its Conformer encoders with a transducer (RNN-T) loss and a single-layer LSTM prediction network; later toolkits also pair the encoder with CTC or attention-encoder-decoder heads. RNN-T, in which a separate prediction network is combined with the encoder through a joint network and trained with the transducer loss, is the dominant choice for production streaming systems because it emits labels frame-synchronously and so supports streaming inference naturally.
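To make the role of the joint network concrete, here is a minimal pure-Python sketch of how encoder frames and prediction-network states are combined into the lattice that the transducer loss is computed over. It assumes an additive joint with a tanh non-linearity, which is one common form, and assumes both inputs have already been projected to the same width. The function names and the tiny weights in the usage note are illustrative, not taken from any toolkit, and the loss itself (a sum over alignment paths through the lattice) is not shown.

```python
import math
from typing import List

Vector = List[float]


def log_softmax(logits: Vector) -> Vector:
    """Numerically stable log-softmax over one logit vector."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def joint_lattice(enc: List[Vector], pred: List[Vector],
                  w_out: List[Vector]) -> List[List[Vector]]:
    """Return log P(label | t, u) for every encoder frame t and label position u.

    enc:   T encoder output frames, each of width H
    pred:  U + 1 prediction-network states, each of width H
    w_out: V output rows (vocabulary plus blank), each of width H
    """
    lattice = []
    for h_t in enc:
        row = []
        for g_u in pred:
            # Additive combination followed by tanh (assumed joint form).
            hidden = [math.tanh(a + b) for a, b in zip(h_t, g_u)]
            logits = [sum(w * h for w, h in zip(w_row, hidden)) for w_row in w_out]
            row.append(log_softmax(logits))
        lattice.append(row)
    return lattice
```

With 3 encoder frames, 2 prediction states, and 3 output rows, `joint_lattice` returns a 3 × 2 × 3 nested list in which every innermost vector is a normalised log-distribution. The T × (U + 1) × V shape is why transducer training is memory-hungry compared with CTC, whose output is only T × V.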

NVIDIA NeMo's Parakeet family pairs a FastConformer encoder (an efficient Conformer variant with 8× downsampling and depthwise-separable convolutional subsampling) with CTC, RNN-T, or TDT (token-and-duration transducer) heads. Parakeet RNN-T 1.1B and Parakeet TDT 0.6B sit near the top of the Hugging Face Open ASR Leaderboard as of mid-2026, with average WERs in the single digits across the eight benchmark datasets.
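The effect of the downsampling factor is easy to see with some back-of-the-envelope arithmetic. The sketch below assumes a 10 ms feature hop (a common default, not stated above) and compares 8× downsampling with the 4× used by the original Conformer. The helper names are hypothetical, and the cost ratio covers only the quadratic self-attention term, not the whole encoder.

```python
def encoder_frames(audio_seconds: float, hop_ms: float = 10.0,
                   subsampling: int = 8) -> int:
    """Encoder frames for an utterance, ignoring padding and edge effects."""
    feature_frames = int(audio_seconds * 1000 / hop_ms)
    return feature_frames // subsampling


def relative_attention_cost(subsampling_a: int, subsampling_b: int) -> float:
    """Self-attention cost with factor a relative to factor b (quadratic in length)."""
    return (subsampling_b / subsampling_a) ** 2


frames_4x = encoder_frames(20.0, subsampling=4)   # 500 frames of 40 ms each
frames_8x = encoder_frames(20.0, subsampling=8)   # 250 frames of 80 ms each
attention_ratio = relative_attention_cost(8, 4)   # 0.25
```

Under these assumptions a 20-second utterance shrinks from 500 to 250 encoder frames, so the attention matrices are a quarter of the size.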

Streaming vs Offline

For streaming use, Conformer is run with a limited right-context attention mask (chunk-based or look-ahead-bounded) and causal convolutions so that emission can begin before the utterance ends. The trade-off is well understood: a larger right-context window improves WER but increases latency. Production systems typically pick a chunk size of 160–480 ms and accept the corresponding emission delay.
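A chunk-based attention mask can be sketched as follows. This is an illustrative pure-Python version, not any toolkit's implementation: the function names and the `left_chunks` parameter are hypothetical, and the 40 ms encoder frame assumes a 10 ms feature hop with 4× subsampling. Causal convolutions, the other half of a streaming setup, are not shown.

```python
from typing import List


def chunk_attention_mask(num_frames: int, chunk_size: int,
                         left_chunks: int = -1) -> List[List[bool]]:
    """mask[i][j] is True when query frame i may attend to key frame j.

    Frames are grouped into chunks of `chunk_size`. A frame sees all of its
    own chunk (bounded look-ahead) plus `left_chunks` earlier chunks; a
    negative `left_chunks` means unlimited history.
    """
    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size
        first = 0 if left_chunks < 0 else max(0, (chunk - left_chunks) * chunk_size)
        last = min(num_frames, (chunk + 1) * chunk_size)   # exclusive bound
        mask.append([first <= j < last for j in range(num_frames)])
    return mask


def chunk_frames(chunk_ms: float, frame_ms: float = 40.0) -> int:
    """Encoder frames per chunk for a given chunk duration (40 ms frame assumed)."""
    return max(1, round(chunk_ms / frame_ms))
```

With 6 frames and a chunk size of 2, frame 0 can attend to frames 0 and 1, while frame 3 can attend to frames 0 through 3. Under the 40 ms assumption, the 160 ms and 480 ms ends of the range above correspond to chunks of 4 and 12 encoder frames.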

Offline Conformer enjoys full bidirectional attention and non-causal convolutions, and consistently reports lower WER than its streaming twin trained on the same data.

If you need both modes, train a single 'unified' Conformer with dynamic chunk sizes during training — the same checkpoint then serves both streaming and offline inference. WeNet and ESPnet ship this recipe out of the box.
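The core of dynamic-chunk training is simply drawing a different chunk size for each batch. The sketch below shows one way to do that with the standard library. The 50% full-context probability and the maximum of 25 frames are illustrative assumptions, not values taken from the WeNet or ESPnet recipes, and `sample_chunk_size` is a hypothetical name.

```python
import random
from typing import Optional


def sample_chunk_size(rng: random.Random, max_chunk: int = 25,
                      full_context_prob: float = 0.5) -> Optional[int]:
    """Pick the attention chunk size for one training batch.

    Returns None for full (offline) attention, otherwise a chunk size in
    encoder frames drawn uniformly from 1..max_chunk.
    """
    if rng.random() < full_context_prob:
        return None                       # train this batch with full context
    return rng.randint(1, max_chunk)      # train this batch in streaming mode


rng = random.Random(0)                    # seeded for reproducibility
schedule = [sample_chunk_size(rng) for _ in range(8)]
```

Each sampled value would then parameterise whatever attention mask the encoder uses for that batch. Because the model sees both full-context and many chunk sizes during training, the chunk size can be chosen freely at inference time.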

Where Conformer Sits in 2026

Conformer is a standard ASR encoder in NeMo, ESPnet, WeNet, K2/Icefall, and SpeechBrain. Its FastConformer variant is the encoder under NVIDIA Canary 1B (multilingual + translation) and Parakeet (English streaming and offline), and Conformer-style encoders run in many internal Google and Amazon production ASR stacks. Whisper, by contrast, uses a vanilla Transformer encoder; its robustness comes from data scale, not architecture choice.

Pure self-attention models with sufficient data (Whisper) and pure attention-free models (e.g. SSM-based) have closed some of the gap, but for low-latency streaming ASR with constrained model size, Conformer remains the architecture to beat.

References

Conformer: Convolution-augmented Transformer for Speech Recognition · arXiv
NVIDIA NeMo ASR collection · GitHub
ESPnet speech processing toolkit · GitHub
Hugging Face Open ASR Leaderboard · Hugging Face
