TL;DR
- EAGLE-2 is a speculative-decoding method introduced by Li et al. (arXiv:2406.16858, 2024).
- Predicts the next feature (hidden state) rather than the next token, using a small autoregressive head trained on the target model's penultimate-layer features, i.e. the hidden states that feed the LM head.
- Improves on EAGLE-1 with dynamic draft trees: the tree shape adapts per context, concentrating the draft budget on the branches the head scores as most likely to be accepted.
- Reported speedups are roughly 3-5x end-to-end on Vicuna and Llama families on H100, with the target model's output distribution preserved.
Where EAGLE Fits
EAGLE-2 builds on the EAGLE family of speculative-decoding methods. Their core observation is that predicting hidden states (features) is easier than predicting tokens directly, because feature sequences are smoother and more regular than token sequences. A small autoregressive head, typically a one- or two-layer Transformer, predicts the next feature from the target model's penultimate hidden state (the one that feeds the LM head). That predicted feature is then passed through the target's LM head to produce a draft token.
Compared to drafting with an entirely separate small LLM, the feature-level approach gives higher acceptance rates because the draft head shares the target's representation space.
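The drafting loop described above can be sketched in a few lines. This is a toy under stated assumptions: the weights are random, a single linear layer stands in for the real Transformer head, and `draft_step`, `draft_chain`, and all sizes are hypothetical names chosen for illustration. Following the EAGLE paper, the sketch also feeds the embedding of the most recent token to the head alongside the feature.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB = 16, 50  # toy sizes, not taken from any real model

# Stand-ins for the frozen target model's parameters (random, illustration only).
embed = rng.normal(size=(VOCAB, HIDDEN))    # token embedding table
lm_head = rng.normal(size=(HIDDEN, VOCAB))  # target's LM head, reused for drafting

# Hypothetical draft head: one linear map over [feature ; token embedding].
# A real EAGLE head is a small Transformer decoder layer.
w_draft = rng.normal(size=(2 * HIDDEN, HIDDEN)) / np.sqrt(2 * HIDDEN)


def draft_step(feature, token):
    """Predict the next feature, then decode it with the target's LM head."""
    x = np.concatenate([feature, embed[token]])
    next_feature = np.tanh(x @ w_draft)
    logits = next_feature @ lm_head
    next_token = int(np.argmax(logits))  # greedy decoding, for simplicity
    return next_feature, next_token


def draft_chain(feature, token, depth):
    """Autoregress in feature space for `depth` steps, collecting draft tokens."""
    tokens = []
    for _ in range(depth):
        feature, token = draft_step(feature, token)
        tokens.append(token)
    return tokens


start_feature = rng.normal(size=HIDDEN)  # would come from the target's last forward pass
print(draft_chain(start_feature, token=3, depth=4))
```

The point of the sketch is the data flow: the head never produces logits itself, it only produces features, and the target's own LM head turns them into tokens.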
Dynamic Draft Trees
EAGLE-1 generates a fixed draft tree: a tree of candidate continuations with a static fanout. EAGLE-2 makes the tree dynamic: at each step, the draft head's confidence determines which branches are expanded and how wide the tree grows. Where the head is confident in one continuation, the tree stays narrow and goes deeper; where it is uncertain, the tree spreads over more candidates. The target model then verifies the whole tree in one parallel forward pass.
The result is higher average accepted-token count per verification step at the same verification cost.
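A minimal sketch of confidence-driven tree construction follows, assuming the two phases described in the EAGLE-2 paper: expand the most promising nodes at each level, then rerank all drafted nodes and keep a fixed budget. The names `build_tree`, `prefix_of`, and `toy_draft_probs` are hypothetical, and the toy distribution stands in for a real draft head. A node's score is the sum of draft log-confidences along its path.

```python
import heapq
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    token: str
    logp: float  # sum of draft log-confidences from the root to this node
    depth: int
    parent: Optional["Node"] = None


def prefix_of(node):
    """Tokens on the path from the root (exclusive) down to `node`."""
    out = []
    while node.parent is not None:
        out.append(node.token)
        node = node.parent
    return out[::-1]


def build_tree(draft_probs, depth, top_k, total_tokens):
    root = Node("<root>", 0.0, 0)
    frontier, candidates = [root], []
    for _ in range(depth):
        children = []
        for node in frontier:
            probs = draft_probs(prefix_of(node))
            for tok, p in heapq.nlargest(top_k, probs.items(), key=lambda kv: kv[1]):
                children.append(Node(tok, node.logp + math.log(p), node.depth + 1, node))
        candidates.extend(children)
        # Expansion: only the top_k highest-scoring new nodes are expanded further.
        frontier = heapq.nlargest(top_k, children, key=lambda n: n.logp)
    # Reranking: keep the best nodes overall. A child never outscores its parent,
    # and ties go to the shallower node, so the kept set stays a connected tree.
    candidates.sort(key=lambda n: (-n.logp, n.depth))
    return candidates[:total_tokens]


def toy_draft_probs(prefix):
    """Hypothetical draft head: confident early on, uncertain afterwards."""
    if not prefix:
        return {"1": 0.95, "2": 0.03, "3": 0.02}
    if prefix[-1] == "1":
        return {"2": 0.90, "0": 0.05, "1": 0.05}
    return {"a": 0.40, "b": 0.35, "c": 0.25}


for n in build_tree(toy_draft_probs, depth=3, top_k=2, total_tokens=4):
    print(prefix_of(n), round(math.exp(n.logp), 3))
```

With this toy distribution the kept tree is a single chain through the two confident steps and only branches at the third, uncertain step, which is the behaviour the section describes.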
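EAGLE-2 needs a draft head trained for each target model; pre-trained heads are published on Hugging Face for popular base models. Training a head from scratch is far cheaper than training the target, but the cost grows with target size: the EAGLE repository describes training as feasible within 1-2 days on 8 RTX 3090 GPUs, so budget for more than a few GPU-hours on larger models.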
Integration
- vLLM supports EAGLE-2 via the speculative-config interface.
- TensorRT-LLM ships an EAGLE-2 plugin in the speculative-decoding builder.
- SGLang supports EAGLE-2 natively.
- Pre-trained heads exist for Llama 2/3, Vicuna, Mistral and Qwen families.
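As an illustration of the vLLM route, the sketch below builds the dictionary passed to vLLM's `speculative_config` argument. The key names (`method`, `model`, `num_speculative_tokens`) and the example checkpoint names are assumptions based on recent vLLM documentation and may differ between versions, so check them against the installed release. The helper `eagle_speculative_config` is hypothetical and not part of vLLM.

```python
def eagle_speculative_config(draft_model: str, num_speculative_tokens: int = 4) -> dict:
    """Build a dict for vLLM's `speculative_config` argument.

    Key names are assumed from recent vLLM docs; verify them for your version.
    """
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    return {
        "method": "eagle",                # selects EAGLE-style drafting
        "model": draft_model,             # Hugging Face id or path of the draft head
        "num_speculative_tokens": num_speculative_tokens,
    }


config = eagle_speculative_config("yuhuili/EAGLE-LLaMA3-Instruct-8B")
print(config)

# Usage sketch (needs vLLM and a GPU, so it is left commented out):
# from vllm import LLM
# llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", speculative_config=config)
```

The draft head must match the target model it was trained against; pairing a head with a different base model or fine-tune will lower acceptance rates.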
Measured Speedup
Published numbers for EAGLE-2 cluster in the 3-5x end-to-end speedup range on chat workloads with batch size 1-8, distribution-preserving. At higher batch sizes the speedup falls toward the 1.5-2x range as the target forward pass becomes compute-bound and parallel verification stops being cheap.
The technique is therefore most effective at low to moderate concurrency, as in a typical interactive chat service, rather than on a high-throughput batch endpoint.
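The batch-size effect can be made concrete with a back-of-the-envelope cost model. This is an illustrative simplification, not a formula from the paper: it assumes each draft-and-verify cycle costs a number of draft passes plus one verification pass, all measured relative to one ordinary target decoding step. The function name and every number below are hypothetical.

```python
def estimated_speedup(tokens_per_cycle, draft_steps, draft_cost, verify_cost):
    """Rough speedup over plain decoding for one draft-and-verify cycle.

    tokens_per_cycle: average tokens emitted per cycle (accepted drafts plus one).
    draft_steps:      draft-head forward passes per cycle.
    draft_cost:       cost of one draft pass, relative to one target decoding step.
    verify_cost:      cost of verifying the tree, relative to one target decoding
                      step (close to 1 when memory-bound, larger when compute-bound).
    """
    return tokens_per_cycle / (draft_steps * draft_cost + verify_cost)


# Illustrative numbers only, not measurements.
low_batch = estimated_speedup(4.5, draft_steps=6, draft_cost=0.04, verify_cost=1.0)
high_batch = estimated_speedup(4.5, draft_steps=6, draft_cost=0.04, verify_cost=2.5)
print(f"memory-bound: {low_batch:.2f}x, compute-bound: {high_batch:.2f}x")
```

The accepted-token count is the same in both cases; only the relative price of verifying a tree of tokens changes, which is why the benefit shrinks as batches grow.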
Trade-offs
EAGLE-2 requires training and shipping an auxiliary head for each (base model, fine-tune) combination, which adds operational overhead. Output quality is unaffected, because the rejection-sampling verification step preserves the target distribution exactly, so the trade-off is purely an engineering one.
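The distribution-preserving property can be checked empirically on a toy example. The sketch below uses the standard single-token speculative sampling rule (accept a draft token `x` with probability `min(1, p[x] / q[x])`, otherwise resample from the normalized positive part of `p - q`), not EAGLE-2's tree-structured verification. The distributions `p` and `q` are made up for illustration.

```python
import numpy as np


def speculative_sample(p, q, rng):
    """Return one token distributed as p, using a draft proposal drawn from q."""
    x = rng.choice(len(q), p=q)                  # draft proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):     # target accepts with prob min(1, p/q)
        return int(x)
    residual = np.maximum(p - q, 0.0)            # on rejection, resample from (p - q)+
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))


rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])  # target distribution (toy)
q = np.array([0.2, 0.5, 0.3])  # draft distribution (toy, deliberately mismatched)

n = 20_000
samples = [speculative_sample(p, q, rng) for _ in range(n)]
freq = np.bincount(samples, minlength=len(p)) / n
print("empirical:", np.round(freq, 3), "target:", p)
```

Even with a badly mismatched draft distribution, the empirical frequencies track the target. A poor draft lowers the acceptance rate and therefore the speedup, but not the output quality.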
References
- EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees · arXiv (Li et al., 2024)
- EAGLE-1 · arXiv (Li et al., 2024)
- EAGLE on GitHub · GitHub