EAGLE-2

TL;DR

Speculative decoding variant introduced by Li et al. (arXiv:2406.16858, 2024).
Predicts the next feature (hidden state) rather than the next token, using a small autoregressive head trained on top of the target model's penultimate-layer features.
Improves on EAGLE-1 with dynamic draft trees: the tree shape adapts per context, using the draft head's confidence to decide which branches to grow and keep.
Reported speedups: ~3-5x end-to-end on Vicuna and Llama families on H100, with the target model's output distribution preserved.

Where EAGLE Fits

EAGLE-2 builds on the EAGLE family of speculative-decoding methods, which observe that predicting hidden states (features) is easier than predicting tokens directly because feature sequences are smoother and have lower entropy. A small autoregressive head, typically a one- or two-layer Transformer, predicts the next feature from the target model's penultimate hidden state, and that predicted feature is fed back through the target's LM head to produce a draft token.
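The feature-level drafting loop described above can be sketched as follows. This is a toy illustration, not the EAGLE implementation: the weights are random, the draft head is a single linear map standing in for the small Transformer, and the names `draft_step`, `draft_chain`, and `w_draft` are hypothetical. Feeding the last token's embedding alongside the feature is an assumption about how the head is conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB = 16, 50

# Stand-ins for frozen target-model weights (random here, purely illustrative).
embed = rng.normal(size=(VOCAB, HIDDEN))
lm_head = rng.normal(size=(HIDDEN, VOCAB))

# Hypothetical draft head: one linear map over [feature ; token embedding].
# A real head is a small Transformer layer; a linear map keeps the sketch short.
w_draft = rng.normal(size=(2 * HIDDEN, HIDDEN)) / np.sqrt(2 * HIDDEN)


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def draft_step(feature, token_id):
    """Predict the next feature, then decode it with the target's LM head."""
    x = np.concatenate([feature, embed[token_id]])
    next_feature = np.tanh(x @ w_draft)
    probs = softmax(next_feature @ lm_head)
    next_token = int(np.argmax(probs))
    return next_feature, next_token, float(probs[next_token])


def draft_chain(feature, token_id, steps):
    """Autoregress in feature space, collecting (token, confidence) pairs."""
    out = []
    for _ in range(steps):
        feature, token_id, conf = draft_step(feature, token_id)
        out.append((token_id, conf))
    return out


chain = draft_chain(rng.normal(size=HIDDEN), token_id=3, steps=4)
```

Each draft step costs one pass through the small head plus one LM-head projection, which is why drafting several tokens stays cheap relative to a full target forward pass.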

Compared to drafting with an entirely separate small LLM, the feature-level approach gives higher acceptance rates because the draft head shares the target's representation space.

Dynamic Draft Trees

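EAGLE-1 generated a fixed draft tree: a tree of candidate continuations with a static fanout. EAGLE-2 makes the tree dynamic. It treats the draft head's confidence as an estimate of acceptance probability, expands the nodes whose cumulative path confidence is highest, and then reranks all drafted nodes to keep the most promising ones for verification. Where the head is confident, a single candidate usually suffices at that position; where it is uncertain, several sibling candidates survive. The target model verifies the whole tree in one parallel forward pass.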

The result is higher average accepted-token count per verification step at the same verification cost.
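A minimal sketch of confidence-driven tree building, assuming an expand-then-rerank scheme scored by the product of draft probabilities along each path. The function `build_dynamic_tree`, its parameters `expand_k` and `keep_n`, and the `toy_draft` distribution are all hypothetical names chosen for illustration; this is not the reference implementation.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Path = Tuple[int, ...]


def build_dynamic_tree(
    draft_fn: Callable[[Path], List[Tuple[int, float]]],
    depth: int,
    expand_k: int,
    keep_n: int,
) -> List[Tuple[Path, float]]:
    """Grow a draft tree by cumulative confidence, then keep the best nodes."""
    scores: Dict[Path, float] = {}
    frontier: List[Tuple[Path, float]] = [((), 1.0)]
    for _ in range(depth):
        # Expand only the expand_k frontier nodes with the highest path score.
        best = heapq.nlargest(expand_k, frontier, key=lambda node: node[1])
        frontier = []
        for path, score in best:
            for token, prob in draft_fn(path):
                child = path + (token,)
                scores[child] = score * prob
                frontier.append((child, scores[child]))
    # Rerank every drafted node; ties go to shallower paths so that a kept
    # node's ancestors are always kept as well.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], len(kv[0]), kv[0]))
    return ranked[:keep_n]


def toy_draft(path: Path) -> List[Tuple[int, float]]:
    """Hypothetical draft head: confident at even depths, uncertain at odd ones."""
    if len(path) % 2 == 0:
        return [(1, 0.9), (2, 0.1)]
    return [(1, 0.5), (2, 0.3), (3, 0.2)]


tree = build_dynamic_tree(toy_draft, depth=3, expand_k=2, keep_n=5)
paths = [path for path, _ in tree]
```

Because a child's score can never exceed its parent's, the reranked top `keep_n` nodes still form a connected tree, which is what lets the target model verify them in a single pass with a tree-shaped attention mask.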

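EAGLE-2 needs a per-model draft head; pre-trained heads are published on Hugging Face for popular base models. Training a head from scratch is far cheaper than training the target model, but the cost depends on model size and hardware: the EAGLE repository describes training as feasible within 1-2 days on 8 RTX 3090 GPUs.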

Integration

vLLM supports EAGLE-2 via the speculative-config interface.
TensorRT-LLM ships an EAGLE-2 plugin in the speculative-decoding builder.
SGLang supports EAGLE-2 natively.
Pre-trained heads exist for Llama 2/3, Vicuna, Mistral and Qwen families.
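As a rough illustration of the vLLM route, the snippet below builds a speculative-decoding config and renders it as a JSON command-line argument. The key names (`method`, `model`, `num_speculative_tokens`) and the `--speculative-config` flag are assumed to match vLLM's documentation at the time of writing and may differ between releases; the model path is a placeholder, the token count is an arbitrary example value, and `to_cli_flag` is a hypothetical helper.

```python
import json

# Assumption: key names follow vLLM's documented speculative-config schema;
# check the docs for the installed version before relying on them.
speculative_config = {
    "method": "eagle",
    "model": "path/or/hub-id/of/eagle-head",  # placeholder, not a real checkpoint
    "num_speculative_tokens": 4,              # illustrative value, tune per workload
}


def to_cli_flag(config: dict) -> str:
    """Render the config as a JSON argument for a --speculative-config flag."""
    return "--speculative-config '" + json.dumps(config, sort_keys=True) + "'"


flag = to_cli_flag(speculative_config)
```

Whether a given release drafts with a dynamic tree or a simpler chain of tokens is version-dependent, so it is worth confirming against the release notes before benchmarking.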

Measured Speedup

Published numbers for EAGLE-2 cluster in the 3-5x end-to-end speedup range on chat workloads at batch sizes 1-8, with the output distribution preserved (the paper's abstract reports 3.05x-4.26x). At higher batch sizes the speedup falls toward the 1.5-2x range as the target forward pass becomes compute-bound and parallel verification stops being cheap.
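The batch-size effect can be made concrete with a back-of-the-envelope model: speedup is roughly the number of tokens produced per draft-and-verify cycle divided by the cost of that cycle, measured in plain decode steps. The function name and all input values below are illustrative assumptions, not measurements.

```python
def estimated_speedup(tokens_per_cycle: float, verify_cost: float, draft_cost: float) -> float:
    """Tokens per cycle divided by cycle cost, both relative to plain decoding.

    tokens_per_cycle: mean tokens emitted per verification (accepted drafts plus one).
    verify_cost: cost of verifying the tree, in units of one plain decode step.
    draft_cost: total drafting cost per cycle, in the same units.
    """
    if verify_cost <= 0 or draft_cost < 0:
        raise ValueError("costs must be positive (verify) and non-negative (draft)")
    return tokens_per_cycle / (verify_cost + draft_cost)


# Memory-bound regime: verifying a tree costs about one ordinary decode step.
low_batch = estimated_speedup(tokens_per_cycle=4.5, verify_cost=1.0, draft_cost=0.25)

# Compute-bound regime: the extra tree tokens make verification notably costlier.
high_batch = estimated_speedup(tokens_per_cycle=4.5, verify_cost=2.5, draft_cost=0.25)
```

With these made-up inputs the estimate drops from about 3.6x to about 1.6x even though the acceptance behaviour is unchanged, which mirrors the trend described above.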

The technique is therefore most effective at low to moderate concurrency, as in a typical interactive chat service, rather than on a high-throughput batch endpoint.

Trade-offs

EAGLE-2 requires training and shipping an auxiliary head for each (base model, fine-tune) combination, which adds operational overhead. Output quality is unaffected, because the rejection-sampling verification step preserves the target distribution exactly, so the trade-off is purely an engineering one.
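The distribution-preservation claim can be checked numerically for the standard single-candidate speculative sampling rule: accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(p - q, 0). The sketch below computes the resulting output distribution exactly rather than by sampling. It assumes q is positive everywhere, and it covers only the single-candidate case; tree verification uses a multi-candidate extension of the same idea. The function name is hypothetical.

```python
import numpy as np


def speculative_output_distribution(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Exact output distribution of one accept-or-resample step.

    p: target-model distribution; q: draft distribution (assumed strictly positive).
    """
    accept = np.minimum(1.0, p / q)       # per-token acceptance probability
    accepted_mass = q * accept            # equals elementwise min(p, q)
    reject_prob = 1.0 - accepted_mass.sum()
    residual = np.maximum(p - q, 0.0)
    if residual.sum() > 0:
        residual = residual / residual.sum()  # resampling distribution on rejection
    return accepted_mass + reject_prob * residual


p = np.array([0.6, 0.3, 0.1])   # illustrative target distribution
q = np.array([0.2, 0.5, 0.3])   # illustrative draft distribution
out = speculative_output_distribution(p, q)
```

For any valid pair of distributions the result equals p, so a poor draft head lowers the acceptance rate and the speedup but does not change what the system samples.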

References

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees · arXiv:2406.16858 (Li et al., 2024)
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (EAGLE-1) · arXiv:2401.15077 (Li et al., 2024)
EAGLE on GitHub · GitHub
