Support Eagle-3 Speculative Decoding in llama.cpp #15902
-
Absolutely necessary!
-
How important do you think this step is? My experience from some time ago with tree-based speculative decoding is that it does not bring much benefit over basic, linear speculative sampling. And at the same time it complicates the logic quite a bit.
Does the speculative layer maintain any additional state? Or is it purely a function of the 3 hidden states and the last token embedding?
-
Yes, the draft model / speculative layer consists of two FC layers, one decoder layer, and one LM head. During training, the parameters of these components are fine-tuned, resulting in a draft model checkpoint. This is why we need to convert the draft model checkpoint into GGUF format and add architecture support in llama.cpp as well.
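To make that component list concrete, here is a minimal PyTorch-style sketch of such a speculative layer. The module names, shapes, and exact fusion order are hypothetical illustrations of the description above, not the real Eagle-3 checkpoint layout, which should be read from the released weights and their conversion script.

```python
# Hypothetical sketch of an Eagle-3-style speculative layer: two FC layers,
# one decoder layer, and an LM head. Names, shapes, and wiring are illustrative.
import torch
import torch.nn as nn

class SpeculativeLayerSketch(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, n_heads: int = 8):
        super().__init__()
        # Fuse the concatenated low/mid/high target hidden states plus the
        # token embedding (4 * hidden_size) back down to hidden_size.
        self.fc1 = nn.Linear(4 * hidden_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        # A single self-attention block stands in for the trained decoder layer.
        self.decoder = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=n_heads, batch_first=True
        )
        # LM head projecting to the draft vocabulary.
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, h_low, h_mid, h_high, tok_emb):
        # All inputs: [batch, seq, hidden_size]
        fused = torch.cat([h_low, h_mid, h_high, tok_emb], dim=-1)
        x = torch.relu(self.fc2(torch.relu(self.fc1(fused))))
        x = self.decoder(x)                 # causal mask omitted for brevity
        return self.lm_head(x)              # draft logits: [batch, seq, vocab]
```

In llama.cpp terms, these are the tensors that would end up in the converted GGUF of the draft checkpoint, with a new architecture entry building the corresponding graph.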
I agree it will complicate the logic a lot. In theory, tree-based decoding verifies many candidate tokens from a candidate tree simultaneously, which can improve the acceptance rate of draft tokens, whereas linear sampling verifies only a single candidate sequence per step (one candidate token per position), resulting in suboptimal speculative accuracy.

I checked PR #3624. It uses static expansion of the token tree. Since the benefits of tree-based speculative decoding depend on model quality, it matters that Eagle-3 uses a tree attention mask during draft model training, which I believe aligns with tree-based sampling during inference.

In summary, tree-based sampling is necessary for Eagle-3, whereas tree-based candidate verification is not mandatory. Static tree sampling is acceptable for Eagle-3, but according to the paper, context-dependent dynamic tree sampling works better. Based on this, I propose that the initial implementation use static tree sampling combined with batched candidate sequence verification for simplicity. We can then iterate and improve from there.
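To illustrate the static-tree option, here is a small self-contained sketch (the tree topology and function name are hypothetical) of how position offsets and the tree attention mask for batched verification can be derived from a parent list: each draft node may attend only to itself and its ancestors, in addition to the already accepted prefix in the KV cache.

```python
# Minimal sketch of the attention mask needed to verify a static draft tree
# in one batched target forward pass. Topology and sizes are illustrative.

def tree_verification_mask(parents):
    """parents[i] is the parent node of draft node i, or -1 for direct children
    of the last accepted token. Returns (positions, mask) where positions[i] is
    the depth offset of node i and mask[i][j] is True iff node i may attend to
    node j (i.e. j is node i itself or one of its ancestors)."""
    n = len(parents)
    positions = []
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        depth, j = 0, i
        while j != -1:                 # walk up to the root, marking ancestors
            mask[i][j] = True
            j = parents[j]
            depth += 1
        positions.append(depth - 1)    # depth 0 = direct child of accepted token
    return positions, mask

# Example static tree: 3 candidates at depth 0, each with 2 children at depth 1.
parents = [-1, -1, -1, 0, 0, 1, 1, 2, 2]
pos, mask = tree_verification_mask(parents)
print(pos)        # [0, 0, 0, 1, 1, 1, 1, 1, 1]
print(mask[3])    # node 3 attends to itself and its ancestor node 0
```

A dynamic tree would only change how `parents` is chosen each step; the mask construction itself stays the same.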
-
Eagle-3 is currently the SOTA algorithm for speculative decoding, as demonstrated by Spec-bench and the Eagle-3 paper. However, llama.cpp does not yet support Eagle-3, while other major LLM inference frameworks such as TRT-LLM, vLLM, and SGLang already do, achieving around a 2-2.5x performance boost over native autoregressive decoding.
Furthermore, there are already several PRs and issues about implementing Eagle-3 in llama.cpp: #13908, #15305.
Many models with Eagle-3 checkpoints are already available on Hugging Face (link), and users can also fine-tune their own Eagle-3 checkpoints using TensorRT-Model-Optimizer.
Based on the above, I see a significant need to implement Eagle-3 in llama.cpp to potentially make LLM inference faster and llama.cpp more competitive. Therefore, I would like to initiate a discussion with the llama.cpp team to align on the goals and implementation.
To implement Eagle-3 in llama.cpp, several components need to be addressed (this may not be 100% accurate, and I am happy to receive any feedback on it):
Workflow: During inference, we need to record low-, middle-, and high-level features (the hidden states after the first, middle, and last decoder layers) in the forward pass of the target model. We then combine those hidden states with the token embedding and feed the result to the speculative layer. The speculative layer generates a sequence of draft tokens autoregressively, which the target model verifies in parallel.
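As a rough sketch of one such step, the following Python outlines the loop against hypothetical interfaces; `forward_with_features` and `draft` are made-up names standing in for whatever hooks llama.cpp would expose (e.g. access to per-layer hidden states around `llama_decode`), and the draft proposal is shown as a single linear sequence rather than a tree.

```python
# Hypothetical sketch of one linear speculative step with Eagle-3-style
# feature capture. Interface names are illustrative, not llama.cpp APIs.
from dataclasses import dataclass
from typing import List, Protocol, Tuple

@dataclass
class Features:
    low: List[float]    # hidden state after the first decoder layer
    mid: List[float]    # hidden state after the middle decoder layer
    high: List[float]   # hidden state after the last decoder layer

class TargetModel(Protocol):
    def forward_with_features(self, tokens: List[int]) -> Tuple[List[int], Features]:
        """Return greedy next-token predictions for every position, plus the
        captured low/mid/high hidden states."""
        ...

class DraftHead(Protocol):
    def draft(self, feats: Features, last_token: int, n_draft: int) -> List[int]:
        """Autoregressively propose n_draft tokens from the fused features."""
        ...

def speculative_step(target: TargetModel, draft_head: DraftHead,
                     ctx: List[int], n_draft: int = 4) -> List[int]:
    # 1. Target forward pass over the context, recording the three features.
    #    (In practice this is the verification pass of the previous step,
    #    so it is not an extra cost.)
    _, feats = target.forward_with_features(ctx)
    # 2. The speculative layer drafts a short continuation.
    proposal = draft_head.draft(feats, ctx[-1], n_draft)
    # 3. One batched target pass over context + proposal verifies all drafts.
    preds, _ = target.forward_with_features(ctx + proposal)
    accepted: List[int] = []
    for i, tok in enumerate(proposal):
        if preds[len(ctx) - 1 + i] != tok:   # first disagreement ends acceptance
            break
        accepted.append(tok)
    # The step always yields at least one token: the target's own prediction
    # after the last accepted draft token.
    accepted.append(preds[len(ctx) - 1 + len(accepted)])
    return accepted
```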
Since the Eagle-3 checkpoint is model-specific, I propose starting with llama3. I would appreciate your feedback on this.