Support Eagle-3 Speculative Decoding in llama.cpp #15902
-
Absolutely necessary!
-
How important do you think this step is? My experience from some time ago with tree-based speculative decoding is that it does not bring much benefit over basic, linear speculative sampling. And at the same time it complicates the logic quite a bit.
Does the speculative layer maintain any additional state? Or is it purely a function of the 3 hidden states and the last token embedding?
-
Yes, the draft model / speculative layer consists of two FC layers, one decoder layer, and one LM head. During training, the parameters of these components are fine-tuned, resulting in a draft model checkpoint. This is why we need to convert the draft model checkpoint into GGUF format and add architecture support in llama.cpp as well.
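To make that component list concrete, here is a minimal PyTorch-style sketch of such a speculative layer. The module names, shapes, and exact fusion order are hypothetical illustrations of the description above, not the real Eagle-3 checkpoint layout, which should be read from the released weights and their conversion script.

```python
# Hypothetical sketch of an Eagle-3-style speculative layer: two FC layers,
# one decoder layer, and an LM head. Names, shapes, and wiring are illustrative.
import torch
import torch.nn as nn

class SpeculativeLayerSketch(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, n_heads: int = 8):
        super().__init__()
        # Fuse the concatenated low/mid/high target hidden states plus the
        # token embedding (4 * hidden_size) back down to hidden_size.
        self.fc1 = nn.Linear(4 * hidden_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        # A single self-attention block stands in for the trained decoder layer.
        self.decoder = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=n_heads, batch_first=True
        )
        # LM head projecting to the draft vocabulary.
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, h_low, h_mid, h_high, tok_emb):
        # All inputs: [batch, seq, hidden_size]
        fused = torch.cat([h_low, h_mid, h_high, tok_emb], dim=-1)
        x = torch.relu(self.fc2(torch.relu(self.fc1(fused))))
        x = self.decoder(x)                 # causal mask omitted for brevity
        return self.lm_head(x)              # draft logits: [batch, seq, vocab]
```

In llama.cpp terms, these are the tensors that would end up in the converted GGUF of the draft checkpoint, with a new architecture entry building the corresponding graph.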
I agree it will complicate the logic a lot. In theory, tree-based decoding verifies many candidate tokens from a candidate tree simultaneously, which can improve the acceptance rate of draft tokens, whereas linear sampling verifies only a single candidate sequence per step (one candidate token per position), resulting in suboptimal speculative accuracy.

I checked PR #3624. It uses static expansion of the token tree. Since the benefits of tree-based speculative decoding depend on model quality, it matters that Eagle-3 uses a tree attention mask during draft model training, which I believe aligns with tree-based sampling during inference.

In summary, tree-based sampling is necessary for Eagle-3, whereas tree-based candidate verification is not mandatory. Static tree sampling is acceptable for Eagle-3, but according to the paper, context-dependent dynamic tree sampling works better. Based on this, I propose that the initial implementation use static tree sampling combined with batched candidate sequence verification for simplicity. We can then iterate and improve from there.
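To illustrate the static-tree option, here is a small self-contained sketch (the tree topology and function name are hypothetical) of how position offsets and the tree attention mask for batched verification can be derived from a parent list: each draft node may attend only to itself and its ancestors, in addition to the already accepted prefix in the KV cache.

```python
# Minimal sketch of the attention mask needed to verify a static draft tree
# in one batched target forward pass. Topology and sizes are illustrative.

def tree_verification_mask(parents):
    """parents[i] is the parent node of draft node i, or -1 for direct children
    of the last accepted token. Returns (positions, mask) where positions[i] is
    the depth offset of node i and mask[i][j] is True iff node i may attend to
    node j (i.e. j is node i itself or one of its ancestors)."""
    n = len(parents)
    positions = []
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        depth, j = 0, i
        while j != -1:                 # walk up to the root, marking ancestors
            mask[i][j] = True
            j = parents[j]
            depth += 1
        positions.append(depth - 1)    # depth 0 = direct child of accepted token
    return positions, mask

# Example static tree: 3 candidates at depth 0, each with 2 children at depth 1.
parents = [-1, -1, -1, 0, 0, 1, 1, 2, 2]
pos, mask = tree_verification_mask(parents)
print(pos)        # [0, 0, 0, 1, 1, 1, 1, 1, 1]
print(mask[3])    # node 3 attends to itself and its ancestor node 0
```

A dynamic tree would only change how `parents` is chosen each step; the mask construction itself stays the same.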
-
Eagle-3 is currently the SOTA algorithm for speculative decoding, as demonstrated by Spec-bench and the Eagle-3 paper. However, llama.cpp does not yet support Eagle-3, while other major LLM inference frameworks such as TRT-LLM, vLLM, and SGLang already do, achieving around a 2-2.5x performance boost over native autoregressive decoding.
Furthermore, there are already several PRs and issues about implementing Eagle-3 in llama.cpp: #13908, #15305.
Many models with Eagle-3 checkpoints are already available on Hugging Face (link), and users can also fine-tune their own Eagle-3 checkpoints using TensorRT-Model-Optimizer.
Based on the above, I see a significant need to implement Eagle-3 in llama.cpp to potentially make LLM inference faster and llama.cpp more competitive. Therefore, I would like to initiate a discussion with the llama.cpp team to align on the goals and implementation.
To implement Eagle-3 in llama.cpp, several components need to be addressed (this may not be 100% accurate, and I am happy to receive any feedback on it):
Workflow: During inference, we need to record low-, middle-, and high-level features (the hidden states after the first, middle, and last decoder layers) in the forward pass of the target model. We then combine those hidden states with the token embedding and feed the result to the speculative layer. The speculative layer generates a sequence of draft tokens autoregressively, which the target model verifies in parallel.
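As a rough sketch of one such step, the following Python outlines the loop against hypothetical interfaces; `forward_with_features` and `draft` are made-up names standing in for whatever hooks llama.cpp would expose (e.g. access to per-layer hidden states around `llama_decode`), and the draft proposal is shown as a single linear sequence rather than a tree.

```python
# Hypothetical sketch of one linear speculative step with Eagle-3-style
# feature capture. Interface names are illustrative, not llama.cpp APIs.
from dataclasses import dataclass
from typing import List, Protocol, Tuple

@dataclass
class Features:
    low: List[float]    # hidden state after the first decoder layer
    mid: List[float]    # hidden state after the middle decoder layer
    high: List[float]   # hidden state after the last decoder layer

class TargetModel(Protocol):
    def forward_with_features(self, tokens: List[int]) -> Tuple[List[int], Features]:
        """Return greedy next-token predictions for every position, plus the
        captured low/mid/high hidden states."""
        ...

class DraftHead(Protocol):
    def draft(self, feats: Features, last_token: int, n_draft: int) -> List[int]:
        """Autoregressively propose n_draft tokens from the fused features."""
        ...

def speculative_step(target: TargetModel, draft_head: DraftHead,
                     ctx: List[int], n_draft: int = 4) -> List[int]:
    # 1. Target forward pass over the context, recording the three features.
    #    (In practice this is the verification pass of the previous step,
    #    so it is not an extra cost.)
    _, feats = target.forward_with_features(ctx)
    # 2. The speculative layer drafts a short continuation.
    proposal = draft_head.draft(feats, ctx[-1], n_draft)
    # 3. One batched target pass over context + proposal verifies all drafts.
    preds, _ = target.forward_with_features(ctx + proposal)
    accepted: List[int] = []
    for i, tok in enumerate(proposal):
        if preds[len(ctx) - 1 + i] != tok:   # first disagreement ends acceptance
            break
        accepted.append(tok)
    # The step always yields at least one token: the target's own prediction
    # after the last accepted draft token.
    accepted.append(preds[len(ctx) - 1 + len(accepted)])
    return accepted
```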
Since the Eagle-3 checkpoint is model-specific, I propose starting with llama3. I would appreciate your feedback on this.