Description
Potentially significant implications for scaling distributed inference, and possibly greater implications than a naive implementation would suggest (an initial thought/guess on my part; citation needed). Transformers supports the model via a custom cache:

"The model requires its own KV-cache implementation, HuginnDynamicCache, otherwise the KV-caches of later calls to the recurrent block will overwrite the earlier ones."

but I have no idea whether that approach involves sacrifices or leaves potential unrealized.
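To make the quoted constraint concrete: since the same block of layers is executed several times per token, a cache indexed only by layer would be clobbered on the second recurrence pass. Below is a toy sketch of that idea, not the actual HuginnDynamicCache; the class and method names are made up.

```python
# Toy illustration only (not the actual HuginnDynamicCache): the cache is keyed
# by (layer_idx, recurrence_step) so that repeated passes through the shared
# recurrent block do not overwrite each other's entries.
import torch


class RecurrenceAwareKVCache:
    def __init__(self):
        # (layer_idx, recurrence_step) -> (keys, values)
        self._store: dict[tuple[int, int], tuple[torch.Tensor, torch.Tensor]] = {}

    def update(self, layer_idx: int, step: int, keys: torch.Tensor, values: torch.Tensor):
        """Append new keys/values for this layer at this recurrence step."""
        slot = (layer_idx, step)
        if slot in self._store:
            old_k, old_v = self._store[slot]
            keys = torch.cat([old_k, keys], dim=-2)      # concat along the sequence axis
            values = torch.cat([old_v, values], dim=-2)
        self._store[slot] = (keys, values)
        return self._store[slot]

    def get(self, layer_idx: int, step: int):
        return self._store.get((layer_idx, step))


# With a plain per-layer cache, the second recurrence pass (step=1) would
# clobber what step=0 wrote for the same layer; here the two stay separate.
cache = RecurrenceAwareKVCache()
k = v = torch.zeros(1, 4, 3, 8)  # (batch, heads, seq, head_dim)
cache.update(layer_idx=0, step=0, keys=k, values=v)
cache.update(layer_idx=0, step=1, keys=k, values=v)
assert cache.get(0, 0) is not None and cache.get(0, 1) is not None
```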
Having recently read bigscience-workshop/petals#483 and listened to the pod, I got curious about this. There are the obvious benefits, but I'm wondering more about distributing inference for a single request. It's a pipe-dream until it isn't.
Papers
https://arxiv.org/abs/2502.05171
https://arxiv.org/abs/2402.14020
POC Model: https://huggingface.co/tomg-group-umd/huginn-0125
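For anyone who wants to poke at the POC checkpoint, a minimal loading sketch: the only thing relied on here is standard Transformers custom-code loading via trust_remote_code; how the recurrence depth is controlled at generation time (e.g. a num_steps-style argument) is an assumption of mine and should be verified against the model card.

```python
# Minimal sketch for loading the POC checkpoint; the custom model code ships on
# the Hub, so trust_remote_code=True is required. The exact knob for setting
# test-time recurrence depth is an assumption -- check the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tomg-group-umd/huginn-0125"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype="auto")

prompt = "Recurrent-depth transformers scale test-time compute by"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```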
Code
https://github.com/seal-rg/recurrent-pretraining
https://github.com/gair-nlp/prox
Interview Pod: https://www.youtube.com/watch?v=dY90DXLi0vk