
[Feature Request] Recurrent Depth Latent Reasoning #647

Description

@bitnom

This could have significant implications for the scaling performance of distributed inference, and potentially greater implications for distributing inference than a naive implementation would suggest (an initial thought/guess; citation needed). Hugging Face Transformers already supports the model, with one caveat noted on the model card:

> The model requires its own KV-cache implementation HuginnDynamicCache, otherwise the KV-caches of later calls to the recurrent block will overwrite the earlier ones.

It's unclear, though, whether this custom cache comes with trade-offs or unrealized potential.
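
To make that caveat concrete, here is a minimal sketch of a step-indexed KV cache. This only illustrates the failure mode, assuming a cache keyed by (layer, recurrence step); it is not the actual HuginnDynamicCache implementation, and the class and method names are hypothetical.

```python
import torch

class StepIndexedKVCache:
    """Hypothetical sketch, not the real HuginnDynamicCache: a cache keyed
    only by layer index would let recurrence step s clobber step s-1's
    entries, because the recurrent block reuses the same layers each step.
    Keying by (layer_idx, step) keeps every pass's entries separate."""

    def __init__(self):
        self._store = {}  # (layer_idx, step) -> (keys, values)

    def update(self, layer_idx: int, step: int,
               k: torch.Tensor, v: torch.Tensor):
        slot = (layer_idx, step)
        if slot in self._store:
            old_k, old_v = self._store[slot]
            # Append along the sequence dimension, as a normal KV cache does.
            k = torch.cat([old_k, k], dim=-2)
            v = torch.cat([old_v, v], dim=-2)
        self._store[slot] = (k, v)
        return self._store[slot]
```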

Having recently read bigscience-workshop/petals#483 and listened to the interview pod (linked below), I got curious about this. There are the obvious benefits, but I'm wondering more about distributing inference for a single request. It's a pipe dream until it isn't.

Papers

https://arxiv.org/abs/2502.05171
https://arxiv.org/abs/2402.14020
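
For context on what is actually being recurred: the first paper's architecture, as I read it, is a prelude that embeds tokens, a weight-shared core block iterated a variable number of times on a latent state, and a coda that decodes. A toy sketch loosely following that description; the module choices and shapes here are illustrative, not the paper's exact blocks:

```python
import torch
import torch.nn as nn

class DepthRecurrentLM(nn.Module):
    """Toy depth-recurrent LM: prelude -> r iterations of a shared core
    on a latent state (with the embedded input re-injected each step)
    -> coda. Loosely follows arXiv:2502.05171; details are illustrative."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8):
        super().__init__()
        self.prelude = nn.Embedding(vocab_size, d_model)
        self.core = nn.TransformerEncoderLayer(d_model, n_heads,
                                               batch_first=True)
        self.coda = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids, num_steps: int = 8):
        e = self.prelude(input_ids)      # (batch, seq, d_model)
        h = torch.randn_like(e)          # random initial latent state
        for _ in range(num_steps):       # same weights every iteration:
            h = self.core(h + e)         # depth comes from recurrence,
        return self.coda(h)              # not from stacking more layers
```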

POC Model: https://huggingface.co/tomg-group-umd/huginn-0125
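
A minimal usage sketch for the POC model, assuming the num_steps knob described on the model card (the exact argument names and generation config may differ; check the repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code pulls in the model's custom modeling code,
# including its HuginnDynamicCache.
tokenizer = AutoTokenizer.from_pretrained("tomg-group-umd/huginn-0125")
model = AutoModelForCausalLM.from_pretrained(
    "tomg-group-umd/huginn-0125", trust_remote_code=True
)

inputs = tokenizer("The capital of Westphalia is", return_tensors="pt")
# num_steps sets the recurrence depth, i.e. compute spent per token
# (assumed from the model card; verify the exact signature).
outputs = model.generate(**inputs, max_new_tokens=32, num_steps=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```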

Code

https://github.com/seal-rg/recurrent-pretraining

https://github.com/gair-nlp/prox

Interview Pod: https://www.youtube.com/watch?v=dY90DXLi0vk

