Kimi introduces Attention Residuals: replacing fixed residual connections with softmax attention

Introducing Attention Residuals: Rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, Kimi introduces Attention Residuals, which replace standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

- Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
- Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
- Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
- Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
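
To make the core idea concrete, here is a rough sketch contrasting fixed residual accumulation with a softmax-weighted mix over preceding layer outputs. It is not taken from the paper: the function names, the use of a single `query` vector per layer, and the scaled dot-product scoring are all assumptions for illustration, and the Block AttnRes compression step is not shown.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def standard_residual(layer_outputs):
    # Fixed accumulation: every preceding output contributes with weight 1.
    dim = len(layer_outputs[0])
    return [sum(v[d] for v in layer_outputs) for d in range(dim)]

def attention_residual(layer_outputs, query):
    # Hypothetical depth-wise attention: score each preceding layer output
    # against a query vector (assumed here to be learned per layer), then
    # mix the outputs with the resulting softmax weights. The weights depend
    # on the input because the layer outputs themselves act as keys.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in layer_outputs])
    dim = len(layer_outputs[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, layer_outputs))
             for d in range(dim)]
    return mixed, weights

# Toy example: three preceding layer outputs of width 2.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # illustrative values, not from the paper

uniform_sum = standard_residual(outputs)
mixed, weights = attention_residual(outputs, query)
print("fixed residual sum:", uniform_sum)
print("attention weights:", weights)
print("attention mix:", mixed)
```

In this toy setup the softmax weights always sum to 1, so the mixed state stays on the scale of a single layer output instead of growing with depth, which is one way to read the "dilution and hidden-state growth" point above.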

Paper link: https://github.com/MoonshotAI/Attention-Residuals/blob/master/Attention_Residuals.pdf

submitted by /u/nekofneko