Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25× compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
Scaling law experiments reveal a consistent 1.25× compute advantage across varying model sizes.
Analysis of training dynamics demonstrates how AttnRes naturally mitigates hidden-state magnitude growth and yields a more uniform gradient distribution across depth.
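For intuition, here is a minimal PyTorch sketch of the core idea as described above: each layer's residual input is an input-dependent softmax mixture over the outputs of all preceding layers, rather than their fixed uniform sum. The module structure, projection names, and driver loop are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class AttnResLayer(nn.Module):
    """Transformer layer whose residual stream is a learned, input-dependent
    attention over all preceding layer outputs (hypothetical sketch; this
    projection scheme is an assumption, not the paper's exact design)."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer                    # the usual attention/MLP block
        self.q_proj = nn.Linear(d_model, d_model)   # query from the current state
        self.k_proj = nn.Linear(d_model, d_model)   # keys from the layer history
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        # history holds h_0 .. h_{l-1}, each of shape (batch, seq, d_model).
        hist = torch.stack(history, dim=2)                   # (B, S, L, D)
        q = self.q_proj(x).unsqueeze(2)                      # (B, S, 1, D)
        k = self.k_proj(hist)                                # (B, S, L, D)
        # Softmax over depth: input-dependent weights replace the fixed,
        # uniform accumulation of a standard residual connection.
        w = torch.softmax((q * k).sum(-1) * self.scale, -1)  # (B, S, L)
        residual = (w.unsqueeze(-1) * hist).sum(dim=2)       # (B, S, D)
        return residual + self.sublayer(x)


# Illustrative driver: each layer appends to the history it attends over.
def run_stack(layers: list[AttnResLayer], x0: torch.Tensor) -> torch.Tensor:
    history, x = [x0], x0
    for layer in layers:
        x = layer(x, history)
        history.append(x)
    return x
```

Block AttnRes, per the description above, would replace the full per-layer `history` with a few compressed per-block summaries, so the depth dimension being attended over stays small as the network scales.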
🚀 Hello, Kimi K2! Open-Source Agentic Model!
🔹 1T total / 32B active MoE model
🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models
🔹 Strong in coding and agentic tasks
🤖 Multimodal & thought-mode not supported for now
With Kimi K2, advanced agentic intelligence is more open and accessible than ever. We can't wait to see what you build!
API is here: platform.moonshot.ai
- $0.15 / million input tokens (cache hit)
- $0.60 / million input tokens (cache miss)
- $2.50 / million output tokens
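As a quick sanity check on the pricing above, here is a short Python sketch; only the per-million-token prices come from this post, and the token counts are hypothetical:

```python
# Prices from the list above, in USD per million tokens.
PRICE_CACHE_HIT, PRICE_CACHE_MISS, PRICE_OUTPUT = 0.15, 0.60, 2.50

def request_cost(hit_tokens: int, miss_tokens: int, output_tokens: int) -> float:
    """USD cost of one request; the token counts passed in are hypothetical."""
    return (hit_tokens * PRICE_CACHE_HIT
            + miss_tokens * PRICE_CACHE_MISS
            + output_tokens * PRICE_OUTPUT) / 1_000_000

# e.g. a 20k-token cached prompt, 4k fresh prompt tokens, 1k output tokens:
print(f"${request_cost(20_000, 4_000, 1_000):.4f}")  # -> $0.0079
```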
Join the discussion & share feedback in our Discord: discord.gg/uGqNmXhNhM
To facilitate more research in the field, we are planning to open-source the base pretrained model as well as the reinforcement-learned model underlying Kimi-Researcher in the coming months.