Kimi Linear: a novel attention architecture that beats full attention on both speed and quality, ready to serve as a drop-in replacement for full attention, with our KDA kernels open-sourced! Kimi Linear cuts KV cache usage by up to 75% and delivers up to 6x decoding throughput at a 1M context length.
Key highlights:
🔹 Kimi Delta Attention: A hardware-efficient linear attention mechanism that refines the gated delta rule (sketched below).
🔹 Kimi Linear Architecture: The first hybrid linear architecture to surpass pure full attention quality across the board.
🔹 Empirical Validation: Scaled, fair comparisons + open-sourced KDA kernels, vLLM integration, and checkpoints.
The future of agent-oriented attention is here! 💡
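For intuition, here is a minimal sketch of the plain gated delta rule recurrence that KDA refines (KDA adds finer-grained gating and a hardware-efficient chunked kernel; the function names, shapes, and scalar gates below are illustrative simplifications, not the open-sourced kernel's API):

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One step of a gated delta rule recurrence (illustrative only).

    S     : [d_k, d_v] fast-weight state -- a fixed-size matrix that
            replaces the ever-growing KV cache of full attention
    k, v  : [d_k], [d_v] current key and value
    alpha : forget gate in (0, 1], decays stale associations
    beta  : write strength in (0, 1]
    """
    S = alpha * S                          # gate: decay the old state
    pred = S.T @ k                         # value currently stored under k
    S = S + beta * np.outer(k, v - pred)   # delta rule: correct toward v
    return S

def readout(S, q):
    # Decoding queries the constant-size state: O(1) memory per step.
    return S.T @ q
```

Because the state S stays the same size regardless of sequence length, per-token decoding cost is constant, which is where the KV-cache and long-context throughput gains come from.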
Meet the Kimi K2.6 agent: video hero sections, WebGL shaders, real backends. From one prompt.
🔹 Video hero sections - cinematic aesthetic, auto-composited
🔹 WebGL shader animations - native GLSL / WGSL, liquid metal, caustics, raymarching
🔹 Motion design - GSAP + Framer Motion
🔹 Backend & database - Kimi wires up auth + database + backend in one pass
🔹 Website stack - React 19 + TypeScript + Vite + Tailwind + shadcn/ui
🔹 3D w/ physically-based lighting - Three.js + React Three Fiber
Video hero sections, built right in.
K2.6 agent calls video generation APIs to create real cinematic footage for your hero, not stock placeholders. Composited into the page, synced to scroll, with shader overlays.
Speaks fluent WebGL shader.
Writes GLSL / WGSL directly - fragment shaders, vertex shaders, noise, SDF, raymarching. Prompt: "a liquid-metal hero with soft caustics."
Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
Scaling law experiments reveal a consistent 1.25× compute advantage across varying model sizes.
Analysis of training dynamics demonstrates how AttnRes naturally mitigates hidden-state magnitude growth and yields a more uniform gradient distribution across depth.
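As a rough illustration of the idea (a toy sketch under our own naming and simplifications, not the paper's reference implementation): each layer forms an input-dependent softmax mixture over the outputs of all preceding layers and uses that mixture as its residual, in place of the fixed x + f(x) accumulation.

```python
import torch
import torch.nn as nn

class ToyAttnResLayer(nn.Module):
    """Toy layer with a learned residual over preceding layers (illustrative)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                               nn.Linear(d_model, d_model))
        self.q_proj = nn.Linear(d_model, d_model)  # query from current state
        self.k_proj = nn.Linear(d_model, d_model)  # keys from layer history

    def forward(self, x: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        # history: outputs of all preceding layers, each [batch, seq, d_model]
        h = torch.stack(history, dim=2)                     # [b, s, L, d]
        q = self.q_proj(x).unsqueeze(2)                     # [b, s, 1, d]
        scores = (q * self.k_proj(h)).sum(-1)               # [b, s, L]
        w = torch.softmax(scores / h.shape[-1] ** 0.5, -1)  # attention over depth
        residual = (w.unsqueeze(-1) * h).sum(dim=2)         # learned mixture
        return residual + self.f(x)                         # replaces x + f(x)
```

Block AttnRes would then restrict history to compressed per-block summaries rather than every layer's full output, which is what keeps cross-layer attention cheap at scale.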
🚀 Hello, Kimi K2! Open-Source Agentic Model!
🔹 1T total / 32B active MoE model
🔹 SOTA on SWE-bench Verified, Tau2 & AceBench among open models
🔹 Strong in coding and agentic tasks
⚠️ Multimodal & thought mode are not supported for now
With Kimi K2, advanced agentic intelligence is more open and accessible than ever. We can't wait to see what you build!
🔗 API is here: platform.moonshot.ai
- $0.15 / million input tokens (cache hit)
- $0.60 / million input tokens (cache miss)
- $2.50 / million output tokens
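For a back-of-the-envelope cost estimate, pro-rating the per-million rates above per token (a hedged sketch; actual billing granularity and cache rules are defined by the platform, not this snippet):

```python
# USD per 1M tokens, from the pricing above
RATES = {"cache_hit": 0.15, "cache_miss": 0.60, "output": 2.50}

def estimate_cost(cached_in: int, fresh_in: int, out: int) -> float:
    """Estimate request cost in USD from token counts."""
    per_million = lambda n, key: n / 1_000_000 * RATES[key]
    return (per_million(cached_in, "cache_hit")
            + per_million(fresh_in, "cache_miss")
            + per_million(out, "output"))

# e.g. 80k cached prompt tokens, 20k fresh prompt tokens, 4k output tokens:
print(f"${estimate_cost(80_000, 20_000, 4_000):.3f}")  # -> $0.034
```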
Join the discussion & share feedback in our Discord: discord.gg/uGqNmXhNhM
To facilitate more research efforts in the field, we plan to open-source the base pretrained model as well as the reinforcement-learned model underlying Kimi-Researcher in the coming months.