1/ You can shrink a language model's KV cache by 200×, in a single forward pass, and it still answers correctly.
At 256k context that's 36 GiB of cache down to ~360 MiB, with no change to the base model.
Here's how we did it 👇
2/ The KV cache is the wall everyone hits with long-horizon LLMs, e.g. multi-day agents, repo-scale reasoning, long tool chains. It grows linearly with context and you can't get around it.
So far you've had two bad options
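For a sense of scale, here is a rough sketch of the linear growth described above. The thread does not name a model, so the config is a hypothetical one (36 layers, 8 KV heads, head dim 128, fp16), picked only because it lands on 36 GiB at 256k tokens; `kv_cache_bytes` is an illustrative helper, not anything from the authors' code.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size in bytes of an uncompressed KV cache for a single sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

GIB = 2**30
MIB = 2**20

# Hypothetical config (an assumption, not taken from the thread).
cfg = dict(num_layers=36, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

# Doubling the context doubles the cache: growth is linear in seq_len.
for seq_len in (32 * 1024, 128 * 1024, 256 * 1024):
    size = kv_cache_bytes(seq_len=seq_len, **cfg)
    print(f"{seq_len:>7} tokens: {size / GIB:.1f} GiB")

# Naive accounting for a 200x reduction at 256k tokens.
full = kv_cache_bytes(seq_len=256 * 1024, **cfg)
compressed_mib = full / 200 / MIB
print(f"200x smaller: {compressed_mib:.0f} MiB")
```

Under this simple accounting, 36 GiB divided by 200 comes to roughly 184 MiB, while ~360 MiB would correspond to about a 100× reduction. The thread does not say how its figure is measured, so treat the numbers here as illustrative only.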
Jan 7, 2025 • 11 tweets • 3 min read
New preprint! In transformers, we often describe the Q/K/V maps in an ad hoc way, but we show these linear self-attention components form a "parametric endofunctor" in a 2-category of linear maps. This takes cues from @bgavran3 et al.’s programme extended to transformers. (1/10)
@bgavran3 et al. argue that deep learning frameworks either impose constraints (GDL) or specify tensor ops (RNNs/Transformers). We need a single theory bridging both views - hence category theory. (2/10)
arxiv.org/abs/2402.15332