Latest Twitter Threads by @charles0neill on Thread Reader App

Jun 10 • 10 tweets • 3 min read

1/ You can shrink a language model's KV cache by 200×, in a single forward pass, and it still answers correctly.

At 256k context that's 36 GiB of cache down to ~360 MiB, with no change to the base model.

Here's how we did it 👇

2/ The KV cache is the wall everyone hits with long-horizon LLMs, e.g. multi-day agents, repo-scale reasoning, long tool chains. It grows linearly with context and you can't get around it.
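To make the linear growth concrete, here is a rough sketch of the usual size estimate: two tensors (K and V) per layer, per KV head, per token. The model dimensions below (36 layers, 8 KV heads, head dim 128, fp16) are assumptions chosen for illustration and are not stated in the thread; with them, 256k tokens comes out at 36 GiB.

```python
GIB = 2**30

def kv_cache_bytes(seq_len, n_layers=36, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate KV cache size in bytes for one sequence.

    All default dimensions are hypothetical (not taken from the thread).
    """
    # Factor of 2: one key vector and one value vector per layer, head, and token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    # Cost per token is constant, so the total is linear in context length.
    return seq_len * per_token

if __name__ == "__main__":
    for ctx in (32 * 1024, 128 * 1024, 256 * 1024):
        print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / GIB:.1f} GiB")
```

Doubling the context doubles the cache, which is why long contexts run out of memory well before the weights become the problem.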

So far you've had two bad options

Jan 7, 2025 • 11 tweets • 3 min read

New preprint! In transformers, we often describe the Q/K/V maps in an ad hoc way, but we show these linear self-attention components form a "parametric endofunctor" in a 2-category of linear maps. This takes cues from @bgavran3 et al.’s programme extended to transformers. (1/10)

@bgavran3 et al. argue that deep learning frameworks either impose constraints (GDL) or specify tensor ops (RNNs/Transformers). We need a single theory bridging both views, hence category theory. (2/10) arxiv.org/abs/2402.15332
