Ali Behrouz
Intern @Google, Ph.D. Student @Cornell_CS, interested in machine learning.
May 30 · 11 tweets
What makes attention the critical component behind most advances in LLMs, and what holds back long-term memory modules (RNNs)? Can we strictly generalize Transformers?

Presenting Atlas (a powerful Titan): a new architecture with long-term in-context memory that learns how to memorize the context at test time. Atlas even outperforms Titans, and is more effective than Transformers and modern linear RNNs on language modeling tasks. It further improves the effective context length of Titans and scales to a 10M context window with +80% accuracy on the BABILong benchmark. (Bonus: building on the Atlas ideas, we also discuss another family of models that are strict generalizations of softmax attention.)

Paper: arxiv.org/pdf/2505.23735
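To make the idea of "memorizing the context at test time" concrete, here is a minimal PyTorch sketch of test-time memorization over a context window: a small MLP acts as the memory, and its weights are updated with a few gradient steps on an associative-memory loss over each window of (key, value) pairs. This only illustrates the general idea described above, not Atlas's exact update rule or optimizer; `MemoryMLP`, `memorize_window`, and all hyperparameters here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryMLP(nn.Module):
    """Small MLP whose weights serve as the long-term memory."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )

    def forward(self, k: torch.Tensor) -> torch.Tensor:
        return self.net(k)   # memory read-out M(k)


def memorize_window(memory: MemoryMLP, keys: torch.Tensor, values: torch.Tensor,
                    steps: int = 2, lr: float = 1e-2) -> None:
    """Test-time update: a few gradient steps on an associative-memory loss
    computed over a whole window of (key, value) pairs."""
    opt = torch.optim.SGD(memory.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(memory(keys), values)  # ~ sum_i ||M(k_i) - v_i||^2
        loss.backward()
        opt.step()


# Stream a long context window by window, memorizing each one, then query.
dim, window = 64, 16
memory = MemoryMLP(dim)
tokens = torch.randn(10 * window, dim)                 # stand-in for projected tokens
keys, values = tokens, torch.roll(tokens, 1, dims=0)   # toy key/value pairing
for start in range(0, tokens.size(0), window):
    memorize_window(memory, keys[start:start + window], values[start:start + window])

query = torch.randn(1, dim)
retrieved = memory(query)   # read-out from the memory learned at test time
```

Updating on a window of pairs (rather than one token at a time) is what lets the memory reconsider how each token is stored in light of its neighbors, which is the intuition behind the improved effective context length.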

Next, I’ll discuss the intuition behind the design, how Atlas works and is implemented, how these intuitions can help to design even more powerful attention mechanisms, and how these models perform in practice. (We have six new models in this paper!)
Jan 13 · 12 tweets
Attention has been the key component for most advances in LLMs, but it can’t scale to long context. Does this mean we need to find an alternative?

Presenting Titans: a new architecture with attention and a meta in-context memory that learns how to memorize at test time. Titans are more effective than Transformers and modern linear RNNs, and can effectively scale to context windows larger than 2M tokens, with better performance than ultra-large models (e.g., GPT-4, Llama3-80B).

@mirrokni Paper: arxiv.org/pdf/2501.00663…
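As a rough illustration of a memory that "learns how to memorize at test time," the sketch below updates a small neural memory with a gradient-based surprise signal, plus momentum and a decay (forgetting) factor. This is a simplified reading of the high-level description, not Titans' actual implementation; `NeuralMemory`, its `write()` method, and all hyperparameters are illustrative, and the combination with the attention branch is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NeuralMemory(nn.Module):
    """Small MLP memory, written to at test time via a surprise-based update."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )
        # One momentum buffer per parameter (the "past surprise").
        self.momentum = [torch.zeros_like(p) for p in self.net.parameters()]

    def forward(self, k: torch.Tensor) -> torch.Tensor:
        return self.net(k)   # memory read-out M(k)

    def write(self, k: torch.Tensor, v: torch.Tensor,
              lr: float = 1e-2, eta: float = 0.9, alpha: float = 0.01) -> None:
        """Test-time write. 'Surprise' is the gradient of ||M(k) - v||^2;
        eta adds momentum and alpha decays (forgets) old memory."""
        loss = F.mse_loss(self.net(k), v)
        grads = torch.autograd.grad(loss, list(self.net.parameters()))
        with torch.no_grad():
            for p, g, s in zip(self.net.parameters(), grads, self.momentum):
                s.mul_(eta).add_(g, alpha=-lr)   # S_t = eta * S_{t-1} - lr * grad
                p.mul_(1.0 - alpha).add_(s)      # W_t = (1 - alpha) * W_{t-1} + S_t


# Write a chunk of the context into memory, then read at query time; in the
# full model this read-out would be combined with a (windowed) attention branch.
dim = 64
mem = NeuralMemory(dim)
k, v = torch.randn(8, dim), torch.randn(8, dim)   # stand-ins for projected keys/values
mem.write(k, v)
q = torch.randn(1, dim)
out = mem(q)
```

The decay term is what keeps the memory from saturating on very long inputs, while the momentum term carries surprise across nearby tokens instead of reacting to each one in isolation.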

Next, I'll discuss our intuition for designing Titans, how we implement them efficiently, and experimental results (i.e., how they perform in different tasks).