🚀 1/ Excited to share our report (with Aydar Bulatov and @yurakuratov) on scaling the Recurrent Memory Transformer to 2M (yes, two million!) 😮 tokens! 🧠🌐 #AI #NLP #DeepLearning
2/ 📈 We've tackled the quadratic complexity of attention in #Transformers by combining token-based memory with segment-level recurrence in the Recurrent Memory Transformer (RMT).
🔸 RMT can wrap any model from the Transformer family
🔸 Memory tokens provide the recurrent connection between segments (rough sketch below) 🎛️💡 #AI #NLP #DeepLearning
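A minimal PyTorch-style sketch of the idea (illustrative only, not our actual code; `backbone`, `mem_tokens`, and the memory handling are simplified assumptions):

```python
import torch

def rmt_forward(backbone, segments, mem_tokens):
    """Illustrative RMT pass: learned memory tokens are prepended to every
    segment, the unchanged Transformer backbone processes them jointly with
    the segment, and the updated memory is carried over to the next segment."""
    memory = mem_tokens                          # [num_mem, hidden] learned embeddings
    num_mem = memory.size(0)
    outputs = []
    for seg in segments:                         # seg: [seg_len, hidden]
        inp = torch.cat([memory, seg], dim=0)    # memory + segment in one sequence
        out = backbone(inp)                      # any Transformer encoder works here
        memory = out[:num_mem]                   # updated memory = recurrent state
        outputs.append(out[num_mem:])            # per-segment token outputs
    return outputs, memory
```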
3/ 🧠 We tested RMT's memorization capabilities on synthetic datasets that require fact memorization, detection, and reasoning. The model must separate the facts from irrelevant text and use them to answer questions, framed as 6-class classification. 🎯 #AI #NLP #DeepLearning
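The task format looks roughly like this (a toy illustration, not our generation code; the helper and field names are made up):

```python
import random

def make_sample(fact, question, answer_id, noise_sentences):
    """Toy illustration: hide a relevant fact inside long irrelevant filler
    text; the model must locate it and answer a 6-way classification question."""
    body = list(noise_sentences)                        # irrelevant background text
    body.insert(random.randrange(len(body) + 1), fact)  # fact at a random position
    return {"text": " ".join(body), "question": question, "label": answer_id}  # label in 0..5
```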
4/ 📊 In our experiments, we used a pretrained BERT model as the RMT backbone. We employed curriculum learning, starting with shorter tasks and increasing the length once training converged. This improved both the accuracy and stability of the model. 💪 #AI #NLP #DeepLearning
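Schematically, the curriculum looks something like this (a sketch with an assumed convergence criterion and hypothetical training/eval callables):

```python
def curriculum_train(model, make_loader, train_epoch, evaluate,
                     max_segments=7, patience=3):
    """Curriculum sketch: start with tasks spanning one segment and add a
    segment each time validation accuracy stops improving ('convergence')."""
    for n_segments in range(1, max_segments + 1):
        loader = make_loader(n_segments)     # data spanning n_segments segments
        best_acc, stale = 0.0, 0
        while stale < patience:              # crude convergence check
            train_epoch(model, loader)
            acc = evaluate(model, loader)
            best_acc, stale = (acc, 0) if acc > best_acc else (best_acc, stale + 1)
    return model
```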
5/ 📈 RMT's extrapolation abilities: models trained on only 7 segments generalize surprisingly well even to sequences of up to 2,043,904 tokens! 🔝🚀 #AI #NLP #DeepLearning
6/ 🍃 Computational efficiency: with a fixed segment length, RMT scales linearly in input length for any model size. For full attention, larger Transformers show slower quadratic growth (non-attention compute dominates), but RMT still needs far fewer FLOPs, up to 295x fewer! 🌟✂️ #AI #NLP #DeepLearning #Efficiency
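Back-of-envelope intuition for that scaling (assumed BERT-large-ish dimensions and a rough matmul-based FLOP estimate, so the resulting ratio is illustrative rather than the report's exact 295x figure):

```python
def transformer_flops(seq_len, hidden=1024, layers=24, ffn_mult=4):
    """Rough per-forward-pass cost of a Transformer encoder
    (back-of-envelope matmul counts, not the report's exact accounting)."""
    per_layer = (
        4 * seq_len * hidden**2                 # Q, K, V and output projections
        + 2 * seq_len**2 * hidden               # attention scores + weighted sum (quadratic term)
        + 2 * ffn_mult * seq_len * hidden**2    # feed-forward network
    )
    return layers * per_layer

def rmt_flops(total_len, seg_len=512, num_mem=10, **kw):
    """RMT: run the same backbone on each segment plus its memory tokens,
    so the cost grows linearly with total sequence length."""
    n_segments = -(-total_len // seg_len)       # ceil division
    return n_segments * transformer_flops(seg_len + num_mem, **kw)

L = 2_043_904
print(f"full attention: {transformer_flops(L):.2e}")
print(f"RMT, 512-token segments: {rmt_flops(L):.2e}")
print(f"ratio: ~{transformer_flops(L) / rmt_flops(L):.0f}x")
```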
7/ 🔍 Attention Patterns of Memory Operations: RMT's attention maps reveal specific patterns in memory operations during a reasoning task. 💡📚