Sebastian Ruder
Research scientist @DeepMindAI • Natural language processing • Transfer learning • Making ML & NLP accessible @eurnlp @DeepIndaba
13 Sep 19
It's great to see the growing landscape of NLP transfer learning libraries:
- pytorch-transformers by @huggingface: github.com/huggingface/py…
- spacy-pytorch-transformers by @explosion_ai: github.com/explosion/spac…
- FARM by @deepset_ai: github.com/deepset-ai/FARM
@huggingface @explosion_ai @deepset_ai @zalandoresearch @feedly @ai2_allennlp Here's a nice comparison of the target groups and core features of pytorch-transformers, spacy-pytorch-transformers, and FARM, put together by @deepset_ai.
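For context, here is a minimal sketch of the kind of workflow these libraries enable, using pytorch-transformers to extract contextual features from a pretrained BERT model (the snippet and model names are illustrative, not taken from the thread):

```python
# Minimal sketch, assuming `torch` and `pytorch_transformers` are installed.
# The example sentence and the choice of bert-base-uncased are illustrative.
import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Tokenize and map to vocabulary ids
tokens = tokenizer.tokenize("Transfer learning is changing NLP.")
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

# Extract contextual representations without computing gradients
with torch.no_grad():
    last_hidden_states = model(input_ids)[0]  # shape: (1, seq_len, 768)

print(last_hidden_states.shape)
```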
14 Aug 19
A Survey on Cross-lingual Word Embedding Models has been published in @JAIR_Editor. If you're interested in cross-lingual learning, then this should be a good starting point. It covers the history and points to interesting future directions.
jair.org/index.php/jair…
@JAIR_Editor For a more in-depth review that covers the most recent models, you can check out our book Cross-lingual Word Embeddings: morganclaypoolpublishers.com/catalog_Orig/p…
5 Jun 19
Coming up: A live Twitter thread of Session 8B: Machine Learning @NAACLHLT with some awesome papers on vocabulary size, subwords, Bayesian learning, multi-task learning, and inductive biases
@NAACLHLT First paper: How Large a Vocabulary Does Text Classification Need?
A Variational Approach to Vocabulary Selection aclweb.org/anthology/N19-…
@NAACLHLT Wenhu:
- Typically need to predefine vocabulary to get embeddings
- Most common approach: frequency-based cutoff; can lead to an under-sized or over-sized vocabulary (see the sketch below)
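As a point of reference, here is a minimal sketch of the frequency-based cutoff baseline the talk contrasts with (the toy corpus and threshold are made up for illustration and are not from the paper):

```python
# Minimal sketch of a frequency-based vocabulary cutoff (illustrative only).
from collections import Counter

corpus = [["the", "movie", "was", "great"], ["the", "plot", "was", "thin"]]
min_freq = 2  # cutoff: too high -> under-sized vocab, too low -> over-sized

counts = Counter(tok for sent in corpus for tok in sent)
vocab = {"<unk>": 0}
for tok, freq in counts.most_common():
    if freq >= min_freq:
        vocab[tok] = len(vocab)

# Tokens below the cutoff are mapped to <unk>
ids = [[vocab.get(tok, vocab["<unk>"]) for tok in sent] for sent in corpus]
print(vocab, ids)
```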
13 Sep 18
David Silver on Principles for Reinforcement Learning at the #DLIndaba2018. Important principles that are not only applicable to RL, but to ML research in general. E.g. leaderboard-driven research vs. hypothesis-driven research (see the slides below).
Principle 2. How an algorithm scales is more important than its starting point. Avoid performance ceilings. Deep Learning is successful because it scales so effectively.
Principles are meant to be controversial. I would argue that sample efficiency is at least as important.
Principle 3. Generality (how your algorithm performs on other tasks) is super important. Key is to design a diverse set of challenging tasks.
This. We should evaluate on out-of-distribution data and new tasks.
20 Jul 18
#Repl4NLP at #ACL2018 panel discussion:
Q: Given that the amount of data and computing power is rapidly increasing, should we just quit working on models altogether?
Yejin: Sounds like a good idea for the companies. The more data the better. Please create more data.
Meg: Different people have different strengths. People say: “We should all care about ethics”. Geek out about what you love. Apply yourself to what you love. Lots of other things come to bear besides just working with data, e.g. sociology, psychology, maths, etc.
Important to focus on what you really love. Work with people who have complementary and different interests.
Yoav: Personally don’t work on huge data. If some company would like to train a huge LM on the entire web, that’d be great to have and analyze.
5 Jun 18
All-star panel at the generalization in deep learning workshop at @NAACLHLT #Deepgen2018
: "We should have more inductive biases. We are clueless about how to add inductive biases so we do dataset augmentation, create pseudo training data to encode those biases. Seems like a strange way to go about doing things."
Yejin Choi: Language specific inductive bias is necessary to push NLG. Inductive bias as architectural choices. Current biases are not good at going beyond the sentence-level but language is about more than a sentence. We require building a world model.
31 Mar 18
1/ People (mostly people working with Computer Vision) say that CV is ahead of other ML application domains by at least 6 months to a year. I would like to explore why this is, whether it is something to be concerned about, and what it might take to catch up.
2/ I can’t speak about other application areas, so I will mostly compare CV vs. NLP. This is just a braindump, so feel free to criticize, correct, and disagree.
3/ First, is that really true? For many specialized applications that require task- or domain-specific tools, such as core NLP tasks (parsing, POS tagging, NER), comparing to another discipline is not meaningful.