Actually, gradient descent can be seen as attention that applies beyond the model's context length! Let me explain why 🧵 👇 (1/N)
Ref: arxiv.org/abs/2202.05798 arxiv.org/abs/2212.10559
If you have a linear layer F(x) := W_0x, gradient descent transforms it into F(x) = (W_0 + ΔW)x = W_0x + LinearAttn(E, X', x), where LinearAttn is linear attention over every token / timestep (X') from all observed training data, with their respective gradients (E) as the values.
That sounds extremely powerful! 🤯 (2/N)
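Here is a tiny numerical check of that duality (my own sketch, not from the papers): a single outer-product gradient update ΔW = Σᵢ eᵢ x'ᵢᵀ to a linear layer produces exactly the same output as unnormalized linear attention whose keys are the training tokens X' and whose values are the error signals E.

```python
import jax.numpy as jnp
from jax import random

k1, k2, k3, k4 = random.split(random.PRNGKey(0), 4)
d_in, d_out, n = 4, 3, 5
W0 = random.normal(k1, (d_out, d_in))   # pretrained linear layer
Xp = random.normal(k2, (n, d_in))       # rows: observed training tokens x'_i
E  = random.normal(k3, (n, d_out))      # rows: per-token gradient signals e_i
x  = random.normal(k4, (d_in,))         # a new test input

# Gradient-descent view: apply the updated weights W0 + ΔW, with ΔW = Σ_i e_i x'_iᵀ.
out_gd = (W0 + E.T @ Xp) @ x

# Attention view: W0 x plus linear attention with keys X', values E, query x.
out_attn = W0 @ x + E.T @ (Xp @ x)      # Σ_i e_i (x'_iᵀ x)

print(jnp.allclose(out_gd, out_attn, atol=1e-5))  # True: the two views coincide
```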
Dec 12, 2022 • 8 tweets • 5 min read
We have released "Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints"!
Our method converts a pretrained dense checkpoint into a Mixture-of-Experts model by copying its MLP layers into the experts and continuing training, which outperforms continued dense training.
arxiv.org/abs/2212.05055 (1/N)
As this figure shows, Sparse Upcycling outperforms the dense continuation on both ViT and T5 as a function of extra pretraining time. (2/N)
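For intuition, here is a minimal sketch of the upcycling step (my own shapes and a top-1 router; the paper's exact configuration may differ): each expert starts as a verbatim copy of the pretrained dense MLP, so the upcycled model initially computes the same function as the dense one and continues training from there.

```python
import jax
import jax.numpy as jnp
from jax import random

d_model, d_ff, num_experts = 8, 32, 4
k1, k2, k3 = random.split(random.PRNGKey(0), 3)

# Stand-ins for pretrained dense MLP weights loaded from a checkpoint.
W_in  = random.normal(k1, (d_model, d_ff))
W_out = random.normal(k2, (d_ff, d_model))

# Upcycle: every expert is an exact copy; only the router is newly initialized.
experts_in  = jnp.stack([W_in] * num_experts)    # (E, d_model, d_ff)
experts_out = jnp.stack([W_out] * num_experts)   # (E, d_ff, d_model)
W_router = 0.01 * random.normal(k3, (d_model, num_experts))

def moe_mlp(x):
    """Top-1 routed MoE block for a single token vector x."""
    e = jnp.argmax(x @ W_router)        # pick one expert per token
    h = jax.nn.relu(x @ experts_in[e])
    return h @ experts_out[e]
```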
Jul 11, 2022 • 12 tweets • 2 min read
Good high-level decision making is probably a key to a better life.
Here is a list of notable good and bad high-level decisions I’ve made in my life so far. (0/N)
When I was 14, I decided to move to America alone to become a researcher, drawn by its famed higher education. I was also pessimistic about the outlook of Japan.
It was very expensive but, in retrospect, the optimal choice! (1/N)
Apr 5, 2022 • 9 tweets • 2 min read
I'm so thankful that 15k people are following me 🥰
Now that I have a voice, let me talk about my largely overlooked paper on compute-optimal training, released in 2019, which proposed some scaling ideas before all the OAI & Google papers 👇 (1/N)
arxiv.org/abs/1906.06669
Idea 1: You can easily enlarge the pretraining dataset so that you only have to train for 1-3 epochs, which dramatically improves the performance-compute trade-off.
Most models were trained for >>10 epochs back then, so I emphasized this point. It's now standard practice. (2/N)
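The arithmetic behind this (a toy illustration of mine, not numbers from the paper): at a fixed token budget, fewer epochs over a larger dataset means far more unique tokens seen.

```python
token_budget = 100e9  # total tokens processed during training, held fixed

for epochs in (1, 3, 10, 40):
    unique = token_budget / epochs
    print(f"{epochs:>2} epochs -> {unique / 1e9:6.1f}B unique tokens")
# 1 epoch covers 100B unique tokens; 40 epochs only 2.5B at identical compute,
# so repeated data crowds out fresh data.
```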
Jun 9, 2021 • 4 tweets • 2 min read
Ben and I have released GPT-J, a 6B-parameter JAX-based Transformer LM 🥳
- Performs on par with 6.7B GPT-3
- Performs better and decodes faster than GPT-Neo
- repo + colab + free web demo
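A minimal usage sketch (mine, assuming the later Hugging Face `transformers` port; the actual release links are the JAX repo, Colab, and web demo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

inputs = tokenizer("GPT-J is a 6-billion-parameter", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```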