Aran Komatsuzaki
ML research & startup with @EnricoShippole
Feb 6, 2023
Actually, gradient descent can be seen as attention that applies beyond the model's context length! Let me explain why 🧵 👇 (1/N)

Ref:
arxiv.org/abs/2202.05798
arxiv.org/abs/2212.10559

If you have a linear layer F(x) := W_0 x, gradient descent transforms it into the form shown in this image (reconstructed below), where LinearAttn is attention over every token / timestep (X’) from all observed training data, with their respective gradients (E) as values.

That sounds extremely powerful! 🤯 (2/N)
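
The image isn't reproduced here; the identity below is a sketch of the dual form from arxiv.org/abs/2212.10559, written in the tweet's notation (W_0 the pretrained weights, X’ the training inputs, E their error/gradient signals), not a verbatim copy of the figure.

```latex
% Dual form of a gradient-updated linear layer (sketch, not the original figure):
% gradient descent on F(x) = W_0 x accumulates the update \Delta W = \sum_i e_i x_i'^\top,
% which acts on a new input x exactly like linear attention over the training
% tokens X' with the error signals E as values.
\begin{align*}
  \Delta W &= \sum_i e_i \, {x'_i}^{\top} \\
  F(x) &= (W_0 + \Delta W)\,x
        = W_0 x + \sum_i e_i \big({x'_i}^{\top} x\big)
        = W_0 x + \operatorname{LinearAttn}(E, X', x)
\end{align*}
```

Because X’ ranges over every token seen during training, not just the current context, this is the sense in which gradient descent "attends" beyond the context length.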
Dec 12, 2022
We have released "Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints"!

Our method converts a pretrained dense model into an MoE by copying its MLP layers into the experts and continuing training, which outperforms continued dense training.

arxiv.org/abs/2212.05055 (1/N)

As this figure shows, Sparse Upcycling outperforms the dense continuation on ViT and T5 in terms of extra pretraining time. (2/N)
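
For concreteness, here is a minimal sketch of the upcycling step. It is not the paper's implementation (which upcycles ViT and T5 MLP layers); the PyTorch framing, expert count, and top-k routing below are illustrative assumptions.

```python
# Minimal sparse-upcycling sketch (not the paper's code): turn one pretrained
# dense MLP into a Mixture-of-Experts layer by copying its weights into every
# expert and adding a freshly initialized router.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoE(nn.Module):
    def __init__(self, dense_mlp: nn.Module, d_model: int,
                 num_experts: int = 8, top_k: int = 1):
        super().__init__()
        # Each expert starts as an exact copy of the pretrained dense MLP.
        self.experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)  # the only newly initialized part
        self.top_k = top_k

    def forward(self, x):                              # x: [tokens, d_model]
        logits = self.router(x)                        # [tokens, num_experts]
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # normalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                  # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

# Usage: replace a transformer block's MLP with UpcycledMoE(block.mlp, d_model)
# and continue training from the dense checkpoint.
```

Because every expert starts as an exact copy of the dense MLP and the routing weights sum to 1, the upcycled layer initially computes the same function as the dense checkpoint; continued training is what differentiates the experts.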
Jul 11, 2022
Good high-level decision making is probably a key to a better life.

Here is a list of notable good and bad high-level decisions I’ve made in my life so far. (0/N)

When I was 14, I decided to move to America alone to become a researcher, drawn by its famed higher education. I was also pessimistic about the outlook of Japan.

It was very expensive but the optimal choice in retrospect! (1/N)
Apr 5, 2022
I'm so thankful that 15k people are following me 🥰

Now that I have a voice, let me talk about my largely overlooked paper on compute-optimal training, released in 2019, which proposed some scaling ideas before all the OAI & Google papers 👇 (1/N)

arxiv.org/abs/1906.06669

Idea 1: You can easily enlarge the pretraining dataset so that you only have to train for 1~3 epochs, which dramatically improves the performance-compute trade-off.

Most models were trained for >>10 epochs back then, so I emphasized this point. It's now standard practice. (2/N)
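
As a toy back-of-the-envelope illustration (the numbers are made up, not from the paper): a fixed compute budget buys a roughly fixed number of training tokens, so enlarging the corpus directly cuts the number of repeated epochs.

```python
# Toy arithmetic, not from the paper: fixed token budget, varying corpus size.
token_budget = 300e9                        # training tokens the compute budget allows
for corpus_tokens in (10e9, 100e9, 300e9):  # hypothetical corpus sizes
    epochs = token_budget / corpus_tokens
    print(f"corpus = {corpus_tokens / 1e9:.0f}B tokens -> {epochs:g} epochs")
```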
Jun 9, 2021
Ben and I have released GPT-J, a 6B-parameter JAX-based Transformer LM 🥳

- Performs on par with 6.7B GPT-3
- Performs better and decodes faster than GPT-Neo
- repo + colab + free web demo

article: bit.ly/2TH8yl0
repo: bit.ly/3eszQ6C
Colab: bit.ly/3w0fB6n
demo: bit.ly/3psRCdM

- Trained on 400B tokens with TPU v3-256 for five weeks
- GPT-J performs much closer to GPT-3 of similar size than GPT-Neo does
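
The release itself is the JAX repo linked above; below is a minimal usage sketch, assuming the later Hugging Face Transformers port and the "EleutherAI/gpt-j-6B" checkpoint name rather than the original JAX code.

```python
# Minimal generation sketch (assumes the Hugging Face Transformers port of GPT-J;
# the hub id "EleutherAI/gpt-j-6B" is the commonly used checkpoint name, not part
# of the original JAX release).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

prompt = "GPT-J is a 6B-parameter language model that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```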