Actually, gradient descent can be seen as attention that applies beyond the model's context length! Let me explain why 🧵 👇 (1/N)
Ref: arxiv.org/abs/2202.05798 arxiv.org/abs/2212.10559
If you have a linear layer F(x) := W_0x, gradient descent transforms it into F(x) = (W_0 + ΔW)x = W_0x + LinearAttn(E, X', x), where LinearAttn is linear attention over every token / timestep (X') from all observed training data, with their respective gradients (E) as the values.
That sounds extremely powerful! 🤯 (2/N)
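Here is a tiny numerical check of that duality (my own sketch, not from the papers): a single outer-product gradient update ΔW = Σᵢ eᵢ x'ᵢᵀ to a linear layer produces exactly the same output as unnormalized linear attention whose keys are the training tokens X' and whose values are the error signals E.

```python
import jax.numpy as jnp
from jax import random

k1, k2, k3, k4 = random.split(random.PRNGKey(0), 4)
d_in, d_out, n = 4, 3, 5
W0 = random.normal(k1, (d_out, d_in))   # pretrained linear layer
Xp = random.normal(k2, (n, d_in))       # rows: observed training tokens x'_i
E  = random.normal(k3, (n, d_out))      # rows: per-token gradient signals e_i
x  = random.normal(k4, (d_in,))         # a new test input

# Gradient-descent view: apply the updated weights W0 + ΔW, with ΔW = Σ_i e_i x'_iᵀ.
out_gd = (W0 + E.T @ Xp) @ x

# Attention view: W0 x plus linear attention with keys X', values E, query x.
out_attn = W0 @ x + E.T @ (Xp @ x)      # Σ_i e_i (x'_iᵀ x)

print(jnp.allclose(out_gd, out_attn, atol=1e-5))  # True: the two views coincide
```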
Dec 12, 2022 • 8 tweets • 5 min read
We have released "Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints"!
Our method converts a pretrained dense checkpoint into a Mixture-of-Experts model by copying its MLP layers into the experts and continuing training, which outperforms continued dense training.
arxiv.org/abs/2212.05055 (1/N)
As this figure shows, Sparse Upcycling outperforms the dense continuation on both ViT and T5 as a function of extra pretraining time. (2/N)
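For intuition, here is a minimal sketch of the upcycling step (my own shapes and a top-1 router; the paper's exact configuration may differ): each expert starts as a verbatim copy of the pretrained dense MLP, so the upcycled model initially computes the same function as the dense one and continues training from there.

```python
import jax
import jax.numpy as jnp
from jax import random

d_model, d_ff, num_experts = 8, 32, 4
k1, k2, k3 = random.split(random.PRNGKey(0), 3)

# Stand-ins for pretrained dense MLP weights loaded from a checkpoint.
W_in  = random.normal(k1, (d_model, d_ff))
W_out = random.normal(k2, (d_ff, d_model))

# Upcycle: every expert is an exact copy; only the router is newly initialized.
experts_in  = jnp.stack([W_in] * num_experts)    # (E, d_model, d_ff)
experts_out = jnp.stack([W_out] * num_experts)   # (E, d_ff, d_model)
W_router = 0.01 * random.normal(k3, (d_model, num_experts))

def moe_mlp(x):
    """Top-1 routed MoE block for a single token vector x."""
    e = jnp.argmax(x @ W_router)        # pick one expert per token
    h = jax.nn.relu(x @ experts_in[e])
    return h @ experts_out[e]
```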
Jul 11, 2022 • 12 tweets • 2 min read
Good high-level decision making is probably a key to a better life.
Here is a list of notable good and bad high-level decisions I’ve made in my life so far. (0/N)
When I was 14, I decided to move to America alone to become a researcher, drawn by its famed higher education. I was also pessimistic about the outlook of Japan.
It was very expensive but, in retrospect, the optimal choice! (1/N)
Apr 5, 2022 • 9 tweets • 2 min read
I'm so thankful that 15k people are following me 🥰
Now that I have a voice, let me talk about my largely overlooked paper on compute-optimal training, released in 2019, which proposed some scaling ideas before all the OAI & Google papers 👇 (1/N)
arxiv.org/abs/1906.06669
Idea 1: You can easily enlarge the pretraining dataset so that you only have to train for 1-3 epochs, which dramatically improves the performance-compute trade-off.
Most models were trained for >>10 epochs back then, so I emphasized this point. It's now standard practice. (2/N)
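The arithmetic behind this (a toy illustration of mine, not numbers from the paper): at a fixed token budget, fewer epochs over a larger dataset means far more unique tokens seen.

```python
token_budget = 100e9  # total tokens processed during training, held fixed

for epochs in (1, 3, 10, 40):
    unique = token_budget / epochs
    print(f"{epochs:>2} epochs -> {unique / 1e9:6.1f}B unique tokens")
# 1 epoch covers 100B unique tokens; 40 epochs only 2.5B at identical compute,
# so repeated data crowds out fresh data.
```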
Jun 9, 2021 • 4 tweets • 2 min read
Ben and I have released GPT-J, a 6B-parameter JAX-based Transformer LM 🥳
- Performs on par with 6.7B GPT-3
- Performs better and decodes faster than GPT-Neo
- repo + colab + free web demo
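A minimal usage sketch (mine, assuming the later Hugging Face `transformers` port; the actual release links are the JAX repo, Colab, and web demo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

inputs = tokenizer("GPT-J is a 6-billion-parameter", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```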