Julien Launay · May 24 · 14 tweets · 10 min read
🌸 The @BigScienceLLM BLOOM 176B parameters model training has just passed 230B tokens: that’s more than a million books in two months!

🤔 But how did we decide what model to train with our one million GPU hours?

⬇️ Thread time! #acl2022nlp
🏅 We had five main considerations: it needed to be proven, scalable, efficient, multilingual, and to exhibit emergent capabilities (e.g. zero-shot generalization)

⏰ At the >100B scale, every inefficiency matters! We can’t afford an unoptimized setup…
🤗 Thanks to a generous grant from @Genci_fr on #JeanZay, we had plenty of compute to benchmark our dream architecture.

📈 We ran our experiments with 1.3B models, pretraining on 100-300B tokens, to increase the likelihood our findings would transfer to the final >100B model.
📊 We focused on measuring zero-shot generalization with the EleutherAI LM harness, capturing performance across 27 diverse tasks. We also kept an eye on training speed.
🙅 No finetuning, because it would be difficult to maintain many sets of finetuned weights for the 176B model!
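The zero-shot protocol above can be sketched in plain Python: score each candidate answer by its log-likelihood under the model and pick the best one, with no gradient updates. The `toy_loglikelihood` stand-in is purely illustrative (a real harness queries an actual LM); its name and heuristic are assumptions, not part of the EleutherAI harness.

```python
def zero_shot_classify(loglikelihood, prompt, choices):
    """Zero-shot evaluation in the style of the EleutherAI LM harness:
    rank candidate completions by log-likelihood, no finetuning involved."""
    scores = [loglikelihood(prompt, choice) for choice in choices]
    return choices[scores.index(max(scores))]

# Toy stand-in for a real model's conditional log-likelihood.
def toy_loglikelihood(prompt, completion):
    # Pretend the model favors completions sharing words with the prompt,
    # lightly penalized by length.
    overlap = len(set(prompt.lower().split()) & set(completion.lower().split()))
    return overlap - 0.1 * len(completion.split())

pred = zero_shot_classify(
    toy_loglikelihood,
    "the movie was wonderful",
    ["wonderful positive", "terrible negative"],
)  # picks "wonderful positive"
```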
🧱 We based our initial setup on the popular GPT-3 architecture (arxiv.org/abs/2005.14165), knowing it scales well and that it achieves excellent zero-shot performance.

🤔 Why not a BERT or a T5? That’s an excellent question we also answered, in a different paper!
👀As a teaser for now: arxiv.org/abs/2204.05832

🧪 We found causal decoder-only models to be best at zero-shot immediately after pretraining, and that it’s possible to adapt them efficiently into non-causal models to also get good performance on multitask finetuning (e.g. T0).
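The causal vs non-causal distinction boils down to the attention mask. A minimal sketch (plain Python lists, 1 = "may attend"): a causal decoder only looks backwards, while the non-causal "prefix-LM" adaptation lets tokens inside a designated prefix attend bidirectionally.

```python
def causal_mask(n):
    # Causal decoder-only: token i may attend only to positions j <= i.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def prefix_mask(n, prefix_len):
    # Non-causal (prefix-LM) variant: tokens inside the prefix attend
    # bidirectionally; tokens after the prefix remain causal.
    return [[1 if (i < prefix_len and j < prefix_len) or j <= i else 0
             for j in range(n)] for i in range(n)]
```

With `prefix_mask(4, 2)`, the first two tokens see each other in both directions, which is what makes the adapted model better suited to encoding an input for multitask finetuning.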
💾 Onto our benchmark: first, the influence of pretraining data. We trained three models for 112B tokens on OSCAR, C4, and The Pile.

➡️ Proper filtering and the addition of cross-domain, high-quality data improve zero-shot generalization! Scale can't compensate for bad data...
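To make "proper filtering" concrete, here is a hypothetical heuristic filter in the spirit of pretraining-data cleaning (the thresholds and function name are illustrative assumptions, not the actual OSCAR/C4/Pile pipelines): drop very short documents and documents dominated by non-alphabetic characters such as markup or number tables.

```python
def keep_document(text, min_words=10, min_alpha_ratio=0.8):
    # Hypothetical quality heuristics: reject documents that are too short
    # or whose visible characters are mostly non-alphabetic (boilerplate,
    # markup, tables of numbers).
    words = text.split()
    if len(words) < min_words:
        return False
    alpha = sum(c.isalpha() for c in text)
    visible = sum(not c.isspace() for c in text)
    return visible > 0 and alpha / visible >= min_alpha_ratio
```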
🧮 Then, we moved on to the recent hot topic of positional embeddings: we compared no embeddings, learned, rotary, and ALiBi.

➡️ We find that ALiBi significantly outperforms the other positional embeddings (and, surprisingly, using no embedding at all isn't that bad!)
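ALiBi needs no embedding vectors at all: it adds a head-specific linear penalty, proportional to the query-key distance, directly to the attention scores before the softmax. A minimal sketch (for a number of heads that is a power of two, the slopes form the geometric sequence 2^(-8/n), 2^(-16/n), ...):

```python
def alibi_slopes(n_heads):
    # Head-specific slopes; assumes n_heads is a power of two.
    return [2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)]

def alibi_bias(slope, seq_len):
    # Linear distance penalty added to causal attention scores before
    # the softmax; future positions stay masked out.
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)] for i in range(seq_len)]
```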
⛩️ We also looked at activation functions, and found that Gated Linear Units provide a small improvement.

⚠️ However, SwiGLU was also 30% slower! This is likely due to a problem in our setup, but it made us steer away from it for the 176B model.
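For reference, SwiGLU gates one linear projection of the input with a Swish-activated second projection, which is also why it costs extra parameters and compute compared to a plain MLP. A minimal plain-Python sketch (toy matrix-vector products, no learned bias terms):

```python
import math

def swish(x):
    # Swish / SiLU activation: x * sigmoid(x).
    return x / (1 + math.exp(-x))

def swiglu(x, W, V):
    # SwiGLU: swish(x @ W) elementwise-times (x @ V).
    # W and V are two separate projections, hence the extra cost.
    xw = [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]
    xv = [sum(xi * vij for xi, vij in zip(x, col)) for col in zip(*V)]
    return [swish(a) * b for a, b in zip(xw, xv)]
```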
🔬 One last architectural choice: we investigated the addition of a layer normalization after the embedding.

🤔 This enhances training stability, but comes at a notable cost for zero-shot generalization.
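The change itself is tiny: normalize the embedding outputs before they enter the first transformer block. A minimal sketch (learned scale/shift omitted for brevity; the function names are illustrative):

```python
import math

def layer_norm(v, eps=1e-5):
    # Standard LayerNorm over one embedding vector.
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def embed(token_ids, table, use_embed_ln=True):
    # Embedding lookup with an optional LayerNorm applied right after it,
    # the stability/zero-shot trade-off discussed above.
    vecs = [table[t] for t in token_ids]
    return [layer_norm(v) for v in vecs] if use_embed_ln else vecs
```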
🗺️ We also considered multilinguality, training a 1.3B model on 11 languages.

😥 Multilinguality significantly reduces English-only performance. Our results are in line with the findings of XGLM, and were also verified on the final model (stay tuned!)
🚀 We wrapped up all of these findings into the final @BigScience 176B model. If you want to learn more about how we decided on its size & shape, check out our previous thread:
⤵️ If you are interested in all the nitty-gritty details: openreview.net/forum?id=rI7BL…

👨‍🏫If you are at #acl2022nlp, our poster: underline.io/events/284/ses…

🌸 And check out the BigScience workshop on Friday for more open LLM goodness: bigscience.huggingface.co/acl-2022
