🏅 We had five main considerations: the architecture needed to be proven, scalable, efficient, and multilingual, and to exhibit emergent capabilities (e.g. zero-shot generalization)
⏰ At the >100B scale, every inefficiency matters! We can’t afford an unoptimized setup…
🤗 Thanks to a generous grant from @Genci_fr on #JeanZay, we had plenty of compute to benchmark our dream architecture.
📈 We ran our experiments with 1.3B models, pretraining on 100-300B tokens, to increase the likelihood our findings would transfer to the final >100B model.
📊 We focused on measuring zero-shot generalization with the EleutherAI LM harness, capturing performance across 27 diverse tasks. We also kept an eye on training speed.
🙅 No finetuning, because it would be difficult to maintain many sets of finetuned weights for the 176B model!
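For readers who want to run the same kind of zero-shot evaluation, here is a minimal sketch with the EleutherAI lm-evaluation-harness. It assumes a recent (0.4.x) release of the harness and a Hugging Face checkpoint; the model name and task list are illustrative placeholders, not the exact 27-task suite used here.

```python
# Hypothetical zero-shot evaluation sketch with the EleutherAI lm-evaluation-harness.
# Assumes lm-eval >= 0.4; the model and tasks below are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/gpt-neo-1.3B,dtype=float16",
    tasks=["lambada_openai", "hellaswag", "arc_easy", "piqa"],
    num_fewshot=0,                                       # zero-shot: no in-context examples
    batch_size=8,
)
print(results["results"])                                # per-task metrics (accuracy, etc.)
```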
🧱 We based our initial setup on the popular GPT-3 architecture (arxiv.org/abs/2005.14165), knowing it scales well and that it achieves excellent zero-shot performance.
🤔 Why not a BERT or a T5? That’s an excellent question we also answered, in a different paper!
🧪 We found causal decoder-only models to be best at zero-shot generalization immediately after pretraining, and that they can be efficiently adapted into non-causal models that also perform well after multitask finetuning (e.g. T0).
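To make the causal vs. non-causal distinction concrete, here is a minimal attention-mask sketch (not the BigScience code): a prefix-LM mask lets tokens in the prefix attend to each other bidirectionally, while the rest stays autoregressive.

```python
# Minimal attention-mask sketch (illustrative only): causal vs. non-causal (prefix-LM).
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True = attention allowed; position i only sees positions j <= i
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Non-causal adaptation: every position sees the full prefix bidirectionally,
    # and attention stays causal over the remaining (generated) positions.
    mask = causal_mask(seq_len)
    mask[:, :prefix_len] = True
    return mask

print(prefix_lm_mask(6, prefix_len=3).int())
```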
💾 Onto our benchmark: first, the influence of pretraining data. We trained three models for 112B tokens on OSCAR, C4, and The Pile.
➡️ Proper filtering and the addition of cross-domain high-quality data improves zero-shot generalization! Scale can't compensate for bad data...
🧮 Then, we moved on to the recent hot topic of positional embeddings: we compared no embeddings, learned, rotary, and ALiBi.
➡️ We find that ALiBi positional embeddings significantly outperform the alternatives (and, surprisingly, using no embedding at all isn’t that bad!)
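For intuition, a rough sketch of the ALiBi idea (Press et al.), not the exact BLOOM/Megatron implementation: instead of adding position vectors to the inputs, each attention head adds a linear penalty proportional to the query-key distance to its pre-softmax scores.

```python
# Rough ALiBi sketch (illustrative, not the production code).
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Head-specific slopes: a geometric sequence 2^(-8/n), 2^(-16/n), ...
    # (exact for power-of-two head counts)
    start = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # bias[h, i, j] = -slope_h * (i - j): keys further in the past are penalized more
    distance = torch.arange(seq_len)[:, None] - torch.arange(seq_len)[None, :]
    return -alibi_slopes(n_heads)[:, None, None] * distance.clamp(min=0).float()

scores = torch.randn(8, 128, 128)          # (heads, queries, keys) pre-softmax scores
scores = scores + alibi_bias(8, 128)       # then apply the causal mask and softmax as usual
```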
⛩️ We also looked at activation functions, and found that Gated Linear Units provide a small improvement.
⚠️ However, SwiGLU was also 30% slower! This is likely due to a problem in our setup, but it made us steer away from it for the 176B model.
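For reference, a hedged sketch of what a SwiGLU feed-forward block looks like (after Shazeer, 2020); illustrative only, since the slowdown above kept it out of the final model. Note the three projections: the hidden width is usually shrunk (e.g. to ~2/3) to keep the parameter count comparable to a standard GELU MLP.

```python
# Illustrative SwiGLU feed-forward block (not the production implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.w_up   = nn.Linear(d_model, d_ff, bias=False)   # value projection
        self.w_down = nn.Linear(d_ff, d_model, bias=False)   # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(gate) elementwise-multiplies the value path before projecting down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```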
🔬 One last architectural choice: we investigated the addition of a layer normalization after the embedding.
🤔 This enhances training stability, but comes at a notable cost for zero-shot generalization.
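Concretely, the tweak is just a LayerNorm applied to the token embeddings before the first transformer block; a minimal sketch (module names are assumptions for illustration):

```python
# Minimal sketch of embedding LayerNorm (module names assumed, for illustration only).
import torch.nn as nn

class EmbeddingWithLayerNorm(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.norm = nn.LayerNorm(d_model)   # the extra normalization studied above

    def forward(self, input_ids):
        return self.norm(self.embed(input_ids))
```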
🗺️ We also considered multilinguality, training a 1.3B model on 11 languages.
😥 Multilinguality comes at the expense of English-only performance, reducing it significantly. Our results are in line with the findings of XGLM, and also hold for the final model (stay tuned!)
🚀 We wrapped up all of these findings into the final @BigScience 176B model. If you want to learn more about how we decided on its size & shape, check out our previous thread:
💡 Can we learn challenging tasks without backpropagation? Scale a biologically-motivated method to hard datasets? Without *any* knowledge of the forward weights in the backward pass? Yes, We Can!
🧐 A central question in bio-inspired ML is the weight transport problem: the backward pass cannot realistically access information about the forward weights. While local learning has been demonstrated, methods devoid of weight transport fail on computer vision tasks.
[2/9]
👊 However, computer vision alone cannot be the litmus test of a training method. Thus, focusing on Direct Feedback Alignment (DFA), we conduct a large-scale study across four domains, eight tasks, and with eleven different architectures.
🤞 10,000 GPU-hours later and...
[3/9]