Jan P. Harries
Co-Founder & CEO @ ellamind / #DiscoResearch / Retweets&favs are stuff i find interesting, not endorsements

Jul 23, 21 tweets

Live tweeting the most interesting insights from @Meta's new Llama3 paper

1. How did they arrive at a 405B model trained on ~15T tokens?
"Extrapolation of the resulting scaling law to 3.8 × 1025 FLOPs suggests training a 402B parameter model on 16.55T tokens." 👇🧵

2. The paper contains a surprisingly detailed description of the network topology for their 24k H100 cluster @dylan522p

3. Two Llama3-405b training interruptions were actually caused by the "Server Chassis" failing (someone sitting on the Rack? 😆) - and 148 poor H100s died during pre-training...

4. Meta adjusted training data for various reasons during training - apparently with good success:

5. Didn't see this one before: @Meta's post-training pipeline uses pairwise-annotated preference data both to train a Reward Model for early-stage Rejection Sampling and to improve intermediate SFT models with DPO 🤯 - SPIN on steroids!? 😉
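
Roughly, the rejection-sampling half of that loop looks like this (a minimal sketch with made-up `policy` / `reward_model` interfaces, not Meta's actual code):

```python
def rejection_sample(prompt, policy, reward_model, k=8):
    """Sketch of RM-based rejection sampling: draw k candidates from the current
    policy, keep the one the reward model scores highest, and feed the resulting
    (prompt, best_response) pairs back in as SFT data. `policy.generate` and
    `reward_model.score` are illustrative stand-ins."""
    candidates = [policy.generate(prompt) for _ in range(k)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    best = candidates[scores.index(max(scores))]
    return best  # SFT on these, then DPO on top of the resulting checkpoint
```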

6. In contrast to the NeMo paper (Reward Model > LLM-as-a-judge), they find high disagreement rates between the two approaches and choose to include top-rated samples from either judge
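
The "keep whatever either judge likes" selection could look roughly like this (a hedged sketch; the top-fraction threshold and scoring interfaces are illustrative, not from the paper):

```python
def select_training_samples(samples, rm_scores, judge_scores, top_frac=0.1):
    """Rank candidates separately by reward-model score and by LLM-as-a-judge
    score, then keep the union of the two top sets. Also reports how much the
    two judges disagree on what counts as 'top'."""
    n_top = max(1, int(len(samples) * top_frac))
    by_rm = sorted(range(len(samples)), key=lambda i: rm_scores[i], reverse=True)
    by_judge = sorted(range(len(samples)), key=lambda i: judge_scores[i], reverse=True)
    top_rm, top_judge = set(by_rm[:n_top]), set(by_judge[:n_top])
    keep = top_rm | top_judge
    disagreement = 1 - len(top_rm & top_judge) / n_top
    return [samples[i] for i in keep], disagreement
```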

7. Specific capabilities (like Coding or Multilingual generation) were improved by branching off the pre-training run and training experts for these capabilities that are then used to annotate/critique samples later on 🧐

8. Translated data was avoided (I've been arguing that translating instruction data degrades output quality and avoided translations for finetuning since the EM German models more than a year ago; reassuring to see them reach the same conclusion 🙂)

9. Not enough high-quality prompts? 🤔
Just "ask humans" (which is a nice description for highly-trained, very expensive math professionals 😄) to generate some more.... if you´re @Meta 🥲.
(Apart from that: impressive effort... stepwise reward models 🤩)

10. They find it's fine to use only short-context data for DPO when training long-context models 🚓🚓🚓

11. Impressive multi-step tool-usage trajectories are easier to train if you have annotators providing granular feedback at the message level 👍🛠️💰

12. A post-training procedure to reduce hallucinations was employed, aligning the model to "know what it knows" - will be interesting to see how well this works (we often used counterfactual/fabricated statements specifically to train for RAG applications)
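
A very rough sketch of the "know what it knows" idea (the helpers `make_question`, `is_correct`, `generate` are hypothetical; the paper's actual pipeline probes pre-training data and generates refusals for consistently wrong answers):

```python
def build_factuality_data(model, snippets, judge):
    """Generate questions grounded in pre-training snippets, sample the model's
    own answers, and keep refusal-style targets where the answers are judged
    wrong - so the model learns to abstain instead of hallucinating."""
    sft_examples = []
    for snippet in snippets:
        question = judge.make_question(snippet)      # question answerable from the snippet
        answer = model.generate(question)
        if judge.is_correct(answer, snippet):
            sft_examples.append((question, answer))  # reinforce what it actually knows
        else:
            sft_examples.append((question, "I'm not sure about that."))  # teach abstention
    return sft_examples
```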

13. For some reading comprehension tasks (SQuAD and RACE), Llama3 405B actually trails Mixtral 8x22B 🧐 (this is the un-finetuned base model)

14. Often proposed but rarely done: they included robustness checks for permutations of order/prompt/choices in MC benchmarks 👏
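
The basic check is easy to picture (a sketch; `model.answer_mc` is a stand-in for whatever evaluation harness is used):

```python
import random

def permutation_robustness(model, question, choices, correct_idx, n_perms=8):
    """Re-ask the same multiple-choice question with the answer options shuffled
    and measure how often the model still picks the correct one."""
    hits = 0
    for _ in range(n_perms):
        order = random.sample(range(len(choices)), len(choices))
        shuffled = [choices[i] for i in order]
        pred = model.answer_mc(question, shuffled)   # index into `shuffled`
        hits += (order[pred] == correct_idx)
    return hits / n_perms   # 1.0 = fully order-invariant on this item
```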

15. While achieving impressive scores throughout ALL exams (wtf will happen to school exams? 🙈), Llama3 405B blows away the competition in AP Physics (92.9 vs 78.6 for Claude 3.5 🤯) - and also achieves top scores for GMAT Quant, LSAT, AP Env Sci and ... Art History 😁

16. Human preference data shows weaker multilingual and multi-turn coding performance, especially vs. GPT-4o; however, GPT-4o seems to be VERY optimized for human preference (see LMSys results), so this doesn't necessarily result in worse performance on real-world tasks...

17. The paper even includes a new (?) row-wise FP8 quantization approach and a reward-score based evaluation of its (unsurprisingly) negligible impact on output quality 👌
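
Row-wise scaling is simple to sketch (illustrative only - this mirrors the idea, not Meta's kernel - and assumes a recent PyTorch with float8 dtypes):

```python
import torch

def rowwise_fp8_quant(weight: torch.Tensor):
    """One scale per row, so an outlier in one row doesn't blow up the dynamic
    range of the others; dequantize with q.to(weight.dtype) * scales."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / fp8_max
    q = (weight / scales).to(torch.float8_e4m3fn)
    return q, scales
```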

18. How can we get LLMs to better understand multimodal data? The detailed description of @Meta's approach will surely help the next generation of open-source multimodal models (esp. video) 👋. What a shame that these won't be available in the EU 🥺

19. At this point it's no surprise that the vision capabilities also seem to be excellent - with Llama 3-V 8b rivaling the original GPT-4V 😀. I'm wondering if a finetuned/quantized/optimized Llama 3-V 8b version will be the most-used OCR engine going forward?

20. What else? Llama3 can also understand speech better than Whisper and seamlessly switch between languages in conversation. The section on prosody modeling (tone/speed variation) is also a first, I think? Really interesting and novel stuff 🤗

21. Link to the full paper here: 📜👋scontent-bru2-1.xx.fbcdn.net/v/t39.2365-6/4…
