Jan P. Harries
Jul 23 · 21 tweets · 8 min read
Live-tweeting the most interesting insights from @Meta's new Llama3 paper

1. How did they arrive at a 405B model trained with ~15T tokens?
"Extrapolation of the resulting scaling law to 3.8 × 10²⁵ FLOPs suggests training a 402B parameter model on 16.55T tokens." 👇🧵
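As a sanity check, the quoted extrapolation is roughly consistent with the common C ≈ 6·N·D approximation (training FLOPs ≈ 6 × parameters × tokens). A minimal sketch, using only the two numbers from the quote:

```python
# Sanity check of the scaling-law extrapolation via the common
# C ~ 6 * N * D approximation (training FLOPs ~ 6 x params x tokens).
C = 3.8e25   # compute budget in FLOPs (from the quote above)
N = 402e9    # compute-optimal parameter count (from the quote above)

D = C / (6 * N)  # implied token count
print(f"implied tokens ≈ {D / 1e12:.2f}T")  # → implied tokens ≈ 15.75T
```

This lands at ≈15.8T tokens, in the same ballpark as the paper's 16.55T (which comes from their fitted scaling law rather than the 6ND identity).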
2. The paper contains a surprisingly detailed description of the network topology for their 24k H100 cluster @dylan522p
3. Two Llama3-405b training interruptions were actually caused by the "Server Chassis" failing (someone sitting on the Rack? 😆) - and 148 poor H100s died during pre-training...
4. Meta adjusted training data for various reasons during training - apparently with good success:
5. Didn't see this one before: @Meta's post-training pipeline uses pairwise-annotated preference data both to train (and use) a Reward Model for early-stage Rejection Sampling and to improve intermediate SFT models with DPO 🤯 - SPIN on steroids!? 😉
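A minimal sketch of that loop; `sample_fn`, `reward`, and the preference-data format are hypothetical stand-ins, not the paper's actual interfaces:

```python
def rejection_sample(prompt, sample_fn, reward, k=8):
    """Rejection sampling: draw k candidate responses and keep the one the
    reward model scores highest; the winners become SFT training data."""
    candidates = [sample_fn(prompt) for _ in range(k)]
    return max(candidates, key=lambda r: reward(prompt, r))

def dpo_pairs(preference_data):
    """The same pairwise annotations that trained the reward model are reused
    directly as (prompt, chosen, rejected) triples for DPO."""
    return [(ex["prompt"], ex["chosen"], ex["rejected"]) for ex in preference_data]
```

The point is the double use: one round of human preference annotation feeds both the reward model (and thus rejection sampling for SFT data) and the DPO objective on the intermediate checkpoints.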
6. In contrast to the NeMo paper (Reward Model > LLM-as-a-judge), they find high disagreement rates between the two approaches and choose to include top-rated samples from either judge
7. Specific capabilities (like Coding or Multilingual generation) were improved by branching off the pre-training run and training experts for these capabilities that are then used to annotate/critique samples later on 🧐
8. Translated data was avoided (I've been advocating that translating instruction data deteriorates output quality and didn't use translations for finetuning, starting with the EM German models more than a year ago; reassuring to see them come to the same conclusion 🙂)
9. Not enough high-quality prompts? 🤔
Just "ask humans" (which is a nice description for highly-trained, very expensive math professionals 😄) to generate some more... if you're @Meta 🥲.
(Apart from that: impressive effort... stepwise reward models 🤩)
10. They find that it's fine to use only short-context data for DPO, even for long-context models 🚓🚓🚓
11. Impressive: multi-step tool-usage trajectories are easier to train if you have annotators providing granular feedback at the message level 👍🛠️💰
12. A post-training procedure to reduce hallucinations was employed, aligning the model to "know what it knows" - it will be interesting to see how well this works (we often used counterfactual/fabricated statements specifically to train for RAG applications)
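The described "know what it knows" loop might look roughly like this; all four callables are hypothetical placeholders, not the paper's components:

```python
def factuality_example(snippet, make_question, model_answer, is_correct):
    """Sketch of a 'know what it knows' data generator: derive a question from
    a pretraining snippet, let the model answer it, and keep either the correct
    answer or a refusal as the finetuning target."""
    question = make_question(snippet)
    answer = model_answer(question)
    if is_correct(snippet, answer):
        return {"prompt": question, "target": answer}            # reinforce known facts
    return {"prompt": question, "target": "I don't know that."}  # calibrated refusal
```

The key design choice is that refusal targets are generated from the model's own failures, so the model is aligned to its actual knowledge boundary rather than to fabricated counterfactuals.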
13. For some reading comprehension tasks (SQuAD and RACE), Llama3 405B actually trails Mixtral 8x22B 🧐 (this is the un-finetuned base model)
14. Often proposed but rarely done: they included robustness checks for permutations of order/prompt/choices in MC benchmarks 👏
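The choice-permutation check can be sketched as follows, with `predict(question, choices) -> index` a hypothetical model interface:

```python
import itertools

def permutation_robustness(question, choices, correct_idx, predict):
    """Fraction of answer-order permutations under which the model still picks
    the same underlying (correct) choice; 1.0 means no position bias."""
    perms = list(itertools.permutations(range(len(choices))))
    hits = sum(1 for p in perms
               if p[predict(question, [choices[i] for i in p])] == correct_idx)
    return hits / len(perms)
```

A model that keys on content scores 1.0; a model that always picks the first option scores only the chance rate, which is exactly the bias this kind of check exposes.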
15. While achieving impressive scores throughout ALL exams (wtf will happen to school exams? 🙈), Llama3 405B blows away the competition in AP Physics (92.9 vs 78.6 for Claude 3.5 🤯) - and also achieves top scores for GMAT Quant, LSAT, AP Env Sci and... Art History 😁
16. Human preference data shows worse multilingual and multi-turn coding performance, especially vs. GPT-4o; however, GPT-4o seems to be VERY optimized for human preference (see LMSys results), so this doesn't necessarily result in worse performance on real-world tasks...
17. The paper even includes a new (?) row-wise FP8 quantization approach and a reward-score based evaluation of its (unsurprisingly) negligible impact on output quality 👌
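"Row-wise" here means one scale factor per matrix row instead of one per tensor, so a single outlier row can't crush everyone else's dynamic range. A rough pure-Python simulation (the E4M3 rounding below is approximate, ignores subnormals, and is not the paper's kernel):

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value

def fake_cast_e4m3(v):
    """Approximate round-to-nearest for E4M3 (3 mantissa bits), with saturation."""
    if v == 0.0:
        return 0.0
    m, e = math.frexp(v)    # v = m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16  # keep 3 fraction bits past the implicit leading bit
    return math.copysign(min(abs(m) * 2.0 ** e, FP8_E4M3_MAX), v)

def rowwise_fp8_roundtrip(matrix):
    """Quantize-dequantize with one scale per row: each row's max magnitude is
    mapped onto the FP8 range, fake-cast, then scaled back."""
    out = []
    for row in matrix:
        amax = max(abs(x) for x in row) or 1.0  # guard all-zero rows
        scale = amax / FP8_E4M3_MAX
        out.append([fake_cast_e4m3(x / scale) * scale for x in row])
    return out
```

With per-row scales the round-trip error stays within E4M3's ~6% relative precision per element, regardless of how differently scaled the rows are.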
18. How can we get LLMs to better understand multimodal data? The detailed description of @Meta's approach will surely help the next generation of open-source multimodal models (esp. video) 👋. What a shame that these won't be available in the EU 🥺
19. At this point it's no surprise that the vision capabilities also seem to be excellent - with Llama 3-V 8b rivaling the original GPT-4V 😀. I'm wondering if a finetuned/quantized/optimized Llama 3-V 8b version will be the most-used OCR engine going forward?
20. What else? Llama3 can also understand speech better than Whisper and seamlessly switch between languages in conversation. The section on prosody modeling (tone/speed variation) is also a first, I think? Really interesting and novel stuff 🤗
21. Link to the full paper here: 📜👋scontent-bru2-1.xx.fbcdn.net/v/t39.2365-6/4…
