Harm de Vries
Research Scientist @ServiceNowRsrch @BigCodeProject
Apr 13, 2023
Surprised by the loss of LLaMA-7B still going down after 1 trillion tokens?

In a new blogpost, I explain why you shouldn't be and argue we haven't reached the limit of the recent trend of training smaller LLMs for longer:
harmdevries.com/post/model-siz…

Analysis in 🧵👇

The result follows from the Chinchilla scaling laws, which give insight into the trade-off between model size and compute overhead.

Let's start with Chinchilla's 3rd approach: it models the loss L as a function of the number of parameters N and the number of training tokens D.
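
To make that concrete, here is a minimal sketch of the parametric form from the Chinchilla paper, L(N, D) = E + A / N^alpha + B / D^beta. The constants are the fitted values reported by Hoffmann et al. (2022), not numbers given in this thread, and the function name `chinchilla_loss` is just an illustrative choice.

```python
# Fitted constants as reported in the Chinchilla paper (Hoffmann et al., 2022).
# They are not stated in this thread, so treat them as an assumption.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


if __name__ == "__main__":
    n = 7e9  # a 7B-parameter model
    # The predicted loss keeps falling as the token count D grows.
    for d in (2.5e11, 5e11, 1e12, 2e12):
        print(f"D = {d:.1e} tokens -> predicted loss {chinchilla_loss(n, d):.3f}")
```

Under these constants the data term B / D^beta is still shrinking at 1 trillion tokens, so a 7B model's predicted loss keeps going down. The absolute values come from Chinchilla's own data and tokenizer, so they should not be read as LLaMA's actual loss.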