Research Scientist @ServiceNowRsrch @BigCodeProject
Apr 13, 2023 • 8 tweets • 3 min read
Surprised by the loss of LLaMA-7B still going down after 1 trillion tokens?
In a new blogpost, I explain why you shouldn't be and argue we haven't reached the limit of the recent trend of training smaller LLMs for longer: harmdevries.com/post/model-siz…
Analysis in 🧵👇
The result follows from the Chinchilla scaling laws, which provide insight into the trade-off between model size and compute overhead.
Let's start with Chinchilla's 3rd approach: it models the loss L as a function of the number of parameters N and the number of training tokens D.
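To make L(N, D) concrete, here is a minimal Python sketch of that parametric form. It assumes the functional form L(N, D) = E + A/N^α + B/D^β and the fitted constants reported in the Chinchilla paper (Hoffmann et al., 2022); neither is spelled out in this thread, and the function name `chinchilla_loss` is purely illustrative.

```python
# Fitted constants as reported in the Chinchilla paper (Hoffmann et al., 2022).
# Assumption: they are not given in this thread and are used for illustration only.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


if __name__ == "__main__":
    n = 7e9  # roughly the size of a 7B-parameter model
    for d in (0.5e12, 1e12, 2e12):
        loss = chinchilla_loss(n, d)
        print(f"N={n:.0e}, D={d:.1e}: predicted loss {loss:.3f}")
```

Under these assumptions the data term B/D^β keeps shrinking as D grows, so the predicted loss of a fixed-size model is still decreasing past 1 trillion tokens and only approaches its floor of E + A/N^α asymptotically.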