How overfit are popular LLMs on public benchmarks?
New research from @scale_ai SEAL answers this:
- produced a new eval, GSM1k, built to mirror GSM8k
- evaluated public LLMs for overfitting on GSM8k
VERDICT: Mistral & Phi are overfitting the benchmark, while GPT, Claude, Gemini, and Llama are not.
h/t to our incredible team for this research:
@hughbzhang, @summeryue0, @_jeffda, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, @seanh, Russell Kaplan, @mikelunati
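The core check is simple: score each model on public GSM8k and on the freshly written GSM1k, and read a large accuracy drop as likely contamination. A minimal sketch of that comparison (the model callable and loaded datasets are stand-ins, not the paper's actual harness):

```python
# Sketch of the GSM8k-vs-GSM1k overfitting check. `model` is any
# prompt -> answer callable; datasets are lists of {"question", "answer"}.
from typing import Callable

def accuracy(model: Callable[[str], str], problems: list[dict]) -> float:
    correct = sum(
        model(p["question"]).strip() == p["answer"].strip() for p in problems
    )
    return correct / len(problems)

def overfit_gap(model, gsm8k: list[dict], gsm1k: list[dict]) -> float:
    # Positive gap: the model does better on the public set than the
    # held-out one, i.e. evidence it has effectively trained on GSM8k.
    return accuracy(model, gsm8k) - accuracy(model, gsm1k)
```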
1/ Today is the 4th anniversary of the original GPT-3 paper—"Language Models are Few-Shot Learners"
Some reflections on how the last 4 years have played out, and thoughts about the next 4 years
2/ GPT-3 was when it first became clear just how much potential there was in scaling language models.
The efficacy of GPT-3 took most of the AI community by surprise: the capabilities were staggering compared to everything that had come before in NLP.
3/ @scale_AI had started working on language models the year before, with the very first RLHF experiments on GPT-2. But GPT-2 still felt very much like a toy at that time.
GPT-3 was the first moment when it was obvious this would be the major theme of AI.
4/ The original scaling laws call for scaling data alongside compute: you can still improve loss with more compute alone, but it is much less efficient than scaling data as well.
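A toy way to see this is the parametric loss fit from the Chinchilla paper (the constants are Hoffmann et al.'s fitted values, not from this thread): scaling parameters alone leaves the data term untouched, so the loss flattens out.

```python
# Toy illustration: Chinchilla-style parametric loss
#   L(N, D) = E + A / N**alpha + B / D**beta   (Hoffmann et al., 2022)
# With data fixed, the B / D**beta term puts a floor under the loss
# no matter how far parameters (a proxy for compute) scale.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

N, D = 1e9, 20e9  # baseline: 1B params trained on 20B tokens
print(f"baseline                : {loss(N, D):.2f}")
for k in (10, 100):
    print(f"{k:>3}x params, fixed data: {loss(k * N, D):.2f}")
    print(f"{k:>3}x params, {k}x data : {loss(k * N, k * D):.2f}")
```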
1/ Some thoughts on the recent OpenAI and Google announcements, and what it indicates about what's next in AI.
Hint: post-training is REALLY important...
THREAD
2/ In many ways, Gemini 1.5 Flash was the gem of Google's announcements. A small model with a 1M-token context window and that level of performance is incredible.
OpenAI now has the best large model with GPT-4o, and Google has the best small model with Gemini 1.5 Flash.
The competition is on.
3/ Regardless, the level of convergence is fascinating—the similarity between 4o and Astra, Veo and Sora, etc. Both labs seem to be following relatively similar technical trajectories.
IMO, divergence would be better for the industry than convergence. Alas...
I'm posting some of my learnings from 2023, AI's biggest year yet.
🧵 for some highlights and a link to the full post
LEARNING 1: The conceit of an expert is a trap. Strive for a beginner’s mind and the energy of a novice.
Experience can often be a curse—the past is only mildly predictive of the future, and every scenario requires new techniques and insight. In novel situations, the novice tends to be at an advantage—their vitality and beginner’s mind lend themselves to faster adaptation.
LEARNING 4: Advice tends to be 90% wrong and 10% right.
Most advice is horribly wrong. But there are always nuggets of truth which can be extremely helpful. The key to advice is to find the nuggets.
With @MetaAI's launch of Llama 2, @scale_ai will also be:
🌎 open-sourcing scale-llm-engine, our library for hosting and fine-tuning open-source LLMs
⚡️ releasing the fastest way to fine-tune Llama 2
💼 launching Scale Custom LLMs for enterprises
Read more in 🧵
We are open-sourcing scale-llm-engine, our library for hosting and fine-tuning open-source LLMs.
This can run on your own infra, as well as on Scale's cloud infrastructure.
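As a quick sketch of what that looks like, based on the client's documented Completion/FineTune interface at the time (treat the model name, S3 path, and exact parameters as placeholders, and check the repo for current signatures):

```python
# pip install scale-llm-engine
from llmengine import Completion, FineTune

# Hosted inference against an open-source model.
response = Completion.create(
    model="llama-2-7b",
    prompt="Why is the sky blue?",
    max_new_tokens=100,
    temperature=0.2,
)
print(response.output.text)

# Kick off a fine-tune from a CSV of prompt/response pairs.
# (Bucket and file path here are hypothetical.)
job = FineTune.create(
    model="llama-2-7b",
    training_file="s3://my-bucket/train.csv",
)
```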