Alexandr Wang
May 2
How overfit are popular LLMs on public benchmarks?

New research out of @scale_ai SEAL to answer this:

- produced a new eval, GSM1k
- evaluated public LLMs for overfitting on GSM8k

VERDICT: Mistral & Phi are overfitting the benchmark, while GPT, Claude, Gemini, and Llama are not.
h/t to our incredible team for this research:
@hughbzhang, @summeryue0, @_jeffda, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, @seanh, Russell Kaplan, @mikelunati

paper link: arxiv.org/abs/2405.00332
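The paper's core check is simple to state: score a model on the public GSM8k and on the freshly written, distribution-matched GSM1k, and read a large accuracy gap as evidence of overfitting. Below is a minimal toy sketch of that comparison; the problem schema, the memorizing "model", and the scoring logic are hypothetical stand-ins, not the paper's actual harness.

```python
from typing import Callable

# A problem is just a question string plus a gold final answer.
Problem = dict  # {"question": str, "gold_answer": str}

def accuracy(answer_fn: Callable[[str], str], problems: list[Problem]) -> float:
    """Fraction of problems where the model's final answer matches gold."""
    hits = sum(answer_fn(p["question"]).strip() == p["gold_answer"]
               for p in problems)
    return hits / len(problems)

def overfit_gap(answer_fn: Callable[[str], str],
                gsm8k: list[Problem], gsm1k: list[Problem]) -> float:
    """Accuracy on the public set minus accuracy on the fresh, matched set.
    A large positive gap suggests the model partly memorized GSM8k."""
    return accuracy(answer_fn, gsm8k) - accuracy(answer_fn, gsm1k)

# Toy usage: a fake "model" that memorized one public problem but
# can't solve anything new shows the maximal gap of 1.0.
gsm8k = [{"question": "2 + 2 = ?", "gold_answer": "4"}]
gsm1k = [{"question": "3 + 5 = ?", "gold_answer": "8"}]
memorized = {"2 + 2 = ?": "4"}
model = lambda q: memorized.get(q, "no idea")
print(overfit_gap(model, gsm8k, gsm1k))  # 1.0 -> fully overfit on this toy data
```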

More from @alexandr_wang

Jun 9
1/ one of the biggest questions in AI today is:

since GPT-4 was trained in fall 2022, we've collectively spent ~$100B on NVIDIA GPUs

will the next generation of AI models' capabilities live up to that aggregate investment level?

[Chart: NVIDIA quarterly datacenter revenue, by @Thomas_Woodside]
2/ there are 2 schools of thought:

1) compute is the only real bottleneck to AI progress. the more we spend, the closer we get to AGI

2) we are hitting a data wall which will slow progress regardless of how much compute we have

3/ the original scaling laws call for scaling data alongside compute. you can still improve loss with more compute alone, but it is much less efficient than scaling data as well
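For concreteness, the standard formalization behind this point is the Chinchilla-style parametric loss from Hoffmann et al. (2022), quoted here as background rather than taken from the thread: N is parameter count, D is training tokens, and E, A, B, alpha, beta are fitted constants.

```latex
% Chinchilla-style scaling law: loss as a function of model size N
% and training tokens D. With D held fixed, only the A/N^alpha term
% can shrink, so loss plateaus near E + B/D^beta regardless of how
% much additional compute is spent on N.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```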
May 29
1/ We are launching SEAL Leaderboards—private, expert evaluations of leading frontier models.

Our design principles:
🔒Private + Unexploitable. No overfitting on evals!
🎓Domain Expert Evals
🏆Continuously Updated w/new Data and Models

Read more in 🧵

scale.com/leaderboard


2/ Evaluations are a critical component of the AI ecosystem.

Evals are incentives for researchers, and our evaluations set the goals for how we aim to improve our models.

Trusted 3rd party evals are a missing part of the whole ecosystem, which is why @scale_AI built these.
3/ We eval'd many of the leading models:

- GPT-4o
- GPT-4 Turbo
- Claude 3 Opus
- Gemini 1.5 Pro
- Gemini 1.5 Flash
- Llama 3
- Mistral Large

On Coding, Math, Instruction Following, and Multilinguality (Spanish).

See the leaderboard results at scale.com/leaderboard.
May 28
1/ Today is the 4th anniversary of the original GPT-3 paper—"Language Models are Few-Shot Learners"

Some reflections on how the last 4 years have played out, and thoughts about the next 4 years
2/ GPT-3 was the first time the potential of scaling language models became clear.

The efficacy of GPT-3 took the AI community by surprise for the most part—the capabilities were staggering compared to everything that came before in NLP.
3/ @scale_AI had started working on language models the year before on the very first RLHF experiments on GPT-2. But GPT-2 still felt very much like a toy at that time.

GPT-3 was the first moment when it was obvious this would be the major theme of AI.
May 16
1/ Some thoughts on the recent OpenAI and Google announcements, and what they indicate about what's next in AI.

Hint: post-training is REALLY important...

THREAD
2/ In many ways, Gemini 1.5 Flash was the gem of Google's announcements. A 1M-context small model with Flash performance is incredible.

OpenAI now has the best large model with GPT-4o, and Google has the best small model with Gemini 1.5 Flash.

The competition is on.
3/ Regardless, the level of convergence is fascinating—the similarity between 4o and Astra, Veo and Sora, etc. Both labs seem to be following relatively similar technical trajectories.

IMO, divergence would be better for the industry than convergence. Alas...
Jan 1
I'm posting some of my learnings from 2023, AI's biggest year yet.

🧵 for some highlights and a link to the post
LEARNING 1: The conceit of an expert is a trap. Strive for a beginner’s mind and the energy of a novice.

Experience can often be a curse—the past is only mildly predictive of the future, and every scenario requires new techniques and insight. In novel situations, the novice tends to be at an advantage—their vitality and beginner’s mind lend themselves to faster adaptation.
LEARNING 4: Advice tends to be 90% wrong and 10% right.

Most advice is horribly wrong. But there are always nuggets of truth which can be extremely helpful. The key to advice is to find the nuggets.
Jul 18, 2023
With @MetaAI's launch of Llama 2, @scale_ai will also be:

🌎 open-sourcing scale-llm-engine, our library for hosting and fine-tuning open-source LLMs
⚡️ releasing the fastest way to fine-tune Llama 2
💼 launching Scale Custom LLMs for enterprises

Read more in 🧵

We are open-sourcing scale-llm-engine, our library for hosting and fine-tuning open-source LLMs.

This can run on your own infra, as well as on Scale's cloud infrastructure.

Docs here: scaleapi.github.io/llm-engine/

Github link: github.com/scaleapi/llm-e…
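For a flavor of what hosting looks like, here is a minimal completion call in the style of the llm-engine README at launch. This is a sketch, assuming the llmengine package is installed and a SCALE_API_KEY is configured; the model name and parameters are illustrative and may have changed since.

```python
# Minimal llm-engine completion call, in the style of the launch README.
# Assumes `pip install scale-llm-engine` and SCALE_API_KEY in the env;
# the model name and sampling parameters below are illustrative.
from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",                  # a hosted open-source model
    prompt="Why is fine-tuning useful?",
    max_new_tokens=100,
    temperature=0.2,
)
print(response.output.text)
```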
We're also releasing a cookbook for fine-tuning Llama 2 in 5 minutes 🔥

Get started here: scaleapi.github.io/llm-engine/gui…

Cookbook here: github.com/scaleapi/llm-e…
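And a correspondingly minimal way to kick off a fine-tune, again in the README's style. This is a sketch: the s3 path is a placeholder, the exact response fields may differ, and the real cookbook covers data formatting and job monitoring.

```python
# Kick off a Llama 2 fine-tune via llm-engine, in the README's style.
# The training-file path is a placeholder; response fields may differ
# from the current API.
from llmengine import FineTune

response = FineTune.create(
    model="llama-2-7b",
    training_file="s3://my-bucket/path/to/training-file.csv",
)
print(response.json())  # includes the fine-tune job id to poll later
```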