The Chinchilla scaling paper by Hoffmann et al. has been highly influential in the language modeling community. We tried to replicate a key part of their work and discovered discrepancies. Here's what we found. (1/9)
We reconstructed the data by extracting the SVG from the paper, parsing out the point locations & colors, mapping the coordinates to model size & FLOP, and mapping the colors to loss values. This let us closely approximate their original dataset from just the figure. (2/9)
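For the curious, here's roughly what that reconstruction looks like in code. A minimal sketch only: it assumes a simple SVG with one <circle> per point, made-up axis endpoints, and a viridis colorbar, none of which are claims about the actual figure's markup.

```python
# Rough sketch of reconstructing plotted points from a figure's SVG.
# Tag structure, axis endpoints, and the colormap are illustrative
# assumptions, not the actual markup of the Chinchilla figure.
import xml.etree.ElementTree as ET
import numpy as np
from matplotlib import colormaps

SVG_NS = "{http://www.w3.org/2000/svg}"

def extract_points(svg_path):
    """Return (cx, cy, fill) for every circle element in the SVG."""
    tree = ET.parse(svg_path)
    return [(float(c.get("cx")), float(c.get("cy")), c.get("fill"))
            for c in tree.iter(f"{SVG_NS}circle")]

def pixel_to_data(cx, cy, x_px=(80, 560), y_px=(400, 40),
                  flops=(1e18, 1e23), params=(4e7, 2e10)):
    """Map pixel coordinates to (FLOP, model size) by interpolating in log
    space between assumed axis endpoints (SVG y grows downward)."""
    fx = (cx - x_px[0]) / (x_px[1] - x_px[0])
    fy = (cy - y_px[0]) / (y_px[1] - y_px[0])
    C = flops[0] * (flops[1] / flops[0]) ** fx
    N = params[0] * (params[1] / params[0]) ** fy
    return C, N

def color_to_loss(hex_color, loss_range=(2.0, 5.0), n=256):
    """Invert an assumed colorbar by nearest-neighbour lookup in RGB space."""
    rgb = np.array([int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5)])
    table = colormaps["viridis"](np.linspace(0, 1, n))[:, :3]
    idx = int(np.argmin(((table - rgb) ** 2).sum(axis=1)))
    return loss_range[0] + (loss_range[1] - loss_range[0]) * idx / (n - 1)
```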
When we fit their parametric scaling law, we get strikingly different estimates (Chi-squared p-value <1e-60!). The differences are significant for the data-scaling coefficient β and the irreducible loss E. (3/9)
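For reference, here is a sketch of the kind of fit involved: the parametric form L(N, D) = E + A/N^α + B/D^β, a Huber loss on log-residuals, and multiple optimizer restarts, loosely in the spirit of the procedure described in the paper. `N`, `D`, `losses` stand in for the reconstructed dataset (numpy arrays), and the initialisation ranges are placeholders.

```python
# Sketch: fit L(N, D) = E + A/N^alpha + B/D^beta by minimising a Huber loss
# on log-residuals, with multi-start L-BFGS-B. Placeholder data and ranges.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, huber

def objective(theta, N, D, losses, delta=1e-3):
    a, b, e, alpha, beta = theta  # a = log A, b = log B, e = log E
    pred = logsumexp(
        [a - alpha * np.log(N), b - beta * np.log(D), np.full_like(N, e, dtype=float)],
        axis=0,
    )
    return huber(delta, pred - np.log(losses)).sum()

def fit(N, D, losses, n_starts=64, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):  # multi-start over initialisations
        x0 = rng.uniform([0, 0, -1, 0, 0], [20, 20, 1, 2, 2])
        res = minimize(objective, x0, args=(N, D, losses), method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    a, b, e, alpha, beta = best.x
    return dict(A=np.exp(a), B=np.exp(b), E=np.exp(e), alpha=alpha, beta=beta)
```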
Hoffmann et al.'s estimated scaling law fits the reconstructed data very poorly compared to ours. Their residuals are not centered at 0 at all! Our model achieves a lower loss on 98% of data points. Clearly, their model does not fit the data. (4/9)
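The "98% of data points" comparison is just a pointwise version of the same objective. A small sketch, where `our_fit`, `their_fit`, `N`, `D`, and `losses` are placeholders:

```python
# Fraction of reconstructed points on which one parametric fit beats another
# under the same Huber-on-log-residuals objective. All inputs are placeholders.
import numpy as np
from scipy.special import huber

def pointwise_loss(fit, N, D, losses, delta=1e-3):
    pred = fit["E"] + fit["A"] / N ** fit["alpha"] + fit["B"] / D ** fit["beta"]
    return huber(delta, np.log(pred) - np.log(losses))

frac_ours_better = np.mean(
    pointwise_loss(our_fit, N, D, losses) < pointwise_loss(their_fit, N, D, losses)
)
```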
Hoffmann et al. also report extremely narrow confidence intervals for some key parameters. We calculate that you’d need about 600,000 data points to nail it down that precisely. By contrast, they likely had ~400. (5/9)
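The back-of-the-envelope logic: standard errors shrink roughly as 1/√n, so supporting a confidence interval k times narrower takes about k² times as much data. Illustrative numbers only, chosen to show the shape of the calculation rather than the exact interval widths from either paper:

```python
# Standard errors scale ~1/sqrt(n): a CI k times narrower needs ~k^2 more data.
# The widths below are illustrative, not the exact figures from either paper.
n_actual = 400            # approximate number of runs behind the figure
width_supported = 0.05    # CI width the data could plausibly support (illustrative)
width_reported = 0.0013   # a much narrower reported width (illustrative)

k = width_supported / width_reported
n_needed = n_actual * k ** 2
print(f"~{n_needed:,.0f} points needed")   # on the order of hundreds of thousands
```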
Moreover, Hoffmann et al.'s estimates imply a scaling policy inconsistent with their other results and the token-to-parameter ratio used for Chinchilla. Our estimates align better with these and have more reasonable uncertainty. (6/9)
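For anyone wanting to check this themselves: under the approximation C ≈ 6ND, minimising L(N, D) = E + A/N^α + B/D^β at fixed compute has a closed-form solution (this is the Approach 3 formula in Hoffmann et al.). A sketch, with parameter values left as placeholders:

```python
# Compute-optimal allocation implied by a parametric fit, from the closed-form
# minimiser of E + A/N^alpha + B/D^beta subject to C = 6*N*D:
#   N_opt = G * (C/6)^(beta/(alpha+beta)),  D_opt = (1/G) * (C/6)^(alpha/(alpha+beta)),
#   G     = (alpha*A / (beta*B))^(1/(alpha+beta)).
# Plug in a fit's central estimates of A, B, alpha, beta to get the implied
# tokens-per-parameter ratio and compare it with the ~20 used for Chinchilla.
def optimal_allocation(A, B, alpha, beta, C):
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    N_opt = G * (C / 6) ** (beta / (alpha + beta))
    D_opt = (1 / G) * (C / 6) ** (alpha / (alpha + beta))
    return N_opt, D_opt, D_opt / N_opt  # params, tokens, tokens-per-parameter

# e.g. optimal_allocation(A, B, alpha, beta, C=6e23) for a roughly Chinchilla-scale budget
```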
Hoffmann et al.’s paper has been highly influential in the language modeling community. Our analysis highlights some potential issues that warrant clarification. (7/9)
We have asked the authors for assistance, but we haven’t been able to get a response. (8/9)
Here is a short preprint that describes our findings in more detail: (9/9)
Worked on this together with @EgeErdil2, @MatthewJBar, and @justjoshinyou13.
arxiv.org/abs/2404.10102
The idea is to define the ‘time horizon’ a human club player needs to match AI moves. Early AIs were easy to outplay quickly, but as you go up to 2400 ELO engines, you need more thinking time—and matching Stockfish might take years per move!
I used a simple scaling law, ELO = a + b·log(Time), to estimate how human thinking time must scale to keep up with AI performance. Fitting it to Chess.com data gives a very rough forecast.
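A minimal sketch of that fit and its inversion; the time/rating pairs below are made-up placeholders rather than the actual Chess.com-derived data:

```python
# Fit Elo = a + b*log(time) to (thinking time, rating) pairs, then invert it
# to ask how much time is needed to match a given engine rating.
# The data points are placeholders, not the actual Chess.com numbers.
import numpy as np

times = np.array([1, 5, 15, 60, 300, 900])               # seconds per move (placeholder)
elos = np.array([1300, 1500, 1650, 1850, 2050, 2200])    # corresponding ratings (placeholder)

b, a = np.polyfit(np.log(times), elos, 1)                # least-squares fit of Elo = a + b*log(t)

def time_to_match(target_elo):
    """Thinking time (seconds) at which the fitted curve reaches target_elo."""
    return np.exp((target_elo - a) / b)

print(time_to_match(2400))   # extrapolated time to match a 2400-rated engine
```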
1/6 We haven't communicated clearly enough about FrontierMath's relationship with OpenAI, and I want to own that. By not being transparent from the start, we caused confusion for contributors, researchers, and the public.
2/6 OpenAI commissioned Epoch AI to produce 300 math problems for FrontierMath. Because it was a commissioned project, OpenAI owns those problems. They have access to the statements and solutions—except for a 50-question holdout set we're finalizing.
3/6 Epoch AI is free to conduct and publish evaluations of any models using the benchmark, as we have done already. We retain this right to evaluate models independently.
I’m excited to announce the development of Tier 4, a new suite of math problems that go beyond the hardest problems in FrontierMath. o3 is remarkable, but there’s still a ways to go before any single AI system nears the collective prowess of the math community.
FrontierMath currently spans three broad tiers:
• T1 (25%) Advanced, near top-tier undergrad/IMO
• T2 (50%) Needs serious grad-level background
• T3 (25%) Research problems demanding relevant research experience
All can take hours—or days—for experts to solve.
Tier 4 aims to push the boundary even further. We want to assemble problems so challenging that solving them would demonstrate capabilities on par with an entire top mathematics department.
1/11 I’m genuinely impressed by OpenAI’s 25.2% Pass@1 performance on FrontierMath—this marks a major leap from prior results and arrives about a year ahead of my median expectations.
2/11 For context, FrontierMath is a brutally difficult benchmark with problems that would stump many mathematicians. The easier problems are as hard as IMO/Putnam; the hardest ones approach research-level complexity.
3/11 With earlier models like o1-preview, Pass@1 performance (solving on first attempt) was only around 2%. When allowing 8 attempts per problem (Pass@8) and counting problems solved at least once, we saw ~6% performance. o3's 25.2% at Pass@1 is substantially more impressive.
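For anyone unfamiliar with the metrics, here's the difference in code, on a toy attempt matrix rather than actual FrontierMath results:

```python
# Pass@1 vs Pass@8 on a boolean matrix of attempt outcomes
# (rows = problems, columns = independent attempts). Toy data only.
import numpy as np

rng = np.random.default_rng(0)
attempts = rng.random((100, 8)) < 0.02   # toy: each attempt succeeds w.p. 2%

pass_at_1 = attempts[:, 0].mean()        # fraction solved on the first attempt
pass_at_8 = attempts.any(axis=1).mean()  # fraction solved at least once in 8 attempts
# Note: real attempts are correlated, so observed Pass@8 grows more slowly
# than this independence toy model would suggest.
print(pass_at_1, pass_at_8)
```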
I’d like to acknowledge @OpenAI’s support in creating FrontierMath. They recently provided permission to publicly share this support.
Their feedback helped strengthen FrontierMath. OpenAI encouraged us to push for significantly greater difficulty, which I believe has made the benchmark more valuable.
I’m excited for us to continue conducting our own independent evaluations, which we expect will accurately reflect model capabilities across various labs.
A few weeks ago, we attempted to replicate the Chinchilla paper. We found that their estimated model fails to adequately fit the reconstructed data, that it implies inconsistent scaling policies, and that their confidence intervals are implausibly narrow.
The authors responded, clarifying that this was the result of their optimizer stopping early due to a bad loss scale choice. They plan to update their results and release the data. We appreciate @borgeaud_s and others' openness in addressing this issue.
This error is understandable. From experience, choosing the right optimizer and loss scale is often non-trivial, and there may be no obvious warning signs when convergence is poor. I know of at least one other otherwise great paper that ran into a very similar issue.
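To show how silently this failure mode can bite (an illustration of the general issue, not a claim about Hoffmann et al.'s exact setup): scipy's L-BFGS-B uses tolerances that are effectively absolute when the objective is small, so dividing a loss by a large constant can make the optimizer declare convergence almost immediately, with no warning.

```python
# Illustration only: the same least-squares problem, once with the raw
# objective and once scaled down by 1e-9. L-BFGS-B's default tolerances
# (ftol relative to max(|f|, 1), pgtol on the raw gradient norm) become
# effectively absolute for tiny objectives, so the scaled run can "converge"
# immediately and still report success.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

def loss(w, scale=1.0):
    return scale * ((X @ w - y) ** 2).mean()

w0 = np.zeros(3)
good = minimize(loss, w0, args=(1.0,), method="L-BFGS-B")
bad = minimize(loss, w0, args=(1e-9,), method="L-BFGS-B")
print(good.x)   # close to the true coefficients [1, -2, 0.5]
print(bad.x)    # typically still ~[0, 0, 0]: stopped at once, far from the optimum
```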