Post

https://x.com/__nmca__/status/1882563758788358396

More from @tamaybes

Tamay Besiroglu

@tamaybes

Dec 21, 2024

I’m excited to announce the development of Tier 4, a new suite of math problems that go beyond the hardest problems in FrontierMath. o3 is remarkable, but there’s still a ways to go before any single AI system nears the collective prowess of the math community.

https://x.com/tamaybes/status/1870335911264932326?s=46

FrontierMath currently spans three broad tiers:
• T1 (25%) Advanced, near top-tier undergrad/IMO
• T2 (50%) Needs serious grad-level background
• T3 (25%) Research problems demanding relevant research experience
All can take hours—or days—for experts to solve.

https://x.com/tamaybes/status/1870335911264932326?s=46

Tier 4 aims to push the boundary even further. We want to assemble problems so challenging that solving them would demonstrate capabilities on par with an entire top mathematics department.

Read 5 tweets

Tamay Besiroglu

@tamaybes

Dec 21, 2024

1/11 I’m genuinely impressed by OpenAI’s 25.2% Pass@1 performance on FrontierMath—this marks a major leap from prior results and arrives about a year ahead of my median expectations.

https://x.com/MatthewJBar/status/1855406568717664760

2/11 For context, FrontierMath is a brutally difficult benchmark with problems that would stump many mathematicians. The easier problems are as hard as IMO/Putnam; the hardest ones approach research-level complexity.

https://x.com/MatthewJBar/status/1855406568717664760

3/11 With earlier models like o1-preview, Pass@1 performance (solving on first attempt) was only around 2%. When allowing 8 attempts per problem (Pass@8) and counting problems solved at least once, we saw ~6% performance. o3's 25.2% at Pass@1 is substantially more impressive.

Read 12 tweets

Tamay Besiroglu

@tamaybes

May 16, 2024

https://twitter.com/tamaybes/status/1780639257389904013

A few weeks ago, we attempted to replicate the Chinchilla paper. We found that their estimated model fails to adequately fit the reconstructed data, that it implies inconsistent scaling policies, and that their confidence intervals are implausibly narrow.

https://twitter.com/tamaybes/status/1780639257389904013

https://twitter.com/borgeaud_s/status/1780988694163321250

The authors responded, clarifying that this was the result of their optimizer stopping early due to a bad loss scale choice. They plan to update their results and release the data. We appreciate @borgeaud_s and others' openness in addressing this issue.

https://twitter.com/borgeaud_s/status/1780988694163321250

This error is understandable. From experience, choosing the right optimizer and loss scale is often non-trivial, with no obvious error signs in case of poor convergence. I know at least another otherwise great paper that had a very similar issue.

Read 9 tweets

Tamay Besiroglu

@tamaybes

Apr 17, 2024

The Chinchilla scaling paper by Hoffmann et al. has been highly influential in the language modeling community. We tried to replicate a key part of their work and discovered discrepancies. Here's what we found. (1/9)

We reconstructed the data by extracting the SVG from the paper, parsing out the point locations & colors, mapping the coordinates to model size & FLOP, and mapping the colors to loss values. This let us closely approximate their original dataset from just the figure. (2/9)

When we fit their parametric scaling law, we get strikingly different estimates (Chi-squared p-value <1e-60!). The differences are significant for the data-scaling coefficient β and the irreducible loss E. (3/9)

Read 10 tweets

Tamay Besiroglu

@tamaybes

Mar 12, 2024

Language models have come a long way since 2012, when recurrent networks struggled to form coherent sentences. Our new paper finds that the compute needed to achieve a set performance level has been halving every 5 to 14 months on average. (1/10)

This rate of algorithmic progress is much faster than the two-year doubling time of Moore's Law for hardware improvements, and faster than other domains of software, like SAT-solvers, linear programs, etc. (2/10)

We estimate this using a dataset of over 200 language models from 2012 to 2023, evaluated on WikiText and Penn Treebank. By fitting a modified neural scaling law to this data, we estimate the rate of algorithmic efficiency improvements over time. (3/10)

Read 11 tweets

Tamay Besiroglu

@tamaybes

Dec 13, 2022

@EgeErdil2

How much progress in machine learning has been due to advances in algorithms (architectures, optimisers, activation functions, etc.), and how much as been due to the scaling of compute or datasets?
@EgeErdil2 and I provide new answers: arxiv.org/abs/2212.05153

We use a dataset of over a hundred computer vision models from the last decade to investigate how better algorithms and architectures have enabled researchers to use compute and data more efficiently.

We find that every 9 months, the introduction of better algorithms contribute the equivalent of a doubling of compute budgets. This is much faster than the gains from Moore’s law! That said, there's uncertainty (our 95% CI spans 4 to 25 months).

Read 6 tweets

Share this page!

Enter URL or ID to Unroll

Tamay Besiroglu

Try unrolling a thread yourself!

More from @tamaybes

Tamay Besiroglu

Tamay Besiroglu

Tamay Besiroglu

Tamay Besiroglu

Tamay Besiroglu

Tamay Besiroglu

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!