Latest Twitter Threads by @tamaybes on Thread Reader App

Mar 20 • 5 tweets • 2 min read

We should be cautious interpreting the METR paper’s results—these ‘time horizons’ depend heavily on which tasks we pick.

As a parallel, I ran a similar analysis on chess and found it can predict AI operating on decade‐long timescales.

https://twitter.com/METR_Evals/status/1902384481111322929

The idea is to define the ‘time horizon’ as the thinking time a human club player needs to match the AI's moves. Early AIs were easy to outplay quickly, but as you go up to 2400 Elo engines, you need more thinking time, and matching Stockfish might take years per move!

Jan 23 • 7 tweets • 2 min read

1/6 We haven't communicated clearly enough about FrontierMath's relationship with OpenAI, and I want to own that. By not being transparent from the start, we caused confusion for contributors, researchers, and the public.

2/6 OpenAI commissioned Epoch AI to produce 300 math problems for FrontierMath. Because it was a commissioned project, OpenAI owns those problems. They have access to the statements and solutions, except for a 50-question holdout set we're finalizing.

Dec 21, 2024 • 5 tweets • 2 min read

I’m excited to announce the development of Tier 4, a new suite of math problems that go beyond the hardest problems in FrontierMath.

o3 is remarkable, but there’s still a ways to go before any single AI system nears the collective prowess of the math community. FrontierMath currently spans three broad tiers:
• T1 (25%) Advanced, near top-tier undergrad/IMO
• T2 (50%) Needs serious grad-level background
• T3 (25%) Research problems demanding relevant research experience
All can take hours—or days—for experts to solve.

https://x.com/tamaybes/status/1870335911264932326?s=46

Dec 21, 2024 • 12 tweets • 3 min read

1/11 I’m genuinely impressed by OpenAI’s 25.2% Pass@1 performance on FrontierMath—this marks a major leap from prior results and arrives about a year ahead of my median expectations.

2/11 For context, FrontierMath is a brutally difficult benchmark with problems that would stump many mathematicians. The easier problems are as hard as IMO/Putnam; the hardest ones approach research-level complexity.

https://x.com/MatthewJBar/status/1855406568717664760

Dec 20, 2024 • 5 tweets • 1 min read

I’d like to acknowledge @OpenAI’s support in creating FrontierMath. They recently provided permission to publicly share this support. Their feedback helped strengthen FrontierMath. OpenAI encouraged us to push for significantly greater difficulty, which I believe has made the benchmark more valuable.

May 16, 2024 • 9 tweets • 3 min read

A few weeks ago, we attempted to replicate the Chinchilla paper. We found that their estimated model fails to adequately fit the reconstructed data, that it implies inconsistent scaling policies, and that their confidence intervals are implausibly narrow.

https://twitter.com/tamaybes/status/1780639257389904013

The authors responded, clarifying that this was the result of their optimizer stopping early due to a bad loss scale choice. They plan to update their results and release the data. We appreciate @borgeaud_s and others' openness in addressing this issue.

https://twitter.com/borgeaud_s/status/1780988694163321250

Apr 17, 2024 • 10 tweets • 4 min read

The Chinchilla scaling paper by Hoffmann et al. has been highly influential in the language modeling community. We tried to replicate a key part of their work and discovered discrepancies. Here's what we found. (1/9)

We reconstructed the data by extracting the SVG from the paper, parsing out the point locations & colors, mapping the coordinates to model size & FLOP, and mapping the colors to loss values. This let us closely approximate their original dataset from just the figure. (2/9)
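The reconstruction steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual code: it assumes the scatter points are stored as SVG `<circle>` elements, that the x-axis is FLOP and the y-axis is model size, and that both axes are log-scaled with known pixel-to-value calibration points. The function names and the calibration tuples are hypothetical, and mapping fill colors to loss values is left out because it depends on the figure's colorbar.

```python
import math
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def pixel_to_log_value(pixel, pixel_lo, pixel_hi, value_lo, value_hi):
    """Map a pixel coordinate to a data value on a log-scaled axis.

    (pixel_lo, value_lo) and (pixel_hi, value_hi) are two calibration points,
    e.g. read off the axis tick marks. These are assumptions of the sketch.
    """
    frac = (pixel - pixel_lo) / (pixel_hi - pixel_lo)
    log_lo, log_hi = math.log10(value_lo), math.log10(value_hi)
    return 10.0 ** (log_lo + frac * (log_hi - log_lo))


def extract_points(svg_text, x_axis, y_axis):
    """Return one record per <circle>: FLOP, parameter count, and raw fill color.

    x_axis and y_axis are (pixel_lo, pixel_hi, value_lo, value_hi) tuples.
    Assumes x encodes FLOP and y encodes model size (hypothetical layout).
    """
    root = ET.fromstring(svg_text)
    points = []
    for circle in root.iter(SVG_NS + "circle"):
        cx = float(circle.get("cx"))
        cy = float(circle.get("cy"))
        points.append({
            "flop": pixel_to_log_value(cx, *x_axis),
            "params": pixel_to_log_value(cy, *y_axis),
            "color": circle.get("fill"),  # still needs a colorbar lookup to get loss
        })
    return points
```

Note that SVG y-coordinates grow downward, so the y-axis calibration would typically pair the larger pixel value with the smaller data value.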

Mar 12, 2024 • 11 tweets • 4 min read

Language models have come a long way since 2012, when recurrent networks struggled to form coherent sentences. Our new paper finds that the compute needed to achieve a set performance level has been halving every 5 to 14 months on average. (1/10)

This rate of algorithmic progress is much faster than the two-year doubling time of Moore's Law for hardware improvements, and faster than other domains of software, like SAT-solvers, linear programs, etc. (2/10)
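To make the comparison concrete, here is a small sketch that converts a halving (or doubling) time in months into a yearly improvement factor. It only restates the arithmetic implied by the figures quoted above; the function names are illustrative and do not come from the paper.

```python
def annual_gain(period_months: float) -> float:
    """Yearly improvement factor for a quantity that halves (or doubles)
    once every `period_months` months."""
    return 2.0 ** (12.0 / period_months)


def speed_ratio(halving_months: float, hardware_doubling_months: float = 24.0) -> float:
    """How many times faster (in log terms) algorithmic progress is than a
    hardware trend with the given doubling time. 24 months is the two-year
    Moore's Law figure mentioned in the thread."""
    return hardware_doubling_months / halving_months


if __name__ == "__main__":
    for months in (5.0, 14.0):
        print(
            f"halving every {months:g} months: "
            f"{annual_gain(months):.2f}x less compute per year, "
            f"{speed_ratio(months):.2f}x the pace of a two-year doubling"
        )
    print(f"two-year doubling: {annual_gain(24.0):.2f}x per year")
```

With the thread's figures, a 5-month halving time works out to roughly a 5.3x yearly reduction in required compute and a 14-month halving time to roughly 1.8x, against about 1.4x per year for a two-year doubling.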

Dec 13, 2022 • 6 tweets • 3 min read

How much progress in machine learning has been due to advances in algorithms (architectures, optimisers, activation functions, etc.), and how much has been due to the scaling of compute or datasets?
@EgeErdil2 and I provide new answers: arxiv.org/abs/2212.05153

We use a dataset of over a hundred computer vision models from the last decade to investigate how better algorithms and architectures have enabled researchers to use compute and data more efficiently.

Jun 20, 2022 • 13 tweets • 6 min read

I recently organized a contest for @Metaculus on investigations into predictions of the future of AI. This resulted in two dozen insightful analyses by forecasters into the prospects of transformatively advanced AI systems. Here are my short summaries of some that stood out:

This piece by @EgeErdil2 uses a hyperbolic growth model to argue that an economy could be transformed fairly quickly following the widespread deployment of advanced AI:
metaculus.com/notebooks/1061…

Feb 22, 2021 • 7 tweets • 2 min read

A recent paper about innovation over the long run reveals a very neat snapshot of the composition of inventions over time. Using data on US patents, it identifies the following key waves:
nber.org/system/files/w…

1840s–70s: Key manufacturing innovations occur (the pneumatic process for cheap steel and the sewing machine are invented); Transport (improvements in steam engines; the Bollman bridge, air brake system, and cable car are patented); Consumer Goods (board game, toothbrush, picture machine).

Nov 22, 2020 • 12 tweets • 4 min read

A few months ago, I wrote an economics dissertation on whether machine learning models are getting harder to find. Here’s a summary of what I found:

Some background: @ChadJonesEcon, @johnvanreenen and others wrote an awesome article that found that ideas are getting harder to find: in semiconductors, agricultural production and medicine, research productivity has been declining steadily.
