gavin leech (Non-Reasoning), Feb 16
New paper on a long-shot I've been obsessed with for a year:

How much are AI reasoning gains confounded by expanding the training corpus 10000x? How much LLM performance is down to "local" generalisation (pattern-matching to hard-to-detect semantically equivalent training data)?
tl;dr

* OLMo 3 training corpus contains exact duplicates of 50% of the ZebraLogic test set.

* We embed the corpus to find semantic duplicates of test data in the wild. 78% of the CodeForces test set had >=1 semantic duplicate

* The semantic duplicate rate is maybe >4 in 10,000
Imagine you're head of training at OpenAI, and you want your benchmark scores to be meaningful (i.e. to estimate OOD performance)

You have a hard task ahead of you! Your models have seen so much, memorisation is so easy - as is *local generalisation* (noisy pattern-matching).
What can you do? Well, obviously you take every benchmark you're going to test on and try to "decontaminate" your training corpus (remove test data from the training data).
By default this is just one level above string matching ("n-gram matching": if a training document shares, say, a 13-token window with a test example, remove it from the training corpus).
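Roughly, the check looks like this (a toy sketch of the idea, not any lab's actual pipeline; the window size and whitespace tokenisation are simplifications):

```python
def ngrams(tokens, n=13):
    # all contiguous n-token windows in a document
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shares_window(train_doc, test_doc, n=13):
    # flag a training doc if any n-token window of a test example appears in it verbatim
    return bool(ngrams(train_doc.split(), n) & ngrams(test_doc.split(), n))

# decontamination = drop every flagged training doc, e.g.
# clean = [doc for doc in corpus if not any(shares_window(doc, t) for t in test_set)]
```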

But you're actually trying, so you also translate the test sets and delete translations of test from train.
But! every piece of test data has an arbitrary number of logical equivalents and neighbours (like how `x + y = 10` is the same problem as `2x + 2y = 20`). And LLMs are amazing at semantic search, so maybe this inflates benchmark scores.
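A toy illustration of how cheaply the equivalents multiply (my own scaffolding, nothing from the paper):

```python
import random

def rescaled_equivalent(a=1, b=1, c=10):
    # scale a*x + b*y = c by a random k: a different string, the same problem
    k = random.randint(2, 99)
    return f"{k*a}x + {k*b}y = {k*c}"

print(rescaled_equivalent())  # e.g. "7x + 7y = 70", solution set identical to "x + y = 10"
```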
The cutting-edge tech for detecting these "semantic" duplicates is... an LLM. But you simply can't do 100T x 1M calls. There's not enough compute in the world (yet).
So you do what you can - maybe you

* categorise the entire corpus & do intense search inside relevant partitions (e.g. maths > number theory > ...)
* embed the whole corpus & look for things really close to test data (sketched below)
* train a wee 300M filter model & do what you can with that
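A minimal sketch of the embedding route, assuming some off-the-shelf sentence-embedding model and an arbitrary similarity cut-off (neither is any lab's, or our, actual choice):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # any embedding model; this one is a stand-in

corpus_texts = ["Solve 2x + 2y = 20 over the integers.", "Write a function that reverses a string."]
test_texts = ["Solve x + y = 10 over the integers."]

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = model.encode(corpus_texts, normalize_embeddings=True)  # (N, d), unit-norm rows
test_emb = model.encode(test_texts, normalize_embeddings=True)      # (M, d)

sims = corpus_emb @ test_emb.T                    # cosine similarities, since rows are normalised
suspects = np.where(sims.max(axis=1) > 0.9)[0]    # 0.9 is an arbitrary cut-off, not a recommendation
print(suspects)                                   # corpus docs suspiciously close to some test item
```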
How much does this process catch? How many semantic duplicates of test data slip through? And what's the impact on final benchmark scores?

We don't know. This (finally) is where our paper comes in:
We experiment on OLMo 3, one of the only really good models with open training data.

Since we have its entire training corpus, we can exhaustively check for real "natural" duplicates and finetune it to estimate their impact. We embed the entire Dolma Instruct corpus.
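At corpus scale you swap the brute-force similarity matrix for an index. A sketch with FAISS and random stand-in vectors (illustrative tooling only, not necessarily the stack we actually ran):

```python
import faiss
import numpy as np

rng = np.random.default_rng(0)
corpus_emb = rng.normal(size=(100_000, 384)).astype("float32")   # stand-ins for real doc embeddings
test_emb = rng.normal(size=(1_000, 384)).astype("float32")       # stand-ins for benchmark embeddings
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)  # L2-normalise so inner product == cosine
test_emb /= np.linalg.norm(test_emb, axis=1, keepdims=True)

index = faiss.IndexFlatIP(corpus_emb.shape[1])   # exact search; swap in an approximate index at real scale
index.add(corpus_emb)
sims, ids = index.search(test_emb, 50)           # top-50 corpus neighbours per benchmark item
hits = [(t, int(i)) for t in range(len(test_emb)) for i, s in zip(ids[t], sims[t]) if s > 0.9]
```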
Firstly: we were surprised by how ineffective n-gram decontamination was at catching exact duplicates - 70% of harder tasks had a match. But the spurious performance gain wasn't so large, at most +4pp.
Secondly, every single MBPP test example and 78% of CodeForces have semantic duplicates.
Thirdly, we generated 10k synthetic duplicates for MuSR, ZebraLogic, and MBPP problems and finetuned on them.

* MuSR +22pp. Semantic duplicates as strong as exact
* ZebraLogic +12pp. Exact much stronger
* MBPP +17pp. Exact stronger
Fourthly, we guess that 4 in 10,000 training datapoints are a strong semantic duplicate for a given benchmark datapoint (where "strong" just means "obvious to Gemini")
So:

n-gram decontamination is not enough even for the easy (exact) stuff; semantic duplicates are at least a moderately big deal; and this probably transfers to frontier models to some degree.

The above are probably underestimates too (since our detection pipeline was cheapo).
Data contamination is a huge field. Here's how we're new:
This is preliminary work on a shoestring - we didn't get at the big questions yet ("what share of benchmark gains come from interpolation over a hidden training corpus?", "does this even matter?")

And local generalisation across very different strings is anyway pretty miraculous.
The grand aim of this research programme is to decompose benchmark gains / apparent AI progress into 4 estimates:

1. benchmaxxing (memorising exact duplicates)
2. usemaxxing (RLing narrow capabilities)
3. hidden interpolation / local generalisation
4. OOD generalisation
We have a lot of ideas! If you're interested in funding this, grab me at gavin@arbresearch.com
Nearly all of the real work was done by Ari Spiesberger, @Juan_VaGu, Nicky Pochinkov, Tomas Gavenciak, @peligrietzer and @NandiSchoots

And ofc this work wouldn't be possible without @allen_ai @natolambert working in public and enabling actually scientific evals.
* that is, at least 50% and at least 78%.
Code: github.com/AriSpiesberger…
