Franklyn Wang

@franklyn_wang

Nov 18 • 14 tweets • 4 min read

Doubling o1-preview performance on ARC-AGI with one simple trick 🚀

tldr: by providing human-like representations to o1, we are able to substantially increase performance on @arcprize.

AI is really smart; its scores on math contests are no joke. ARC Prize tasks should be much easier than math contest problems, and yet frontier models generally do not score very well on them.

This is mainly because frontier models cannot really “see” the grid; imagine a genius trying to do ARC Prize with their eyes closed -- seems unlikely to bear fruit!

ARC Prize problems are not strictly well-posed -- there are many mappings that take every example input to its output, but only one of them is evident to people.

You might ask, how do we get the representations? Do we use tree diffusion or autoencoding? The answer is no! We use an extensive set of handwritten heuristics to capture patterns that are salient to people.

For example, here's a simple case -- the algorithm recognizes the shape being re-colored.

We also show a more complicated example -- note our ability to handle augmentations!

We can even handle occlusions!

Because our method essentially represents each grid as an abstract syntax tree, we call our method Pattern Extraction and Abstraction for Cognitive Heuristics (PEACH) -- as peaches grow on trees, unlike other "reasoning" fruits.
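To make the "grid as a tree" idea concrete, here is a minimal sketch. It is not the PEACH heuristics themselves (those aren't spelled out in this thread). It assumes the simplest possible heuristic: same-colored, 4-connected cells form one object, and color 0 is the background. The names `extract_objects` and `is_recoloring`, and the dictionary layout, are made up for illustration.

```python
from collections import deque


def extract_objects(grid, background=0):
    """Turn a grid into a small tree: a root node with one child per object.

    Assumption (not from the thread): an object is a set of 4-connected
    cells that share a color, and `background` cells are ignored.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    children = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or grid[r][c] == background:
                continue
            color = grid[r][c]
            queue = deque([(r, c)])
            seen[r][c] = True
            cells = []
            # Breadth-first flood fill over same-colored neighbors.
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            top = min(y for y, _ in cells)
            left = min(x for _, x in cells)
            # Store the shape relative to its top-left corner so the same
            # shape at a different position compares equal.
            shape = frozenset((y - top, x - left) for y, x in cells)
            children.append({"color": color, "origin": (top, left), "shape": shape})
    return {"type": "grid", "size": (rows, cols), "children": children}


def is_recoloring(tree_in, tree_out):
    """True if both trees hold the same shapes in the same places, but colors changed."""
    before = {(o["origin"], o["shape"]): o["color"] for o in tree_in["children"]}
    after = {(o["origin"], o["shape"]): o["color"] for o in tree_out["children"]}
    return before.keys() == after.keys() and before != after
```

With this, a pair like `[[0, 1, 1], [0, 1, 0]]` and `[[0, 2, 2], [0, 2, 0]]` reduces to "one L-shaped object, color changed from 1 to 2", which is the kind of description a person would give. Handling augmentations and occlusions, as in the examples above, would need much richer heuristics than this.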

Then, we simply ask frontier LLMs like o1 to code the mapping -- and find that a small handful of samples is enough for strong results, far fewer than the thousands used in prior art!
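One way to use a handful of sampled programs is to keep the first one that reproduces every training pair. The thread doesn't say how samples are selected, so treat this as an assumption: the sketch below shows a common pattern in ARC program synthesis, with plain Python callables standing in for LLM-written code and `select_program` as a hypothetical helper name.

```python
def select_program(candidates, train_pairs):
    """Return the first candidate that maps every training input to its output.

    Assumptions (not from the thread): each candidate is a callable taking a
    grid (list of lists) and returning a grid, and exact match on all
    training pairs is the selection rule.
    """
    for program in candidates:
        try:
            if all(program(grid_in) == grid_out for grid_in, grid_out in train_pairs):
                return program
        except Exception:
            # Sampled code can crash; skip it and try the next sample.
            continue
    return None


# Stand-ins for three sampled programs.
def identity(grid):
    return [row[:] for row in grid]


def crashes(grid):
    return grid[100]  # IndexError on small grids


def recolor_1_to_2(grid):
    return [[2 if cell == 1 else cell for cell in row] for row in grid]


train_pairs = [([[1, 0]], [[2, 0]]), ([[0, 1], [1, 1]], [[0, 2], [2, 2]])]
chosen = select_program([identity, crashes, recolor_1_to_2], train_pairs)
```

Here `chosen` ends up being `recolor_1_to_2`, since the other two samples either fail to match or raise. In practice the sampled code would come back as text and need to be executed in a sandbox.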

We find these results encouraging for solving this problem in a “human-like way”. We also emphasize that we do not fine-tune any models, so there's plenty more juice!

All large efforts require a team. I’d like to thank @gopalkgoel1, @kattian_ , @minimario1729, Justin Zhang, @yunyu_l, @rahulgs, @cool_cocohearts, @fluorane, and @jacobtpl for all their helpful contributions, both conceptual and practical.

I'd also like to acknowledge Yunyu Lin (@yunyu_l) and David Petersen (@typesfaster) for financial support used to conduct this research.

But this is just the beginning! If interested, please DM me here to discuss potential collaborations.
