ARC Prize
Oct 9
New ARC-AGI SOTA: GPT-5 Pro

- ARC-AGI-1: 70.2%, $4.78/task
- ARC-AGI-2: 18.3%, $7.41/task

@OpenAI’s GPT-5 Pro now holds the highest verified frontier LLM score on ARC-AGI’s Semi-Private benchmark.
View the leaderboard: arcprize.org/leaderboard
View GPT-5 Pro's responses: huggingface.co/datasets/arcpr…
Reproduce the results: github.com/arcprize/arc-a… (grading sketch below)
Learn more about our testing policy: arcprize.org/policy
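
For anyone spot-checking a run against the repo linked above: a task counts as solved only if the predicted output grid matches the hidden test output exactly. Below is a minimal grading sketch using the public ARC task JSON format; the file path and the predict() stub are placeholder assumptions, not the harness's actual API.

```python
import json

def load_task(path):
    # Public ARC task format: {"train": [{"input": grid, "output": grid}, ...],
    #                          "test": [...]}, where each grid is a list of
    # lists of ints 0-9.
    with open(path) as f:
        return json.load(f)

def predict(train_pairs, test_input):
    # Placeholder solver: echo the input back (a deliberately weak baseline).
    # A real attempt puts its reasoning / program synthesis here.
    return test_input

def grade(predicted, expected):
    # ARC grading is all-or-nothing: the grid must match cell for cell,
    # including its exact dimensions.
    return predicted == expected

task = load_task("example_task.json")  # placeholder path
for pair in task["test"]:
    guess = predict(task["train"], pair["input"])
    print("correct" if grade(guess, pair["output"]) else "incorrect")
```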

• • •


More from @arcprize

Aug 15
Analyzing the Hierarchical Reasoning Model by @makingAGI

We verified scores on hidden tasks, ran ablations, and found that performance comes from an unexpected source

ARC-AGI Semi Private Scores:
* ARC-AGI-1: 32%
* ARC-AGI-2: 2%

Our 4 findings:
Finding #1: The "hierarchical" architecture had minimal performance impact when compared to a similarly sized transformer

A drop-in transformer comes within a few points without any hyperparameter optimization.

See our full post: arcprize.org/blog/hrm-analy…
Finding #2: The under-documented “outer loop” refinement process drove substantial performance gains

Iterative refinement has a strong impact, as the jump from 1 loop (no refinement) to 2 loops (one refinement pass) shows.
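
The pattern behind this finding is simple: run the model, then feed its own prediction back in for another pass. HRM's exact procedure is in the full post; the sketch below is only a generic illustration of outer-loop refinement, with a toy stand-in model so it runs.

```python
def refine(model, x, n_loops):
    # Generic outer-loop refinement: each pass sees the original input plus
    # the model's previous guess. n_loops=1 is "no refinement"; n_loops=2
    # adds one refinement pass (the jump discussed above).
    y = model(x, prev=None)
    for _ in range(n_loops - 1):
        y = model(x, prev=y)
    return y

def toy_model(x, prev):
    # Toy stand-in for a learned forward pass: each call moves the current
    # guess halfway toward a fixed target value.
    guess = x if prev is None else prev
    return guess + 0.5 * (10 - guess)

print(refine(toy_model, 0.0, 1))  # 5.0 (no refinement)
print(refine(toy_model, 0.0, 2))  # 7.5 (one refinement pass gets closer)
```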
Jul 18
Today, we're announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between what's easy for humans and what's hard for AI

We’re releasing:
* 3 games (environments)
* $10K agent contest
* AI agents API

Starting scores - Frontier AI: 0%, Humans: 100%
Every game environment is novel, unique, and only requires core-knowledge priors

No language, trivia, or specialized knowledge is needed to beat the games

* Play: three.arcprize.org
* Compete: arcprize.org/arc-agi/3/
* Build: three.arcprize.org/docs (see the agent-loop sketch below)
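
For builders: an agent against these environments boils down to an observe-act loop. The endpoint paths, action names, and response fields below are illustrative assumptions only, not the real API surface; see three.arcprize.org/docs for the actual interface.

```python
import random
import requests  # plain HTTP client; the official docs may offer an SDK

BASE = "https://example.invalid/api"  # placeholder, not the real endpoint
ACTIONS = ["up", "down", "left", "right", "click"]  # hypothetical action set

def run_episode(game_id, max_steps=100):
    # Hypothetical reset/step endpoints modeling a generic episodic API.
    state = requests.post(f"{BASE}/games/{game_id}/reset").json()
    for _ in range(max_steps):
        action = random.choice(ACTIONS)  # swap in a real policy here
        state = requests.post(
            f"{BASE}/games/{game_id}/step", json={"action": action}
        ).json()
        if state.get("done"):
            break
    return state.get("score", 0)

print(run_episode("demo-game"))  # random play: expect a score near 0%
```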
Your ability to efficiently adapt to novelty defines your intelligence, not your performance on any single skill

Harder puzzles don't prove an AI is smarter; its ability to learn new rules does

ARC Prize exists to operationalize that insight
Jul 10
Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%

This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA
On ARC-AGI-1, Grok 4 (Thinking) achieves 66.7%, in line with the Pareto frontier for AI reasoning systems we reported last month
Thank you to the @xai team for working with us to validate Grok 4's score and for inviting us to watch the livestream
Apr 22
o3 and o4-mini on ARC-AGI's Semi-Private Evaluation

* o3-medium scores 53% on ARC-AGI-1
* o4-mini shows state-of-the-art efficiency
* ARC-AGI-2 remains virtually unsolved (<3%)

Our analysis highlights how they differ from o3-preview and from other models' behavior
As mentioned before, OpenAI has confirmed that the version of o3 that was released last week is not the same version that we tested in December ‘24.

For more on this see the tweet below or the blog post

During testing, both o3 and o4-mini frequently failed to return outputs when run at “high” reasoning effort.

The partial results we did receive are in the blog post.

However, these high-effort runs were excluded from the leaderboard due to insufficient coverage.
Mar 24
Today we are announcing ARC-AGI-2, an unsaturated frontier AGI benchmark that challenges AI reasoning systems while remaining relatively easy for humans.

Grand Prize: 85%, ~$0.42/task efficiency

Current Performance:
* Base LLMs: 0%
* Reasoning Systems: <4%
ARC-AGI-1 (2019) pinpointed the moment AI moved beyond pure memorization, demonstrated in late 2024 by OpenAI's o3 system.

Now, ARC-AGI-2 raises the bar significantly, challenging known test-time adaptation methods.

@MLStreetTalk is helping us launch ARC-AGI-2 with an interview of @mikeknoop & @fchollet.

Our belief is that once we can no longer come up with quantifiable problems that are relatively easy for humans, yet hard for AI, we have reached AGI.

ARC-AGI-2 proves that we do not have AGI. New ideas are still needed!
Jan 21
Verified DeepSeek performance on ARC-AGI's Public Eval (400 tasks) + Semi-Private (100 tasks)

DeepSeek V3:
* Semi-Private: 7.3% ($0.002)
* Public Eval: 14% ($0.002)

DeepSeek Reasoner:
* Semi-Private: 15.8% ($0.06)
* Public Eval: 20.5% ($0.05)

(Avg $ per task; totals sketched below)
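
As a quick back-of-the-envelope, those per-task averages multiply out to the totals below (assuming the listed task counts and averages):

```python
# Back-of-the-envelope total eval cost from the per-task averages above.
runs = {
    "DeepSeek V3 / Semi-Private":       (100, 0.002),
    "DeepSeek V3 / Public Eval":        (400, 0.002),
    "DeepSeek Reasoner / Semi-Private": (100, 0.06),
    "DeepSeek Reasoner / Public Eval":  (400, 0.05),
}
for name, (n_tasks, avg_cost) in runs.items():
    print(f"{name}: ~${n_tasks * avg_cost:.2f} total")
# V3: ~$0.20 and ~$0.80; Reasoner: ~$6.00 and ~$20.00
```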
Thank you to @rishab_partha for helping with this analysis

The purpose of the 100 Semi-Private tasks is to provide a secondary hold-out test set score.

The 400 Public Eval tasks were published in 2019. They have been widely studied and included in other models' training data.
The score difference between Semi-Private and Public Eval is in line with what has been observed with other models.

You can see more model scores in our 2024 wrap-up:

arcprize.org/2024-results
