ARC Prize
A North Star for open AGI. Co-founders: @fchollet @mikeknoop. President: @gregkamradt. Help support the mission - make a donation today.
Dec 11
A year ago, we verified a preview of an unreleased version of @OpenAI o3 (High) that scored 88% on ARC-AGI-1 at est. $4.5k/task

Today, we’ve verified a new GPT-5.2 Pro (X-High) SOTA score of 90.5% at $11.64/task

This represents a ~390X cost-efficiency improvement in one year.

We also verified that GPT-5.2 Pro (High) is SOTA on ARC-AGI-2, scoring 54.2% at $15.72/task

(Due to API timeouts, we were unable to reliably verify GPT-5.2 Pro (X-High) on ARC-AGI-2)

All verified GPT-5.2 family scores:
arcprize.org/leaderboard
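As a sanity check, the headline efficiency multiple can be reproduced directly from the two quoted per-task costs. A minimal sketch, using only the numbers from this thread (the Dec 2024 cost was an estimate):

```python
# Per-task verification costs quoted in the thread (USD).
o3_preview_cost = 4500.00   # o3 (High) preview, Dec 2024: 88% on ARC-AGI-1 (est.)
gpt52_pro_cost = 11.64      # GPT-5.2 Pro (X-High): 90.5% on ARC-AGI-1

# Cost-efficiency multiple: how much cheaper per task the new SOTA is
# while scoring at least as well.
multiple = o3_preview_cost / gpt52_pro_cost
print(f"~{round(multiple)}X cheaper per task")  # rounds to 387X, i.e. the quoted ~390X
```

The multiple compares cost at comparable (in fact slightly higher) accuracy, which is why it reads as an efficiency gain rather than a pure price drop.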
Dec 5
Announcing the ARC Prize 2025 Top Score & Paper Award winners

The Grand Prize remains unclaimed

Our analysis of AGI progress marks 2025 as the year of the refinement loop.

ARC Prize 2025 Paper Award Winners:

1st / "Less is More: Recursive Reasoning with Tiny Networks" (TRM) / A. Jolicoeur-Martineau / $50k

2nd / "Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI" (SOAR) / J. Pourcel et al. / $20k

3rd / "ARC-AGI Without Pretraining" / I. Liao et al. / $5k
Nov 18
Gemini 3 models from @Google @GoogleDeepMind have made a significant 2X SOTA jump on ARC-AGI-2 (Semi-Private Eval)

Gemini 3 Pro:
31.11%, $0.81/task

Gemini 3 Deep Think (Preview):
45.14%, $77.16/task

Also notable: Gemini 3 results on ARC-AGI-1 (Semi-Private Eval)

Gemini 3 Pro:
75.00%, $0.49/task

Gemini 3 Deep Think (Preview):
87.50%, $44.26/task
Aug 15
Analyzing the Hierarchical Reasoning Model by @makingAGI

We verified scores on hidden tasks, ran ablations, and found that performance comes from an unexpected source

ARC-AGI Semi Private Scores:
* ARC-AGI-1: 32%
* ARC-AGI-2: 2%

Our 4 findings:

Finding #1: The "hierarchical" architecture had minimal performance impact compared to a similarly sized transformer

A drop-in transformer comes within a few points without any hyperparameter optimization.

See our full post: arcprize.org/blog/hrm-analy…
Jul 18
Today, we're announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between what's easy for humans and what's hard for AI

We’re releasing:
* 3 games (environments)
* $10K agent contest
* AI agents API

Starting scores: Frontier AI: 0%, Humans: 100%

Every game environment is novel and unique, and requires only core-knowledge priors

No language, trivia, or specialized knowledge is needed to beat the games

* Play: three.arcprize.org
* Compete: arcprize.org/arc-agi/3/
* Build: three.arcprize.org/docs
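The interactive setup described above (an agent repeatedly observes a game state, acts, and either beats the game or runs out of budget) can be sketched with a toy stand-in environment. Everything here, including ToyGridEnv and run_agent, is a hypothetical illustration, not the real three.arcprize.org agents API:

```python
import random

class ToyGridEnv:
    """Toy 'game': move a cursor along a 1-D grid from cell 0 to the last cell."""
    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def observe(self):
        # The agent sees only the current state, no instructions or language.
        return self.pos

    def step(self, action):
        # action: -1 (left) or +1 (right); walls clamp the position.
        self.pos = max(0, min(self.size - 1, self.pos + action))
        return self.pos == self.size - 1   # True once the game is beaten

def run_agent(env, max_steps=100, seed=0):
    """Random-action baseline agent; returns the step count on success, else None."""
    rng = random.Random(seed)
    for step in range(1, max_steps + 1):
        env.observe()                      # a real agent would condition on this
        if env.step(rng.choice([-1, 1])):
            return step
    return None                            # failed within the action budget

print("solved in step:", run_agent(ToyGridEnv()))
```

A random baseline is the natural floor here; the thread's point is that frontier models currently sit at that floor (0%) while humans solve every game.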
Jul 10
Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%

This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA.

On ARC-AGI-1, Grok 4 (Thinking) achieves 66.7%, in line with the Pareto frontier for AI reasoning systems we reported last month
Apr 22
o3 and o4-mini on ARC-AGI's Semi Private Evaluation

* o3-medium scores 53% on ARC-AGI-1
* o4-mini shows state-of-the-art efficiency
* ARC-AGI-2 remains virtually unsolved (<3%)

Through analysis, we highlight differences from o3-preview and other model behaviors.

As mentioned before, OpenAI has confirmed that the version of o3 released last week is not the same version we tested in December '24.

For more on this see the tweet below or the blog post

Mar 24
Today we are announcing ARC-AGI-2, an unsaturated frontier AGI benchmark that challenges AI reasoning systems while remaining relatively easy for humans.

Grand Prize: 85%, ~$0.42/task efficiency

Current Performance:
* Base LLMs: 0%
* Reasoning Systems: <4%

ARC-AGI-1 (2019) pinpointed the moment AI moved beyond pure memorization, demonstrated in late 2024 by OpenAI's o3 system.

Now, ARC-AGI-2 raises the bar significantly, challenging known test-time adaptation methods.

@MLStreetTalk is helping us launch ARC-AGI-2 with an interview of @mikeknoop & @fchollet.

Jan 21
Verified DeepSeek performance on ARC-AGI's Public Eval (400 tasks) + Semi-Private (100 tasks)

DeepSeek V3:
* Semi-Private: 7.3% ($0.002)
* Public Eval: 14% ($0.002)

DeepSeek Reasoner:
* Semi-Private: 15.8% ($0.06)
* Public Eval: 20.5% ($0.05)

(Avg $ per task)

Thank you to @rishab_partha for helping with this analysis.

The purpose of the 100 Semi-Private tasks is to provide a secondary held-out test-set score.

The 400 Public Eval tasks were published in 2019; they have been widely studied and included in other models' training data.
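Taking the quoted per-task averages at face value, a back-of-the-envelope total for each evaluation run is simply tasks × average cost. A minimal sketch (the run labels are shorthand for this illustration, not official names):

```python
# Quoted figures from the thread: (task count, avg USD per task).
runs = {
    "DeepSeek V3 / Semi-Private":       (100, 0.002),
    "DeepSeek V3 / Public Eval":        (400, 0.002),
    "DeepSeek Reasoner / Semi-Private": (100, 0.06),
    "DeepSeek Reasoner / Public Eval":  (400, 0.05),
}

for name, (tasks, avg_cost) in runs.items():
    # Approximate total spend for the run, from the per-task average.
    print(f"{name}: ~${tasks * avg_cost:.2f} total")
```

The totals (on the order of cents to tens of dollars) underline how cheap these runs were compared with the thousands of dollars per task for the o3 preview verification.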
Dec 6, 2024
ARC Prize remains unbeaten.

In 2024, SOTA moved from 33% to 55.5%.

Announcing: ARC Prize 2024 Winners & Technical Report.

In part due to ARC Prize 2024, we believe AGI progress is no longer stalled.

But new ideas are still needed.

All scores below are open-sourced and reproducible.

Winners and Technical Report: arcprize.org/2024-results