ARC Prize

@arcprize

Jul 18, 2025 • 10 tweets • 4 min read • Read on X

Today, we're announcing a preview of ARC-AGI-3, the Interactive Reasoning Benchmark with the widest gap between easy for humans and hard for AI

We’re releasing:
* 3 games (environments)
* $10K agent contest
* AI agents API

Starting scores - Frontier AI: 0%, Humans: 100%

Every game environment is novel, unique, and only requires core-knowledge priors

No language, trivia, or specialized knowledge is needed to beat the games

* Play: three.arcprize.org
* Compete: arcprize.org/arc-agi/3/
* Build: three.arcprize.org/docs

Your ability to efficiently adapt to novelty defines your intelligence, not your performance on a single skill

Solving harder puzzles doesn’t prove an AI is smarter; its ability to learn new rules does

ARC Prize exists to operationalize that insight

We create benchmarks that highlight the gap between human generalization and machine pattern‑matching

ARC‑AGI‑1 (2019) challenged deep‑learning

ARC‑AGI‑2 (2025) challenges static reasoning models

Agents are now the frontier. They perceive, plan, act, remember, adapt. Static puzzles aren’t equipped to grade that loop

We need interactive benchmarks that test world‑model building and long‑horizon planning under sparse feedback
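As a rough illustration of that perceive, plan, act, remember, adapt loop under sparse feedback, here is a minimal Python sketch. Everything in it is hypothetical: `CorridorGame` and `run_agent` are made-up names for a toy corridor environment, not an ARC-AGI-3 game and not the ARC-AGI API. The agent is not told what its actions do; it tries each one, records the observed effect, then reuses the action that moved it forward.

```python
class CorridorGame:
    """Toy stand-in environment (hypothetical, not an ARC-AGI-3 game)."""

    def __init__(self, length=6):
        self.length = length
        self.position = 0

    def step(self, action):
        # The agent is never told this mapping: "A" moves right, "B" moves left.
        delta = {"A": 1, "B": -1}[action]
        self.position = max(0, min(self.length - 1, self.position + delta))
        solved = self.position == self.length - 1
        # Sparse feedback: the new position and a solved flag, no score.
        return self.position, solved


def run_agent(env, max_steps=50):
    """Return the number of steps taken to solve env, or None if it ran out."""
    world_model = {}          # remember: action -> last observed position change
    position = env.position   # perceive the starting state
    for step_count in range(1, max_steps + 1):
        untried = [a for a in ("A", "B") if a not in world_model]
        if untried:
            action = untried[0]  # explore an action with unknown effect
        else:
            # plan: reuse the action whose observed effect advanced furthest
            action = max(world_model, key=world_model.get)
        new_position, solved = env.step(action)            # act
        world_model[action] = new_position - position      # adapt the model
        position = new_position
        if solved:
            return step_count
    return None


if __name__ == "__main__":
    print(run_agent(CorridorGame(length=6)))
```

The agent spends its first two steps discovering what each action does, which is the cost of building even a trivial world model before any planning can start.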

Enter ARC‑AGI‑3

6 brand‑new games. Easy for humans, out of reach for today’s best AI models

3 games are live today, 3 will go live in August

ARC-AGI API ships today

Plug in any LLM, RL, or hybrid agent, train locally, test against our servers

o3 (left) and Grok 4 (right) replays below

spoiler: neither completes a single level

ARC-AGI-3 Preview games need to be pressure tested. We’re hosting a 30-day agent competition in partnership with @huggingface

We’re calling on the community to build agents (and win money!)

arcprize.org/competitions/a…

We're excited to share this preview of ARC-AGI-3. This is just the beginning

Visit arcprize.org/arc-agi/3 to learn more

More from @arcprize

ARC Prize

@arcprize

Dec 11, 2025

A year ago, we verified a preview of an unreleased version of @OpenAI o3 (High) that scored 88% on ARC-AGI-1 at est. $4.5k/task

Today, we’ve verified a new GPT-5.2 Pro (X-High) SOTA score of 90.5% at $11.64/task

This represents a ~390X efficiency improvement in one year
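As a rough check, assuming the ~390X figure is simply the ratio of the two per-task costs quoted above (the post does not spell out its method), the arithmetic can be sketched in Python. The variable names are ours, chosen for illustration.

```python
# Per-task costs as quoted in the post; the ratio is our own arithmetic.
o3_preview_cost_per_task = 4500.00   # est. $4.5k/task, o3 (High) preview
gpt_5_2_pro_cost_per_task = 11.64    # $11.64/task, GPT-5.2 Pro (X-High)

cost_ratio = o3_preview_cost_per_task / gpt_5_2_pro_cost_per_task
print(f"Cost per task fell by a factor of about {cost_ratio:.0f}")
```

Under that assumption the ratio comes to roughly 387, which is consistent with the rounded ~390X. The comparison is cost only: the two scores (88% and 90.5%) are close but not identical.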

We also verified that GPT-5.2 Pro (High) is SOTA for ARC-AGI-2, scoring 54.2% for $15.72/task

(Due to API timeouts, we were unable to reliably verify GPT 5.2 Pro X-High on ARC-AGI-2)

All verified GPT-5.2 family scores:
arcprize.org/leaderboard

ARC-AGI is achieving its 2019 goal to push AI beyond memorization towards efficient on-the-fly adaptation

Reasoning systems now show genuine fluid intelligence on simple tasks

ARC Prize

@arcprize

Dec 5, 2025

Announcing the ARC Prize 2025 Top Score & Paper Award winners

The Grand Prize remains unclaimed

Our analysis of AGI progress marks 2025 as the year of the refinement loop

ARC Prize 2025 Paper Award Winners

1st / "Less is More: Recursive Reasoning with Tiny Networks" (TRM) / A. Jolicoeur-Martineau / $50k

2nd / "Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI" (SOAR) / J. Pourcel et al. / $20k

3rd / "ARC-AGI Without Pretraining" / I. Liao et al. / $5k

ARC Prize 2025 Top Score Winners

1st / NVARC / 25.03% / $25k
2nd / the ARChitects / 16.53% / $10k
3rd / MindsAI @ Tufa Labs / 12.64% / $5k
4th / Lonnie / 6.67% / $5k
5th / Guillermo Barbadillo @ Veridas / 6.53% / $5k

ARC Prize

@arcprize

Nov 18, 2025

Gemini 3 models from @Google @GoogleDeepMind have made a significant 2X SOTA jump on ARC-AGI-2 (Semi-Private Eval)

Gemini 3 Pro:
31.11%, $0.81/task

Gemini 3 Deep Think (Preview):
45.14%, $77.16/task

Also, notable - Gemini 3 results on ARC-AGI-1 (Semi-Private Eval)

Gemini 3 Pro:
75.00%, $0.49/task

Gemini 3 Deep Think (Preview):
87.50%, $44.26/task

Frontier AI reasoning systems are now closing the complexity scaling gap between ARC-AGI-1 and ARC-AGI-2

This is surprising, as these same systems also make obvious mistakes on easy tasks (for humans) from ARC-AGI-1. We're not sure why and invite help from the community to study this phenomenon

Full solution logs are linked in last tweet

ARC Prize

@arcprize

Aug 15, 2025

Analyzing the Hierarchical Reasoning Model by @makingAGI

We verified scores on hidden tasks, ran ablations, and found that performance comes from an unexpected source

ARC-AGI Semi Private Scores:
* ARC-AGI-1: 32%
* ARC-AGI-2: 2%

Our 4 findings:

Finding #1: The "hierarchical" architecture had minimal performance impact when compared to a similarly sized transformer

A drop-in transformer comes within a few points without any hyperparameter optimization.

See our full post: arcprize.org/blog/hrm-analy…

Finding #2: The under-documented “outer loop” refinement process drove substantial performance gains

Iterating over data by refining it has a strong impact, as the jump from 1 (no refinement) to 2 (1 refinement) shows.

ARC Prize

@arcprize

Jul 10, 2025

Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%

This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA

On ARC-AGI-1, Grok 4 (Thinking) achieves 66.7%, in line with the Pareto frontier for AI reasoning systems we reported last month

Thank you to the @xai team for working with us to validate Grok 4's score and inviting us to watch the live stream

ARC Prize

@arcprize

Apr 22, 2025

o3 and o4-mini on ARC-AGI's Semi Private Evaluation

* o3-medium scores 53% on ARC-AGI-1
* o4-mini shows state-of-the-art efficiency
* ARC-AGI-2 remains virtually unsolved (<3%)

Through analysis we highlight differences from o3-preview and other model behavior

As mentioned before, OpenAI has confirmed that the version of o3 that was released last week is not the same version that we tested in December ’24.

For more on this see the tweet below or the blog post

https://x.com/arcprize/status/1912567067024453926

During testing, both o3 and o4-mini frequently failed to return outputs when run at “high” reasoning.

The partial results we did receive are in the blog post.

However, these reasoning efforts were excluded from the leaderboard due to insufficient coverage.
