Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%
This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA
On ARC-AGI-1, Grok 4 (Thinking) achieves 66.7%, in line with the Pareto frontier for AI reasoning systems we reported last month
Apr 22 • 8 tweets • 4 min read
o3 and o4-mini on ARC-AGI's Semi-Private Evaluation
Our analysis highlights how they differ from o3-preview and from other models' behavior
As mentioned before, OpenAI has confirmed that the version of o3 released last week is not the same version that we tested in December '24.
For more on this, see the tweet below or the blog post
Today we are announcing ARC-AGI-2, an unsaturated frontier AGI benchmark that challenges AI reasoning systems while keeping the same relative ease for humans.
Grand Prize: 85%, ~$0.42/task efficiency
Current Performance:
* Base LLMs: 0%
* Reasoning Systems: <4%
ARC-AGI-1 (2019) pinpointed the moment AI moved beyond pure memorization in late 2024, as demonstrated by OpenAI's o3 system.
Now, ARC-AGI-2 raises the bar significantly, challenging known test-time adaptation methods.
@MLStreetTalk is helping us launch ARC-AGI-2 with an interview of @mikeknoop & @fchollet.
Jan 21 • 4 tweets • 1 min read
Verified DeepSeek performance on ARC-AGI's Public Eval (400 tasks) + Semi-Private (100 tasks)