Thread by @ArtificialAnlys on Thread Reader App

xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model.

We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68. Full results breakdown below.

This is the first time that @elonmusk's @xai has the lead the AI frontier. Grok 3 scored competitively with the latest models from OpenAI, Anthropic and Google - but Grok 4 is the first time that our Intelligence Index has shown xAI in first place.

We tested Grok 4 via the xAI API. The version of Grok 4 deployed for use on X/Twitter may be different to the model available via API. Consumer application versions of LLMs typically have instructions and logic around the models that can change style and behavior.

Grok 4 is a reasoning model, meaning it ‘thinks’ before answering. The xAI API does not share reasoning tokens generated by the model.

Grok 4’s pricing is equivalent to Grok 3 at $3/$15 per 1M input/output tokens ($0.75 per 1M cached input tokens). The per-token pricing is identical to Claude 4 Sonnet, but more expensive than Gemini 2.5 Pro ($1.25/$10, for <200K input tokens) and o3 ($2/$8, after recent price decrease). We expect Grok 4 to be available via the xAI API, via the Grok chatbot on X, and potentially via Microsoft Azure AI Foundry (Grok 3 and Grok 3 mini are currently available on Azure).

Key benchmarking results:
➤ Grok 4 leads in not only our Artificial Analysis Intelligence Index but also our Coding Index (LiveCodeBench & SciCode) and Math Index (AIME24 & MATH-500)
➤ All-time high score in GPQA Diamond of 88%, representing a leap from Gemini 2.5 Pro’s previous record of 84%
➤ All-time high score in Humanity’s Last Exam of 24%, beating Gemini 2.5 Pro’s previous all-time high score of 21%. Note that our benchmark suite uses the original HLE dataset (Jan '25) and runs the text-only subset with no tools
➤ Joint highest score for MMLU-Pro and AIME 2024 of 87% and 94% respectively
➤ Speed: 75 output tokens/s, slower than o3 (188 tokens/s), Gemini 2.5 Pro (142 tokens/s), Claude 4 Sonnet Thinking (85 tokens/s) but faster than Claude 4 Opus Thinking (66 tokens/s)

Other key information:
➤ 256k token context window. This is below Gemini 2.5 Pro’s context window of 1 million tokens, but ahead of Claude 4 Sonnet and Claude 4 Opus (200k tokens), o3 (200k tokens) and R1 0528 (128k tokens)
➤ Supports text and image input
➤ Supports function calling and structured outputs

See below for further analysis 👇

Grok 4 scores higher in Artificial Analysis Intelligence Index than any other model. Its pricing is higher than OpenAI’s o3, Google’s Gemini 2.5 Pro and Anthropic’s Claude 4 Sonnet - but lower than Anthropic’s Claude 4 Opus and OpenAI’s o3-pro.

Full set of intelligence benchmarks that we have run independently on xAI’s Grok 4 API:

Grok 4 recorded slightly higher output token usage compared to peer models when running the Artificial Analysis Intelligence Index. This translates to higher cost relative to its per token price.

xAI’s API is serving Grok 4 at 75 tokens/s. This is slower than o3 (188 tokens/s) but faster than Claude 4 Opus Thinking (66 tokens/s).

Grok 4 is now live on Artificial Analysis: artificialanalysis.ai

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Share this page!

Enter URL or ID to Unroll