Post

More from @ArtificialAnlys

Artificial Analysis

@ArtificialAnlys

Jul 8

SpaceXAI’s Grok 4.5 scores 54 to place fourth on the Artificial Analysis Intelligence Index following only Fable 5, GPT-5.5, and Opus 4.8. It scores on par with GPT-5.5 in Codex on the Artificial Analysis Coding Agent Index in the Grok Build harness, at much lower cost

Grok 4.5 improves 16 points over Grok 4.3 on the Intelligence Index, bringing SpaceXAI to the intelligence frontier behind only OpenAI and Anthropic, and outperforming all open weights models and notably Google’s Gemini models. Key standout areas of performance are agentic knowledge work and coding.

Grok 4.5 in Grok Build scores 76 on the Artificial Analysis Coding Agent Index, on par with GPT-5.5 (xhigh) in Codex and just below Fable 5 (max) in Claude Code, and at a small fraction of the token usage and price.

Congratulations to @SpaceXAI, @cursor_ai, and @elonmusk on the impressive release!

Key Takeaways:

➤ Grok 4.5 performs very strongly on agentic tasks. Grok 4.5 ranks #4 on GDPval-AA v2 with an Elo of 1543, between Claude Opus 4.8 (1600) and GLM-5.2 (1513). It achieves the top score on 𝜏³-Banking of 33%, above 31% from GPT-5.5 (xhigh), and sits on the cost vs performance Pareto frontier across all three agentic evaluations in the Intelligence Index
➤ Grok 4.5 is one of the most cost efficient models to run for near-frontier intelligence. It costs $0.31 per task on the Artificial Analysis Intelligence Index and $2.59 per task on the Artificial Analysis Coding Agent Index within Grok Build
➤ Low cost for Grok 4.5 is driven by both low pricing and token efficiency. Grok 4.5 has a headline price over 60% lower than Claude Opus 4.8 and GPT-5.5, and used ~14k output tokens per Intelligence Index Task - over 60% lower than Opus 4.8. On the Coding Agent Index, Grok 4.5 stands out on the Pareto frontier of Coding Agent Index score vs. Total Tokens, using only 1.9M tokens for the Coding Agent Index while scoring 76
➤ As a coding agent, Grok 4.5 in Grok Build is on par with GPT-5.5 and offers efficiency benefits: In our Artificial Intelligence Coding Agent Index that consists of DeepSWE, Terminal-Bench v2, and SWE-Atlas QnA, Grok 4.5 in Grok Build ranks third, on par with GPT-5.5 (Codex) and below Fable 5 (Claude Code). It is also very efficient in achieving this result: Grok 4.5 in Grok Build cost $2.49 per task while Fable 5 in Claude Code cost $11.80 and GPT-5.5 in Codex $5.07. This is driven by relatively low token pricing and the model using far fewer tokens than comparable models (1.9M average tokens used per task), significantly less than Fable 5 in Claude Code (7.2M) and GPT-5.5 in Codex (6.2M)

Other model details:

➤ Context window of 500k tokens - a reduction from Grok 4.3’s 1M token context, but retaining configurable reasoning and vision input

➤ Pricing of $2/$6 per 1M tokens of input/output; cache hits are discounted by 75% to $0.5 per 1M tokens, and costs still double with long (>200k token) inputs

➤ As Elon Musk has disclosed, Grok 4.5 is 3x larger than its predecessor at 1.5T parameters

Grok 4.5 performs very strongly on agentic tasks including knowledge work, terminal use, and customer service. Across GDPval-AA v2, 𝜏³-Banking, and Terminal-Bench v2.1, Grok 4.5 sits in line or ahead of Claude Opus 4.8 and GPT-5.5. We’ll be evaluating it soon on AA-Briefcase, our private benchmark of long-horizon knowledge work tasks across projects.

Grok 4.5 is highly cost-effective - it costs $0.31 to run per Intelligence Index task, which is less than GLM-5.2 and Kimi K2.6, and a 5x lower cost than Claude Sonnet 5 (max) while performing better on the Intelligence Index. This places Grok 4.5 firmly on the performance/cost Pareto frontier as a very attractive option for intelligence per dollar

Read 8 tweets

Artificial Analysis

@ArtificialAnlys

Apr 30

xAI has launched Grok 4.3, achieving 53 on the Artificial Analysis Intelligence Index with improved agentic performance, ~40% lower input price, and ~60% lower output price than Grok 4.20

The release of Grok 4.3 places @xAI just above Muse Spark and Claude Sonnet 4.6 on the Intelligence Index, and a 4 points ahead of the latest version of Grok 4.20. Grok 4.3 improves its Artificial Analysis Intelligence Index score while reducing cost to run the benchmark suite.

Key Takeaways:

➤ Grok 4.3 improves on cost-per-intelligence relative to Grok 4.20 0309 v2: it scores higher on the Intelligence Index while costing less to run the full benchmark suite. Grok 4.3 costs $395 to run the Artificial Analysis Intelligence Index, around 20% lower than Grok 4.20 0309 v2, despite using more output tokens. This makes it one of the lower-cost models at its intelligence level

➤ Large increase in real world agentic task performance: The largest single benchmark improvement is on GDPval-AA, where Grok 4.3 scores an ELO of 1500, up 321 points from Grok 4.20 0309 v2’s score of 1179 Grok 4.3, surpassing Gemini 3.1 Pro Preview, Muse Spark, Gpt-5.4 mini (xhigh), and Kimi K2.5. Grok 4.3 narrows the gap to the leading model on GDPval-AA, but still trails GPT-5.5 (xhigh) by 276 Elo points, with an expected win rate of ~17% against GPT-5.5 (xhigh) under the standard Elo formula

➤ Grok 4.3’s performs strongly on instruction following and agentic customer support tasks. It gains 5 points on 𝜏²-Bench Telecom to reach 98%, in line with GLM-5.1. Grok 4.3 maintains an 81% IFBench score from Grok 4.20 0309 v2

➤ Gains 8 points on AA-Omniscience Accuracy, but at the cost of lower AA-Omniscience Non-Hallucination Rate of 8 points, so Grok 4.20 0309 v2 still leads AA-Omniscience Non-Hallucination Rate, followed by MiMo-V2.5-Pro, in line with Grok 4.3

Congratulations to @xAI and @elonmusk on the impressive release!

This release shows increased cost efficiency to run the Artificial Analysis Intelligence Index, with Grok 4.3 sitting comfortably on the Pareto frontier for intelligence versus cost

Driven by 37.5% lower input token prices and 58.3% lower output token prices, it costs $395 to run the Intelligence Index evaluations, an overall ~20% decrease from Grok 4.20 0309 v2

Grok 4.3 uses ~44% more output tokens to run the Artificial Analysis Intelligence Index than Grok 4.20 0309 v2, but uses a similar number of tokens to models like Minimax M2.7 and remains less verbose than other leading models

Read 6 tweets

Artificial Analysis

@ArtificialAnlys

Feb 19

Google is once again the leader in AI: Gemini 3.1 Pro Preview leads the Artificial Analysis Intelligence Index, 4 points ahead of Claude Opus 4.6 while costing less than half as much to run

@GoogleDeepMind gave us pre-release access to Gemini 3.1 Pro Preview. It leads 6 of the 10 evaluations that make up the Artificial Analysis Intelligence Index and improves significantly over Gemini 3 Pro Preview across capabilities, with the biggest gains in reasoning and knowledge, coding, and hallucination reduction.

Gemini 3.1 Pro Preview also remains relatively token efficient, using ~57M tokens to run the Artificial Analysis Intelligence Index (+1M from Gemini 3 Pro Preview), lower than other frontier models at max reasoning settings such as Opus 4.6 (max) and GPT-5.2 (xhigh). Combined with lower per-token pricing, Gemini 3.1 Pro Preview is cost-efficient among frontier peers, costing less than half as much as Opus 4.6 (max) to run the full Intelligence Index, though still nearly 2x the leading open-weights model, GLM-5.

Key Takeaways:

➤ State-of-the-art intelligence at lower costs: Gemini 3.1 Pro Preview is leading 6 of the 10 evaluations that make up the Artificial Analysis Intelligence Index at less than half the cost to run of frontier peers from @OpenAI and @AnthropicAI. It obtains the highest score in Terminal-Bench Hard (agentic coding), AA-Omniscience (knowledge & hallucination), Humanity’s Last Exam (reasoning & knowledge), GPQA-Diamond (scientific reasoning), SciCode (coding) and CritPt (research-level physics). The CritPt score is particularly notable, scoring 18% on unpublished, research-level physics reasoning problems, over 5 p.p. above the next best model

➤ Improved real-world agentic performance, but not leading: Gemini 3.1 Pro Preview shows an improvement in GDPval-AA, our agentic evaluation focusing on real-world tasks, but is still not the leading model in this area. The model increases its ELO score over 100 points to 1316 (up from Gemini 3 Pro Preview), however still sits behind Claude Sonnet 4.6, Opus 4.6, GPT-5.2 (xhigh), and GLM-5

➤ Leading coding abilities: Gemini 3.1 Pro Preview leads the Artificial Analysis Coding Index, achieving the highest score in both Terminal-Bench Hard (54%) and SciCode (59%)

➤ Reduced hallucinations: Gemini 3.1 Pro Preview shows a major improvement in tendency to guess incorrectly when it doesn’t know the answer, reducing its AA-Omniscience hallucination rate by 38 p.p. from Gemini 3 Pro Preview

➤ Maintained token and cost efficiency: Gemini 3.1 Pro Preview improves without material increases in cost or token usage. It uses only ~2% more tokens to run the Artificial Analysis Intelligence Index than Gemini 3 Pro Preview, and keeps the same pricing ($2/$12 per 1M input/output tokens for ≤200k context). Its cost to run the Artificial Analysis Intelligence Index of $892 is less than half of frontier models such as Opus 4.6 (max) and GPT-5.2 (xhigh), though still ~2x the cost of leading open weights models such as GLM 5 ($547)

➤ Google takes top 3 spots in multi-modality: Gemini 3.1 Pro Preview ranks #1 on MMMU-Pro, our multimodal understanding and reasoning benchmark, ahead of Gemini 3 Pro Preview and Gemini 3 Flash, reinforcing Google’s leadership in multimodal reasoning

➤ Other model details: Gemini 3.1 Pro Preview retains the same 1 million token context window as its predecessor, and includes support for tool calling, structured outputs, and JSON mode

Gemini 3.1 Pro Preview improves without becoming more expensive or much more verbose, using only ~1M more tokens compared to Gemini 3 Pro Preview, representing a $72 increase in cost to run the Artificial Analysis Intelligence Index. This cost is less than half of frontier peers such as Opus 4.6 (max) and GPT-5.2 (xhigh), though still ~2x the cost of leading open-weights models such as GLM 5 and Kimi K2.5.

Gemini 3.1 Pro Preview has an average speed of 114 output tokens/s. Although slightly slower than its predecessor (-10 t/s), it remains one of the fastest models in the top 10 of the Artificial Analysis Intelligence Index, trailing only other Google models (Gemini 3 Flash and Gemini 3 Pro Preview).

Read 8 tweets

Artificial Analysis

@ArtificialAnlys

Dec 20, 2025

Xiaomi has just launched MiMo-V2-Flash, a 309B open weights reasoning model that scores 66 on the Artificial Analysis Intelligence Index. This release elevates Xiaomi to alongside other leading AI model labs.

Key benchmarking takeaways:

➤ Strengths in Agentic Tool Use and Competition Math: MiMo-V2-Flash scores 95% on τ²-Bench Telecom and 96% on AIME 2025, demonstrating strong performance on agentic tool-use workflows and competition-style mathematical reasoning. MiMo-V2-Flash currently leads the τ²-Bench Telecom category among evaluated models

➤ Cost competitive: The full Artificial Analysis evaluation suite cost just $53 to run. This is supported by MiMo-V2-Flash’s highly competitive pricing of $0.10 per million input and $0.30 per million output, making it particularly attractive for cost-sensitive deployments and large-scale production workloads. This is similar to DeepSeek V3.2 ($54 total cost to run), and well below GPT-5.2 ($1,294 total cost to run)

➤ High token usage: MiMo-V2-Flash is demonstrates high verbosity and token usage relative to other models in the same intelligence tier, using ~150M reasoning tokens across the Artificial Analysis Intelligence suite

➤ Open weights: MiMo-V2-Flash is open weights and is 309B parameters with 15B active at inference time. Weights are released under a MIT license, continuing the trend of Chinese AI model labs open sourcing their frontier models

See below for further analysis:

MiMo-V2-Flash demonstrates particular strength in agentic tool-use and Competition Math, scoring 95% on τ²-Bench Telecom and 96% on AIME 2025. This places it amongst the best performing models in these categories.

MiMo-V2-Flash is one of the most cost-effective models for its intelligence, priced at only $0.10 per million input tokens and $0.30 per million output tokens.

Read 7 tweets

Artificial Analysis

@ArtificialAnlys

Dec 10, 2025

Announcing Stirrup, our new open source framework for building agents. It’s lightweight, flexible, extensible and incorporates best-practices from leading agents like Claude Code

Stirrup differs from other agent frameworks by avoiding the rigidity that can degrade output quality. Stirrup lets models drive their own workflow, like Claude Code, while still giving developers structure and building in essential features like context management, MCP support and code execution. We use Stirrup at Artificial Analysis as part of our agentic benchmarks, including as part of our GDPval-AA evaluation being released later today. Just ‘pip install stirrup’ to start building your own agents today!

Key advantages:
➤ Works with the model, not against it: Stirrup steps aside and lets the model decide how to solve multi step tasks, as opposed to existing frameworks which impose strict patterns that limit performance.

➤ Best practices built in: We studied leading agent systems (e.g. Claude Code) to extract practical patterns around context handling, tool design, and workflow stability, and embedded those directly into the framework.

➤ Fully customizable: Use Stirrup as a package or as a starting template to build your own fully customized agents.

Feature highlights:
➤ Essential tools ready to use: Ships with pre built tools such as online search and browsing, code execution (local, docker, or using an @e2b sandbox), MCP client and document IO

➤ Flexible tool layer: A Generic Tool interface makes it simple to define and extend custom tools

➤ Context management: Automatic summarization to stay within context limits while preserving task fidelity

➤ Provider flexibility: Built in support for OpenAI compatible APIs (including @OpenRouterAI) and LiteLLM, or bring your own client

➤ Multimodal support: Process images, video, and audio with automatic format handling

Stirrup agents can be easily set up in just a few lines of code

Stirrup includes built in logging to help you observe and debug agents

Read 4 tweets

Artificial Analysis

@ArtificialAnlys

Dec 1, 2025

Introducing the Artificial Analysis Openness Index: a standardized and independently assessed measure of AI model openness across availability and transparency

Openness is not just the ability to download model weights. It is also licensing, data and methodology - we developed a framework underpinning the Artificial Analysis Openness Index to incorporate these elements. It allows developers, users, and labs to compare across all these aspects of openness on a standardized basis, and brings visibility to labs advancing the open AI ecosystem.

A model with a score of 100 in Openness Index would be open weights and permissively licensed with full training code, pre-training data and post-training data released - allowing users to not just use the model but reproduce its training in full, or take inspiration from some or all of the model creator’s approach to build their own model. We have not yet awarded any models a score of 100!

Key details:
🔒 Few models and providers take a fully open approach. We see a strong and growing ecosystem of open weights models, including leading models from Chinese labs such as Kimi K2, Minimax M2, and DeepSeek V3.2. However, releases of data and methodology are much rarer - OpenAI’s gpt-oss family is a prominent example of open weights and Apache 2.0 licensing, but minimal disclosure otherwise.

🥇 OLMo from @allen_ai leads the Openness Index at launch. Living up to AI2’s mission to provide ‘truly open’ research, the OLMo family achieves the top score of 89 (16 of a maximum of 18 points) on the Index by prioritizing full replicability and permissive licensing across weights, training data, and code. With the recent launch of OLMo 3, this included the latest version of AI2’s data, utilities and software, full details on reasoning model training, and the new Dolci post-training dataset.

🥈 NVIDIA’s Nemotron family also performs strongly for openness. @NVIDIAAI models such as NVIDIA Nemotron Nano 9B v2 reach a score of 67 on the Index due to their release alongside extensive technical reports detailing their training process, open source tooling for building models like them, and the Nemotron-CC and Nemotron post-training datasets.

📉 We’re tracking both open weights and closed weights models. Openness Index is a new way to think about how open models are, and we will be ranking closed weights models alongside open weights models to recognize the scope of methodology and data transparency associated with closed model releases.

Methodology & Context:
➤ We analyze openness using a standardized framework covering model availability (weights & license) and model transparency (data and methodology). This means we capture not just how freely a model can be used, but visibility into its training and knowledge, and potential to replicate or build on its capabilities or data.

➤ Model availability is measured based on the access and licensing of the model/weights themselves, while transparency comprises subcomponents for access and licensing for methodology, pre-training data, and post-training data.

➤ As seen with releases like DeepSeek R1, sharing methodology accelerates progress. We hope the Index encourages labs to balance competitive moats with the benefits of sharing the "how" alongside the "what."

➤ AI model developers may choose not to fully open their models for a wide range of reasons. We feel strongly that there are important advantages to the open AI ecosystem and supporting the open ecosystem is a key reason we developed the Openness Index. We do not, however, wish to dismiss the legitimacy of the tradeoffs that greater openness comes with, and we do not intend to treat Openness Index as a strictly ‘higher is better’ scale.

See below for further analysis and details 👇

The Openness Index breaks down a total of 18 points across the four subcomponents, and we then represent the overall value on a normalized 0-100 scale. We will continue to review and iterate this framework as the model ecosystem develops and new factors emerge.

In today’s model landscape, transparency is much rarer than availability. While we see a wide range of models with open weights and permissive licensing, nearly all are clustered in the top left quadrant of the chart with lower-end transparency. This reflects the current state of the ecosystem - many models have open weights, but few have open data or methodologies.

Read 5 tweets

Share this page!

Enter URL or ID to Unroll

Artificial Analysis

Try unrolling a thread yourself!

More from @ArtificialAnlys

Artificial Analysis

Artificial Analysis

Artificial Analysis

Artificial Analysis

Artificial Analysis

Artificial Analysis

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!