DeepSeek’s R1 leaps over xAI, Meta and Anthropic to be tied as the world’s #2 AI Lab and the undisputed open-weights leader
DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index, our index of 7 leading evaluations that we run independently across all leading models. That’s the same magnitude of increase as the difference between OpenAI’s o1 and o3 (62 to 70).
This positions DeepSeek R1 above xAI’s Grok 3 mini (high), NVIDIA’s Llama Nemotron Ultra, Meta’s Llama 4 Maverick and Alibaba’s Qwen3 235B, and on par with Google’s Gemini 2.5 Pro.
Breakdown of the model’s improvement:
🧠 Intelligence increases across the board: Biggest jumps seen in AIME 2024 (Competition Math, +21 points), LiveCodeBench (Code generation, +15 points), GPQA Diamond (Scientific Reasoning, +10 points) and Humanity’s Last Exam (Reasoning & Knowledge, +6 points)
🏠 No change to architecture: R1-0528 is a post-training update with no change to the V3/R1 architecture - it remains a large 671B model with 37B active parameters
🧑💻 Significant leap in coding skills: R1 is now matching Gemini 2.5 Pro in the Artificial Analysis Coding Index and is behind only o4-mini (high) and o3
🗯️ Increased token usage: R1-0528 used 99 million tokens to complete the evals in the Artificial Analysis Intelligence Index, 40% more than the original R1’s 71 million tokens - i.e. the new R1 thinks for longer than the original R1. This is still not the highest token usage we have seen: Gemini 2.5 Pro uses 30% more tokens than R1-0528
Takeaways for AI:
👐 The gap between open and closed models is smaller than ever: open weights models have continued to deliver intelligence gains in line with proprietary models. DeepSeek’s R1 release in January was the first time an open-weights model achieved the #2 position, and today’s R1 update brings it back to that position
🇨🇳 China remains neck and neck with the US: models from China-based AI labs have all but completely caught up to their US counterparts, and this release continues that trend. As of today, DeepSeek leads US-based AI labs including Anthropic and Meta in the Artificial Analysis Intelligence Index
🔄 Improvements driven by reinforcement learning: DeepSeek has shown substantial intelligence improvements with the same architecture and pre-train as their original DeepSeek R1 release. This highlights the continually increasing importance of post-training, particularly for reasoning models trained with reinforcement learning (RL) techniques. OpenAI disclosed a 10x scaling of RL compute between o1 and o3 - DeepSeek have just demonstrated that so far, they can keep up with OpenAI’s RL compute scaling. Scaling RL demands less compute than scaling pre-training and offers an efficient way of achieving intelligence gains, supporting AI Labs with fewer GPUs
See further analysis below 👇
DeepSeek has maintained its status among the AI labs leading in frontier AI intelligence
Today’s DeepSeek R1 update is substantially more verbose in its responses (including reasoning tokens) than the January release. DeepSeek R1 May used 99M tokens to run the 7 evaluations in our Intelligence Index, 40% more tokens than the prior release
Congratulations to @FireworksAI_HQ, @parasail_io, @novita_labs, @DeepInfra, @hyperbolic_labs, @klusterai, @deepseek_ai and @nebiusai on being fast to launch endpoints
Xiaomi has just launched MiMo-V2-Flash, a 309B open weights reasoning model that scores 66 on the Artificial Analysis Intelligence Index. This release places Xiaomi alongside other leading AI model labs.
Key benchmarking takeaways:
➤ Strengths in Agentic Tool Use and Competition Math: MiMo-V2-Flash scores 95% on τ²-Bench Telecom and 96% on AIME 2025, demonstrating strong performance on agentic tool-use workflows and competition-style mathematical reasoning. MiMo-V2-Flash currently leads the τ²-Bench Telecom category among evaluated models
➤ Cost competitive: The full Artificial Analysis evaluation suite cost just $53 to run. This is supported by MiMo-V2-Flash’s highly competitive pricing of $0.10 per million input tokens and $0.30 per million output tokens, making it particularly attractive for cost-sensitive deployments and large-scale production workloads. This is similar to DeepSeek V3.2 ($54 total cost to run), and well below GPT-5.2 ($1,294 total cost to run)
➤ High token usage: MiMo-V2-Flash demonstrates high verbosity and token usage relative to other models in the same intelligence tier, using ~150M reasoning tokens across the Artificial Analysis Intelligence Index evaluations
➤ Open weights: MiMo-V2-Flash is open weights, with 309B total parameters and 15B active at inference time. Weights are released under an MIT license, continuing the trend of Chinese AI model labs open sourcing their frontier models
See below for further analysis:
MiMo-V2-Flash demonstrates particular strength in agentic tool-use and Competition Math, scoring 95% on τ²-Bench Telecom and 96% on AIME 2025. This places it amongst the best performing models in these categories.
MiMo-V2-Flash is one of the most cost-effective models for its intelligence, priced at only $0.10 per million input tokens and $0.30 per million output tokens.
Announcing Stirrup, our new open source framework for building agents. It’s lightweight, flexible and extensible, and incorporates best practices from leading agents like Claude Code
Stirrup differs from other agent frameworks by avoiding the rigidity that can degrade output quality. Stirrup lets models drive their own workflow, like Claude Code, while still giving developers structure and building in essential features like context management, MCP support and code execution. We use Stirrup at Artificial Analysis as part of our agentic benchmarks, including as part of our GDPval-AA evaluation being released later today. Just ‘pip install stirrup’ to start building your own agents today!
Key advantages:
➤ Works with the model, not against it: Stirrup steps aside and lets the model decide how to solve multi-step tasks, as opposed to existing frameworks which impose strict patterns that limit performance.
➤ Best practices built in: We studied leading agent systems (e.g. Claude Code) to extract practical patterns around context handling, tool design, and workflow stability, and embedded those directly into the framework.
➤ Fully customizable: Use Stirrup as a package or as a starting template to build your own fully customized agents.
Feature highlights:
➤ Essential tools ready to use: Ships with pre-built tools such as online search and browsing, code execution (local, Docker, or using an @e2b sandbox), an MCP client and document I/O
➤ Flexible tool layer: A Generic Tool interface makes it simple to define and extend custom tools
➤ Context management: Automatic summarization to stay within context limits while preserving task fidelity
➤ Provider flexibility: Built-in support for OpenAI-compatible APIs (including @OpenRouterAI) and LiteLLM, or bring your own client
➤ Multimodal support: Process images, video, and audio with automatic format handling
Stirrup agents can be easily set up in just a few lines of code
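To illustrate what that looks like, here is a minimal hypothetical sketch - the class and module names below (Agent, stirrup.tools, etc.) are illustrative assumptions rather than Stirrup’s documented API, so check the package docs for the real interface:

```python
# Hypothetical sketch only - names below are illustrative assumptions and may
# not match Stirrup's actual API.
from stirrup import Agent                          # assumed entry point
from stirrup.tools import WebSearch, CodeExecutor  # assumed built-in tools

agent = Agent(
    model="openai/gpt-5.1",                 # any OpenAI-compatible or LiteLLM model id
    tools=[WebSearch(), CodeExecutor()],    # the model decides when and how to call these
)

result = agent.run("Summarise this repo's latest commits and draft release notes")
print(result.output)
```

The key design point is that the loop driving the task stays model-led: Stirrup supplies tools, context management and logging rather than a fixed workflow graph.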
Stirrup includes built-in logging to help you observe and debug agents
Introducing the Artificial Analysis Openness Index: a standardized and independently assessed measure of AI model openness across availability and transparency
Openness is not just the ability to download model weights. It also encompasses licensing, data and methodology - the framework underpinning the Artificial Analysis Openness Index incorporates all of these elements. It allows developers, users, and labs to compare all these aspects of openness on a standardized basis, and brings visibility to labs advancing the open AI ecosystem.
A model with a score of 100 in Openness Index would be open weights and permissively licensed with full training code, pre-training data and post-training data released - allowing users to not just use the model but reproduce its training in full, or take inspiration from some or all of the model creator’s approach to build their own model. We have not yet awarded any models a score of 100!
Key details:
🔒 Few models and providers take a fully open approach. We see a strong and growing ecosystem of open weights models, including leading models from Chinese labs such as Kimi K2, Minimax M2, and DeepSeek V3.2. However, releases of data and methodology are much rarer - OpenAI’s gpt-oss family is a prominent example of open weights and Apache 2.0 licensing, but minimal disclosure otherwise.
🥇 OLMo from @allen_ai leads the Openness Index at launch. Living up to AI2’s mission to provide ‘truly open’ research, the OLMo family achieves the top score of 89 (16 of a maximum of 18 points) on the Index by prioritizing full replicability and permissive licensing across weights, training data, and code. With the recent launch of OLMo 3, this included the latest version of AI2’s data, utilities and software, full details on reasoning model training, and the new Dolci post-training dataset.
🥈 NVIDIA’s Nemotron family also performs strongly for openness. @NVIDIAAI models such as NVIDIA Nemotron Nano 9B v2 reach a score of 67 on the Index due to their release alongside extensive technical reports detailing their training process, open source tooling for building models like them, and the Nemotron-CC and Nemotron post-training datasets.
📉 We’re tracking both open weights and closed weights models. Openness Index is a new way to think about how open models are, and we will be ranking closed weights models alongside open weights models to recognize the scope of methodology and data transparency associated with closed model releases.
Methodology & Context:
➤ We analyze openness using a standardized framework covering model availability (weights & license) and model transparency (data and methodology). This means we capture not just how freely a model can be used, but also the visibility into its training and knowledge, and the potential to replicate or build on its capabilities or data.
➤ Model availability is measured based on the access and licensing of the model/weights themselves, while transparency comprises subcomponents for access and licensing for methodology, pre-training data, and post-training data.
➤ As seen with releases like DeepSeek R1, sharing methodology accelerates progress. We hope the Index encourages labs to balance competitive moats with the benefits of sharing the "how" alongside the "what."
➤ AI model developers may choose not to fully open their models for a wide range of reasons. We feel strongly that there are important advantages to the open AI ecosystem and supporting the open ecosystem is a key reason we developed the Openness Index. We do not, however, wish to dismiss the legitimacy of the tradeoffs that greater openness comes with, and we do not intend to treat Openness Index as a strictly ‘higher is better’ scale.
See below for further analysis and details 👇
The Openness Index breaks down a total of 18 points across the four subcomponents, and we then represent the overall value on a normalized 0-100 scale. We will continue to review and iterate this framework as the model ecosystem develops and new factors emerge.
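As a concrete illustration of that normalization, here is a minimal sketch, assuming a simple linear mapping from raw points to the 0-100 scale (consistent with the OLMo example above, where 16 of 18 points corresponds to a score of 89):

```python
def openness_index(points: int, max_points: int = 18) -> int:
    """Normalize a raw Openness Index score (points out of 18) to a 0-100 scale."""
    return round(points / max_points * 100)

print(openness_index(16))  # 89 - matches OLMo's reported score
```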
In today’s model landscape, transparency is much rarer than availability. While we see a wide range of models with open weights and permissive licensing, nearly all are clustered in the top left quadrant of the chart with lower-end transparency. This reflects the current state of the ecosystem - many models have open weights, but few have open data or methodologies.
Anthropic’s new Claude Opus 4.5 is the #2 most intelligent model in the Artificial Analysis Intelligence Index, narrowly behind Google’s Gemini 3 Pro and tying OpenAI’s GPT-5.1 (high)
Claude Opus 4.5 delivers a substantial intelligence uplift over Claude Sonnet 4.5 (+7 points on the Artificial Analysis Intelligence Index) and Claude Opus 4.1 (+11 points), establishing it as @AnthropicAI's new leading model. Anthropic has dramatically cut per-token pricing for Claude Opus 4.5 to $5/$25 per million input/output tokens. However, it used 60% more tokens than the prior Claude Opus 4.1 to complete our Intelligence Index evaluations (48M vs. 30M). The net result is a substantial reduction in the cost to run our Intelligence Index evaluations, from $3.1k to $1.5k, but not as large as the headline price cut implies. Because Claude Opus 4.5 uses substantially more tokens to complete our Intelligence Index, it still cost significantly more to run than other models including Gemini 3 Pro (high), GPT-5.1 (high), and Claude Sonnet 4.5 (Thinking); among all models, only Grok 4 (Reasoning) cost more.
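To see why the cost reduction is smaller than the headline price cut, here is a rough back-of-the-envelope sketch. It assumes output tokens dominate the cost of running the Intelligence Index, which is a simplification - the reported $3.1k and $1.5k figures also include input token costs:

```python
# Rough back-of-the-envelope: price cut vs. extra token usage for Claude Opus 4.5.
# Assumes output tokens dominate evaluation cost (a simplification).
price_out_opus_41 = 75.0   # $ per 1M output tokens, Claude Opus 4.1
price_out_opus_45 = 25.0   # $ per 1M output tokens, Claude Opus 4.5

tokens_opus_41 = 30e6      # output tokens used to complete the Intelligence Index
tokens_opus_45 = 48e6      # ~60% more output tokens

cost_41 = price_out_opus_41 * tokens_opus_41 / 1e6   # ~ $2,250
cost_45 = price_out_opus_45 * tokens_opus_45 / 1e6   # ~ $1,200

# Per-token pricing fell ~3x, but token usage rose ~1.6x, so the effective cost only
# falls by roughly 1.6/3 = 0.53x - broadly in line with the $3.1k -> $1.5k figures.
print(f"effective cost ratio: {cost_45 / cost_41:.2f}")
```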
Key benchmarking takeaways:
➤ 🧠 Anthropic’s most intelligent model: In reasoning mode, Claude Opus 4.5 scores 70 on the Artificial Analysis Intelligence Index. This is a jump of +7 points from Claude Sonnet 4.5 (Thinking), which was released in September 2025, and +11 points from Claude Opus 4.1 (Thinking). Claude Opus 4.5 is now the second most intelligent model. It places ahead of Grok 4 (65) and Kimi K2 Thinking (67), ties GPT-5.1 (high, 70), and trails only Gemini 3 Pro (73). Claude Opus 4.5 (Thinking) scores 5% on CritPt, a frontier physics eval reflective of research assistant capabilities. It sits only behind Gemini 3 Pro (9%) and ties GPT-5.1 (high, 5%)
➤ 📈 Largest increases in coding and agentic tasks: Compared to Claude Sonnet 4.5 (Thinking), the biggest uplifts appear across coding, agentic tasks, long-context reasoning and knowledge, including LiveCodeBench (+16 p.p.), Terminal-Bench Hard (+11 p.p.), 𝜏²-Bench Telecom (+12 p.p.), AA-LCR (+8 p.p.), and Humanity's Last Exam (+11 p.p.). Claude Opus 4.5 achieves Anthropic’s best scores yet across all 10 benchmarks in the Artificial Analysis Intelligence Index. It also earns the highest score on Terminal-Bench Hard (44%) of any model and ties Gemini 3 Pro on MMLU-Pro (90%)
➤ 📚 Knowledge and Hallucination: In our recently launched AA-Omniscience Index, which measures embedded knowledge and hallucination of language models, Claude Opus 4.5 places 2nd with a score of 10. It sits only behind Gemini 3 Pro Preview (13) and ahead of Claude Opus 4.1 (Thinking, 5) and GPT-5.1 (high, 2). Claude Opus 4.5 (Thinking) scores the second-highest accuracy (43%) and has the 4th-lowest hallucination rate (58%), trailing only Claude Haiku (Thinking, 26%), Claude Sonnet 4.5 (Thinking, 48%), and GPT-5.1 (high). Claude Opus 4.5 continues to demonstrate Anthropic’s leadership in AI safety with a lower hallucination rate than select other frontier models such as Grok 4 and Gemini 3 Pro
➤ ⚡ Non-reasoning performance: In non-reasoning mode, Claude Opus 4.5 scores 60 on the Artificial Analysis Intelligence Index and is the most intelligent non-reasoning model. It places ahead of Qwen3 Max (55), Kimi K2 0905 (50), and Claude Sonnet 4.5 (50)
➤ ⚙️ Token efficiency: Anthropic continues to demonstrate impressive token efficiency: Claude Opus 4.5 improves intelligence without a significant increase in token usage (compared to Claude Sonnet 4.5, evaluated with a maximum reasoning budget of 64k tokens). Claude Opus 4.5 uses 48M output tokens to run the Artificial Analysis Intelligence Index - lower than other frontier models such as Gemini 3 Pro (high, 92M), GPT-5.1 (high, 81M), and Grok 4 (Reasoning, 120M)
➤ 💲 Pricing: Anthropic has reduced the per-token pricing of Claude Opus 4.5 compared to Claude Opus 4.1. Claude Opus 4.5 is priced at $5/$25 per 1M input/output tokens (vs. $15/$75 for Claude Opus 4.1). This positions it much closer to Claude Sonnet 4.5 ($3/$15 per 1M tokens) while offering higher intelligence in thinking mode
Key model details:
➤ 📏 Context window: 200K tokens
➤ 🪙 Max output tokens: 64K tokens
➤ 🌐 Availability: Claude Opus 4.5 is available via Anthropic’s API, Google Vertex, Amazon Bedrock and Microsoft Azure. Claude Opus 4.5 is also available via the Claude app and Claude Code
A key differentiator for the Claude models remains that they are substantially more token-efficient than all other reasoning models. Claude Opus 4.5 has significantly increased intelligence without a large increase in output tokens, differing substantially from other model families that rely on greater reasoning at inference time (i.e., more output tokens). On the Output Tokens Used in Artificial Analysis Intelligence Index vs Intelligence Index chart, Claude Opus 4.5 (Thinking) sits on the Pareto frontier.
This output token efficiency contributes to Claude Opus 4.5 (in Thinking mode) offering a better tradeoff between intelligence and cost to run the Artificial Analysis Intelligence Index than Claude Opus 4.1 (Thinking) and Grok 4 (Reasoning).
Gemini 3 Pro is the new leader in AI. Google has the leading language model for the first time, with Gemini 3 Pro debuting +3 points above GPT-5.1 in our Artificial Analysis Intelligence Index
@GoogleDeepMind gave us pre-release access to Gemini 3 Pro Preview. The model outperforms all other models in the Artificial Analysis Intelligence Index. It demonstrates strength across the board, coming in first in 5 of the 10 evaluations that make up the Intelligence Index. Despite these intelligence gains, Gemini 3 Pro Preview shows improved token efficiency compared to Gemini 2.5 Pro, using significantly fewer tokens on the Intelligence Index than other leading models such as Kimi K2 Thinking and Grok 4. However, given its premium pricing ($2/$12 per million input/output tokens for ≤200K context), Gemini 3 Pro is among the most expensive models to run on our Intelligence Index evaluations.
Key takeaways:
📖 Leading intelligence: Gemini 3 Pro Preview is the leading model in 5 of 10 evals in the Artificial Analysis Intelligence Index, including GPQA Diamond, MMLU-Pro, HLE, LiveCodeBench and SciCode. Its score of 37% on Humanity’s Last Exam is particularly impressive, improving on the previous best model by more than 10 percentage points. It also leads AA-Omniscience, Artificial Analysis’ new knowledge and hallucination evaluation, coming first in both Omniscience Index (our lead metric that deducts points for incorrect answers) and Omniscience Accuracy (percentage correct). Given that factual recall correlates closely with model size, this may point to Gemini 3 Pro being a much larger model than its competitors
💻 Advanced coding and agentic capabilities: Gemini 3 Pro Preview leads two of the three coding evaluations in the Artificial Analysis Intelligence Index, including an impressive 56% in SciCode, an improvement of over 10 percentage points from the previous highest score. It is also strong in agentic contexts, achieving the second highest score in Terminal-Bench Hard and Tau2-Bench Telecom
🖼️ Multimodal capabilities: Gemini 3 Pro Preview is a multi-modal model, with the ability to take text, images, video and audio as input. It scores the highest of any model on MMMU-Pro, a benchmark that tests reasoning abilities with image inputs. Google now occupies the first, third and fourth positions in our MMMU-Pro leaderboard (with GPT-5.1 taking second place just last week)
💲 Premium pricing: To measure cost, we report Cost to Run the Artificial Analysis Intelligence Index, which combines input and output token prices with token efficiency to reflect true usage cost (see the sketch after this list). Despite the improvement in token efficiency from Gemini 2.5 Pro, Gemini 3 Pro Preview costs more to run. Its higher token pricing of $2/$12 USD per million input/output tokens (≤200k token context) results in a 12% increase in the cost to run the Artificial Analysis Intelligence Index compared to its predecessor, and the model is among the most expensive to run on our Intelligence Index. Google also continues to price long context workloads higher than shorter context workloads, charging $4/$18 per million input/output tokens for >200k token context.
⚡ Speed: Gemini 3 Pro Preview has comparable speeds to Gemini 2.5 Pro, with 128 output tokens per second. This places it ahead of other frontier models including GPT-5.1 (high), Kimi K2 Thinking and Grok 4. This is potentially supported by Google’s first-party TPU accelerators
Other details: Gemini 3 Pro Preview has a 1 million token context window, and includes support for tool calling, structured outputs, and JSON mode
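For reference on how the Cost to Run figure mentioned above combines pricing with token efficiency, here is a minimal sketch. The token counts are illustrative only (output is set to 92M, in line with figures we report elsewhere; the input count is a hypothetical assumption), and our real accounting also handles the higher >200k-context tier:

```python
def cost_to_run(input_tokens: float, output_tokens: float,
                price_in: float, price_out: float) -> float:
    """Cost in USD to run an evaluation suite, given token counts and per-1M-token prices."""
    return (input_tokens * price_in + output_tokens * price_out) / 1e6

# Illustrative only: Gemini 3 Pro Preview's <=200k-context pricing ($2/$12 per 1M
# input/output tokens) with illustrative token counts for an evaluation run.
print(cost_to_run(input_tokens=20e6, output_tokens=92e6, price_in=2.0, price_out=12.0))
# -> 1144.0 USD, i.e. ~$1.1k under these assumptions
```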
See below for further analysis
For the first time, Google has the most intelligent model, with Gemini 3 Pro Preview improving on the previous most intelligent model, OpenAI’s GPT-5.1 (high), by 3 points
Gemini 3 Pro Preview takes the top spot on the Artificial Analysis Omniscience Index, our new benchmark for measuring knowledge and hallucination across domains. Gemini 3 Pro Preview comes in first for both Omniscience Index (our lead metric that takes off points for incorrect answers) and Omniscience Accuracy (percentage correct).
Its win in Accuracy is actually much larger than its overall Index win - the difference is driven by a higher Hallucination Rate (88%) than other leading models, which the Index penalizes.
We have previously shown that Omniscience Accuracy is closely correlated with model size (total parameter count). Gemini 3 Pro’s significant lead in this metric may point to it being a much larger model than its competitors.
Announcing AA-Omniscience, our new benchmark for knowledge and hallucination across >40 topics, where all but three models are more likely to hallucinate than give a correct answer
Embedded knowledge in language models is important for many real world use cases. Without knowledge, models make incorrect assumptions and are limited in their ability to operate in real world contexts. Tools like web search can help, but models still need to know what to search for (e.g. a model should not search for ‘Multi Client Persistence’ when an MCP query clearly refers to ‘Model Context Protocol’).
Hallucination of factual information is a barrier to being able to rely on models and has been perpetuated by every major evaluation dataset. Grading correct answers with no penalty for incorrect answers creates an incentive for models (and the labs training them) to attempt every question. This problem is clearest when it comes to knowledge: factual information should never be made up, while in other contexts attempts that might not work are useful (e.g. coding new features).
Omniscience Index is the key metric we report for AA-Omniscience, and it punishes hallucinations by deducting points where models guess rather than admitting they do not know the answer. AA-Omniscience shows that all but three models are more likely to hallucinate than provide a correct answer when given a difficult question. AA-Omniscience will complement the Artificial Analysis Intelligence Index by adding measurement of knowledge and the probability of hallucination.
Details below, and more charts in the thread.
AA-Omniscience details:
- 🔢 6,000 questions across 42 topics within 6 domains (‘Business’, ‘Humanities & Social Sciences’, ‘Health’, ‘Law’, ‘Software Engineering’, and ‘Science, Engineering & Mathematics’)
- 🔍 89 sub-topics including Python data libraries, Public Policy, Taxation, and more, giving a sharper view of where models excel and where they fall short across nuanced domains
- 🔄 Incorrect answers are penalized in our Omniscience Index metric to punish hallucinations
- 📊 3 metrics: Accuracy (% of all questions answered correctly), Hallucination Rate (% of questions not answered correctly where the model gave an incorrect answer rather than abstaining), and Omniscience Index (+1 for a correct answer, -1 for an incorrect answer, 0 for abstentions where the model did not attempt an answer) - see the sketch after this list
- 🤗 Open source test dataset: We’re open sourcing 600 questions (10%) to help labs develop factual and reliable models. The topic distribution and model performance mirror the full set (@huggingface link below)
- 📃 Paper: See below for a link to the research paper
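To make the metric definitions above concrete, here is a minimal sketch of how the three metrics can be computed from per-question outcome counts. This is our illustrative code rather than the official scoring implementation, the counts are hypothetical, and the published Index may be reported on a scaled (e.g. x100) basis:

```python
def omniscience_metrics(correct: int, incorrect: int, abstained: int):
    """Illustrative computation of the three AA-Omniscience metrics from outcome counts."""
    total = correct + incorrect + abstained
    accuracy = correct / total                                # share of all questions answered correctly
    hallucination_rate = incorrect / (incorrect + abstained)  # how often the model guesses when it doesn't know
    omniscience_index = (correct - incorrect) / total         # +1 correct, -1 incorrect, 0 abstention, averaged
    return accuracy, hallucination_rate, omniscience_index

# Hypothetical counts: high accuracy but frequent guessing still yields an Index near zero.
acc, halluc, index = omniscience_metrics(correct=2580, incorrect=2400, abstained=1020)
print(f"accuracy={acc:.0%}, hallucination_rate={halluc:.0%}, omniscience_index={index:+.2f}")
```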
Key findings:
- 🥇 Claude 4.1 Opus takes first place in Omniscience Index, followed by last week’s GPT-5.1 and Grok 4: Even the best frontier models score only slightly above 0, meaning they produce correct answers on the difficult questions that make up AA-Omniscience only marginally more often than incorrect ones. @AnthropicAI’s leadership is driven by low hallucination rate, whereas OpenAI and xAI’s positions are primarily driven by higher accuracy (percentage correct).
- 🥇 xAI’s Grok 4 takes first place in Omniscience Accuracy (our simple ‘percentage correct’ metric), followed by GPT-5 and Gemini 2.5 Pro: @xai's win may be enabled by scaling total parameters and pre-training compute: @elonmusk revealed last week that Grok 4 has 3 trillion total parameters, which may be larger than GPT-5 and other proprietary models
- 🥇 Claude sweeps the hallucination leaderboard: Anthropic takes the top three spots for lowest hallucination rate, with Claude 4.5 Haiku leading at 28%, over three times lower than GPT-5 (high) and Gemini 2.5 Pro. Claude 4.5 Sonnet and Claude 4.1 Opus follow in second and third at 48%
- 💭 High knowledge does not guarantee low hallucination: Hallucination rate measures how often a model guesses when it lacks the required knowledge. Models with the highest accuracy, including the GPT-5 models and Gemini 2.5 Pro, do not lead the Omniscience Index due to their tendency to guess over abstaining. Anthropic models tend to manage uncertainty better, with Claude 4.5 Haiku achieving the lowest hallucination rate at 26%, ahead of 4.5 Sonnet and 4.1 Opus (48%)
- 📊 Models vary by domain: Models differ in their performance across the six domains of AA-Omniscience - no model dominates across all. While Anthropic’s Claude 4.1 Opus leads in Law, Software Engineering, and Humanities & Social Sciences, GPT-5.1 from @OpenAI achieves the highest reliability on Business questions, and xAI’s Grok 4 performs best in Health and in Science, Engineering & Mathematics. Model choice should align with the use case rather than simply picking the overall leader
- 📈 Larger models score higher on accuracy, but not always reliability: Larger models tend to have higher levels of embedded knowledge, with Kimi K2 Thinking and DeepSeek R1 (0528) topping accuracy charts over smaller models. This advantage does not always hold on the Omniscience Index. For example, Llama 3.1 405B from @AIatMeta beats larger Kimi K2 variants due to having one of the lowest hallucination rates among models (51%)
Grok 4 by @xai, GPT-5 by @OpenAI and Gemini 2.5 Pro by @GoogleDeepMind achieve the highest accuracy in AA-Omniscience. They do not, however, achieve the highest Omniscience Index, due to the lower hallucination rates of @AnthropicAI’s Claude models
@AnthropicAI takes the top three spots for lowest hallucination rate, with Claude 4.5 Haiku leading at 28%, over three times lower than GPT-5 (high) and Gemini 2.5 Pro. Claude 4.5 Sonnet and Claude 4.1 Opus follow in second and third at 48%