Independent benchmarks of OpenAI’s gpt-oss models: gpt-oss-120b is the most intelligent American open weights model; it comes in behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits
OpenAI has released two versions of gpt-oss:
➤ gpt-oss-120b (116.8B total parameters, 5.1B active parameters): Intelligence Index score of 58
➤ gpt-oss-20b (20.9B total parameters, 3.6B active parameters): Intelligence Index score of 48
Size & deployment: OpenAI has released both models in MXFP4 precision: gpt-oss-120b comes in at just 60.8GB and gpt-oss-20b just 12.8GB. This means that the 120B can be run in its native precision on a single NVIDIA H100, and the 20B can be run easily on a consumer GPU or laptop with >16GB of RAM. Additionally, the relatively small proportion of active parameters will contribute to their efficiency and speed for inference: the 5.1B active parameters of the 120B model can be contrasted with Llama 4 Scout’s 109B total parameters and 17B active (a lot less sparse). This makes it possible to get dozens of output tokens/s for the 20B on recent MacBooks.
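As a rough sanity check on those file sizes, here is a back-of-envelope sketch in Python. It assumes ~4.25 bits per parameter for MXFP4 (4-bit values plus shared block scales); actual checkpoints differ slightly because not every tensor is packed the same way, so treat the numbers as approximations rather than OpenAI’s exact layout.

```python
def approx_checkpoint_gb(total_params_billions: float, bits_per_param: float = 4.25) -> float:
    """Rough checkpoint size in GB for a given parameter count and precision.

    bits_per_param=4.25 approximates MXFP4 (4-bit values plus shared block scales).
    Real checkpoints differ slightly because some tensors are stored in higher precision.
    """
    total_bits = total_params_billions * 1e9 * bits_per_param
    return total_bits / 8 / 1e9  # bits -> bytes -> GB (decimal)

print(approx_checkpoint_gb(116.8))  # ~62 GB vs. the published 60.8GB for gpt-oss-120b
print(approx_checkpoint_gb(20.9))   # ~11 GB vs. the published 12.8GB for gpt-oss-20b
```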
Intelligence: Both models score extremely well for their size and sparsity. We’re seeing the 120B beat o3-mini but come in behind o4-mini and o3. The 120B is the most intelligent model that can be run on a single H100 and the 20B is the most intelligent model that can be run on a consumer GPU. Both models appear to place similarly across most of our evals, indicating no particular areas of weakness.
Comparison to other open weights models: While the larger gpt-oss-120b does not come in above DeepSeek R1 0528’s score of 59 or Qwen3 235B 2507’s score of 64, it is notable that it is significantly smaller in both total and active parameters than both of those models. DeepSeek R1 has 671B total parameters and 37B active parameters, and is released natively in FP8 precision, making its total file size (and memory requirements) over 10x larger than gpt-oss-120b. Both gpt-oss-120b and 20b are text-only models (similar to competing models from DeepSeek, Alibaba and others).
Architecture: The MoE architecture appears fairly standard. The MoE router selects the top 4 experts for each token generation. The 120B has 36 layers and the 20B has 24 layers. Each layer has 64 query heads and uses Grouped Query Attention with 8 KV heads. Rotary embeddings and YaRN are used to extend the context window to 128k. The 120B model activates 4.4% of total parameters per forward pass, whereas the 20B model activates 17.2% of total parameters. This may indicate that OpenAI’s perspective is that a higher degree of sparsity is optimal for larger models. It has been widely speculated that most top models from frontier labs have been sparse MoEs for most releases since GPT-4.
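The sparsity figures quoted above follow directly from the parameter counts; a minimal sketch using only the numbers stated in this post (Llama 4 Scout included purely for comparison):

```python
# Active-parameter fraction per forward pass, computed from the figures quoted above.
models = {
    "gpt-oss-120b":  (116.8, 5.1),   # (total B params, active B params)
    "gpt-oss-20b":   (20.9, 3.6),
    "Llama 4 Scout": (109.0, 17.0),  # for comparison
}

for name, (total_b, active_b) in models.items():
    print(f"{name}: {active_b / total_b:.1%} of parameters active per token")

# gpt-oss-120b: 4.4%, gpt-oss-20b: 17.2%, Llama 4 Scout: 15.6% (far less sparse than the 120B)
```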
API Providers: A number of inference providers have been quick to launch endpoints. We are currently benchmarking @GroqInc, @CerebrasSystems, @FireworksAI_HQ and @togethercompute on Artificial Analysis and will add more providers as they launch endpoints.
Pricing: We’re tracking median pricing across API providers of $0.15/$0.69 per million input/output tokens for the 120B and $0.08/$0.35 for the 20B. These prices put the 120B close to 10x cheaper than OpenAI’s proprietary APIs for o4-mini ($1.1/$4.4) and o3 ($2/$8).
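To put those median prices in context, here is a quick cost comparison for a hypothetical workload. The 1M input + 1M output token mix is purely illustrative (no caching or batch discounts applied); prices are the per-1M-token figures quoted above.

```python
# Cost (USD) of an illustrative workload: 1M input tokens + 1M output tokens.
prices = {
    "gpt-oss-120b (median provider)": (0.15, 0.69),
    "gpt-oss-20b (median provider)":  (0.08, 0.35),
    "o4-mini":                        (1.10, 4.40),
    "o3":                             (2.00, 8.00),
}

for model, (input_price, output_price) in prices.items():
    print(f"{model}: ${input_price + output_price:.2f}")

# gpt-oss-120b: $0.84, gpt-oss-20b: $0.43, o4-mini: $5.50, o3: $10.00
# -> on this mix the 120B works out ~6.5x cheaper than o4-mini and ~12x cheaper than o3
```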
License: Apache 2.0 license - very permissive!
See below for further analysis:
Intelligence vs. Total Parameters: gpt-oss-120B is the most intelligent model that can fit on a single H100 GPU in its native precision.
Pricing: Across the API providers who have launched day one API coverage, we’re seeing median prices of $0.15/$0.69 per million input/output tokens for the 120B and $0.08/$0.35 for the 20B. This makes both gpt-oss models highly cost efficient options for developers.
Output token usage: Relative to other reasoning models, both models are quite efficient even in their ‘high’ reasoning modes, particularly gpt-oss-120b, which used only 21M tokens to run our Artificial Analysis Intelligence Index benchmarks. That is roughly a quarter of the tokens o4-mini (high) took to run the same benchmarks, half of o3’s usage, and fewer than Kimi K2 (a non-reasoning model).
Individual evaluation results from benchmarks we have run independently:
OpenAI gave us early access to GPT-5: our independent benchmarks verify a new high for AI intelligence. We have tested all four GPT-5 reasoning effort levels, revealing 23x differences in token usage and cost between the ‘high’ and ‘minimal’ options and substantial differences in intelligence
We have run our full suite of eight evaluations independently across all reasoning effort configurations of GPT-5 and are reporting benchmark results for intelligence, token usage, and end-to-end latency.
What @OpenAI released: OpenAI has released a single endpoint for GPT-5, but different reasoning efforts offer vastly different intelligence. GPT-5 with reasoning effort “High” reaches a new intelligence frontier, while “Minimal” is near GPT-4.1 level (but more token efficient).
Takeaways from our independent benchmarks:
⚙️ Reasoning effort configuration: GPT-5 offers four reasoning effort configurations: high, medium, low, and minimal. Reasoning effort options steer the model to “think” more or less hard for each query, driving large differences in intelligence, token usage, speed, and cost (see the API sketch after this list).
🧠 Intelligence achieved ranges from frontier to GPT-4.1 level: GPT-5 sets a new standard with a score of 68 on our Artificial Analysis Intelligence Index (MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME, IFBench & AA-LCR) at High reasoning effort. Medium (67) is close to o3, Low (64) sits between DeepSeek R1 and o3, and Minimal (44) is close to GPT-4.1. While High sets a new standard, the increase over o3 is not comparable to the jump from GPT-3 to GPT-4 or GPT-4o to o1.
💬 Token usage varies 23x between reasoning efforts: GPT-5 with High reasoning effort used more tokens than o3 (82M vs. 50M) to complete our Index, but still fewer than Gemini 2.5 Pro (98M) and DeepSeek R1 0528 (99M). However, Minimal reasoning effort used only 3.5M tokens, substantially fewer than GPT-4.1, making GPT-5 Minimal significantly more token-efficient for similar intelligence.
📖 Long Context Reasoning: We released our own Long Context Reasoning (AA-LCR) benchmark earlier this week to test the reasoning capabilities of models across long sequence lengths (sets of documents ~100k tokens in total). GPT-5 stands out for its performance in AA-LCR, with GPT-5 in both High and Medium reasoning efforts topping the benchmark.
🤖 Agentic Capabilities: OpenAI also commented on improvements across capabilities increasingly important to how AI models are used, including agents (long horizon tool calling). We recently added IFBench to our Intelligence Index to cover instruction following and will be adding further evals to cover agentic tool calling to independently test these capabilities.
📡 Vibe checks: We’re testing the personality of the model through MicroEvals on our website, which supports running the same prompt across models and comparing results. It’s free to use; we’ll provide an update with our perspective shortly, but feel free to share your own!
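For illustration, a minimal sketch of selecting the four effort levels via the OpenAI Python SDK. The “gpt-5” model identifier and the reasoning_effort parameter are assumptions based on OpenAI’s existing reasoning-model API; check the official docs for the exact names.

```python
# Minimal sketch: requesting different GPT-5 reasoning efforts via the OpenAI Python SDK.
# The model name "gpt-5" and the `reasoning_effort` values are assumptions based on
# OpenAI's existing reasoning-model API and may differ from the final documentation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

for effort in ("minimal", "low", "medium", "high"):
    response = client.chat.completions.create(
        model="gpt-5",
        reasoning_effort=effort,  # steers how long the model "thinks" before answering
        messages=[{"role": "user", "content": "How many primes are there below 100?"}],
    )
    # Compare token usage and answers across effort levels.
    print(effort, response.usage.completion_tokens, response.choices[0].message.content[:60])
```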
See below for further analysis:
Token usage (verbosity): GPT-5 with reasoning effort High uses 23x more tokens than with reasoning effort Minimal, achieving substantial intelligence gains in doing so; the uplift from Medium to High is smaller.
Individual intelligence benchmark results: GPT-5 performs well across our intelligence evaluations.
🇰🇷 South Korean AI Lab Upstage AI has just launched their first reasoning model - Solar Pro 2! The 31B parameter model demonstrates impressive performance for its size, with intelligence approaching Claude 4 Sonnet in 'Thinking' mode, and it is priced very competitively
Key details:
➤ Hybrid reasoning: The model offers optionality between 'reasoning' mode and standard non-reasoning mode
➤ Korean-language ability & Sovereign AI: Based in Korea, Upstage announced superior performance in Korean language evaluations. This release aligns with countries' interests in developing sovereign AI capabilities
➤ Pricing: Competitively priced at $0.5 per 1M tokens (input & output), significantly cheaper than comparable models including Claude 4 Sonnet Thinking ($3/$15 per 1M input/output tokens) and Magistral Small ($0.5/$1.5 per 1M input/output tokens)
➤ Proprietary: @upstageai has not released the model weights, though they have open-sourced previous Solar Pro models. Whether they will release Solar Pro 2's weights remains unclear as it wasn't mentioned in their announcement
Full suite of our independent intelligence evaluations:
Pricing is competitive for the intelligence Solar Pro 2 (Reasoning) offers
We’re releasing the Artificial Analysis AI Adoption Survey Report for H1 2025 based on >1,000 responses from developers, product managers and executives adopting AI
The Artificial Analysis AI Adoption Survey Report examines key trends in AI usage, analyzing adoption rates, primary use cases driving AI’s growth, and demand across chatbots, coding agents, LLM model families, providers, and chip companies.
A highlights version of the report is available for download on our website for a limited time.
We unpack 6 trends defining the adoption of AI for organizations in the first half of 2025:
1) ⚡ AI has hit production: ~45% are using AI in production, while an additional 50% are prototyping or exploring AI use cases
2) 💡 Engineering and R&D is the clear frontrunner use case: 66% are considering AI for Engineering/R&D, well ahead of the next most popular use cases in Customer Support and Sales & Marketing
3) 📈 Google, xAI and DeepSeek gain share while Meta and Mistral lose share: ~80% are using/considering Google Gemini, 53% DeepSeek & 31% xAI Grok, marking a substantial increase in demand since 2024
4) 🔄 Companies are increasingly diversifying their AI use: the average number of LLMs used/considered has increased from ~2.8 in 2024 to ~4.7 in 2025, as organizations mature their AI use cases
5) 🏗️ Organizations are taking different approaches to Build vs. Buy: 32% of respondents favor building, 27% buying, and 25% a hybrid approach
6) 🇨🇳 Organizations are open to Chinese models, if hosted outside of China: 55% would be willing to use LLMs from China-based AI labs if hosted outside of China
The survey was conducted between April and June 2025, collecting responses from 1,000+ individuals across 90+ countries.
Below we share excerpts covering select important takeaways:
ChatGPT dominates AI chat adoption, followed by Gemini and Claude. Other notable players include Perplexity, xAI Grok and Microsoft Copilot
GitHub Copilot and Cursor dominate the market as the most popular AI coding tools, ahead of Claude Code and Gemini Code Assist (Note: the survey was conducted before the release of OpenAI Codex)
xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model.
We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68. Full results breakdown below.
This is the first time that @elonmusk's @xai has led the AI frontier. Grok 3 scored competitively with the latest models from OpenAI, Anthropic and Google - but Grok 4 is the first time that our Intelligence Index has shown xAI in first place.
We tested Grok 4 via the xAI API. The version of Grok 4 deployed for use on X/Twitter may be different to the model available via API. Consumer application versions of LLMs typically have instructions and logic around the models that can change style and behavior.
Grok 4 is a reasoning model, meaning it ‘thinks’ before answering. The xAI API does not share reasoning tokens generated by the model.
Grok 4’s pricing is equivalent to Grok 3 at $3/$15 per 1M input/output tokens ($0.75 per 1M cached input tokens). The per-token pricing is identical to Claude 4 Sonnet, but more expensive than Gemini 2.5 Pro ($1.25/$10, for <200K input tokens) and o3 ($2/$8, after recent price decrease). We expect Grok 4 to be available via the xAI API, via the Grok chatbot on X, and potentially via Microsoft Azure AI Foundry (Grok 3 and Grok 3 mini are currently available on Azure).
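For a like-for-like view of those rates, the sketch below computes a blended price per 1M tokens under an assumed 3:1 input:output ratio. The ratio is purely illustrative (real workloads vary), and no caching discounts are applied.

```python
# Blended price per 1M tokens at an assumed 3:1 input:output ratio (illustrative only).
# Per-1M-token list prices (USD input, output) as quoted above.
prices = {
    "Grok 4":          (3.00, 15.00),
    "Claude 4 Sonnet": (3.00, 15.00),
    "Gemini 2.5 Pro":  (1.25, 10.00),  # <200K-input-token tier
    "o3":              (2.00, 8.00),
}

for model, (input_price, output_price) in prices.items():
    blended = (3 * input_price + 1 * output_price) / 4
    print(f"{model}: ${blended:.2f} per 1M tokens (blended)")

# Grok 4 / Claude 4 Sonnet: $6.00, Gemini 2.5 Pro: $3.44, o3: $3.50
```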
Key benchmarking results:
➤ Grok 4 leads in not only our Artificial Analysis Intelligence Index but also our Coding Index (LiveCodeBench & SciCode) and Math Index (AIME24 & MATH-500)
➤ All-time high score in GPQA Diamond of 88%, representing a leap from Gemini 2.5 Pro’s previous record of 84%
➤ All-time high score in Humanity’s Last Exam of 24%, beating Gemini 2.5 Pro’s previous all-time high score of 21%. Note that our benchmark suite uses the original HLE dataset (Jan '25) and runs the text-only subset with no tools
➤ Joint highest score for MMLU-Pro and AIME 2024 of 87% and 94% respectively
➤ Speed: 75 output tokens/s, slower than o3 (188 tokens/s), Gemini 2.5 Pro (142 tokens/s), Claude 4 Sonnet Thinking (85 tokens/s) but faster than Claude 4 Opus Thinking (66 tokens/s)
Other key information:
➤ 256k token context window. This is below Gemini 2.5 Pro’s context window of 1 million tokens, but ahead of Claude 4 Sonnet and Claude 4 Opus (200k tokens), o3 (200k tokens) and R1 0528 (128k tokens)
➤ Supports text and image input
➤ Supports function calling and structured outputs
See below for further analysis 👇
Grok 4 scores higher in Artificial Analysis Intelligence Index than any other model. Its pricing is higher than OpenAI’s o3, Google’s Gemini 2.5 Pro and Anthropic’s Claude 4 Sonnet - but lower than Anthropic’s Claude 4 Opus and OpenAI’s o3-pro.
Full set of intelligence benchmarks that we have run independently on xAI’s Grok 4 API:
Google is firing on all cylinders across AI - Gemini 2.5 Pro is equal #2 in intelligence, Veo 3 and Imagen 4 are amongst the leaders in media generation, and with TPUs they're the only vertically integrated player
🧠 Google is now equal #2 in the Artificial Analysis Intelligence Index with the recent release of the Gemini 2.5 Pro (June 2025) model, rivaling others including OpenAI, DeepSeek and xAI
📽️ Google Veo 3 now ranks second on the Artificial Analysis Video Arena Leaderboard, behind only ByteDance’s new Seedance 1.0 model
🖼️ Google Imagen 4 now occupies 2 out of the top 5 positions on the Artificial Analysis Image Arena Leaderboard
👨🏭 Google has a full-stack AI offering, spanning the application layer, models, cloud inference and hardware (TPUs)
Google has consistently been shipping intelligence increases in its Gemini Pro series
Google Veo 3 now occupies second place in the Artificial Analysis Video Arena, after originally debuting in first place. Still a significant leap over Google Veo 2!
DeepSeek’s R1 leaps over xAI, Meta and Anthropic to be tied as the world’s #2 AI Lab and the undisputed open-weights leader
DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index, our index of 7 leading evaluations that we run independently across all leading models. That’s the same magnitude of increase as the difference between OpenAI’s o1 and o3 (62 to 70).
This positions DeepSeek R1 as higher intelligence than xAI’s Grok 3 mini (high), NVIDIA’s Llama Nemotron Ultra, Meta’s Llama 4 Maverick and Alibaba’s Qwen3 235B, and equal to Google’s Gemini 2.5 Pro.
Breakdown of the model’s improvement:
🧠 Intelligence increases across the board: Biggest jumps seen in AIME 2024 (Competition Math, +21 points), LiveCodeBench (Code generation, +15 points), GPQA Diamond (Scientific Reasoning, +10 points) and Humanity’s Last Exam (Reasoning & Knowledge, +6 points)
🏠 No change to architecture: R1-0528 is a post-training update with no change to the V3/R1 architecture - it remains a large 671B model with 37B active parameters
🧑💻 Significant leap in coding skills: R1 is now matching Gemini 2.5 Pro in the Artificial Analysis Coding Index and is behind only o4-mini (high) and o3
🗯️ Increased token usage: R1-0528 used 99 million tokens to complete the evals in the Artificial Analysis Intelligence Index, 40% more than the original R1’s 71 million tokens - i.e. the new R1 thinks for longer than the original R1. This is still not the highest token usage number we have seen: Gemini 2.5 Pro is using 30% more tokens than R1-0528
Takeaways for AI:
👐 The gap between open and closed models is smaller than ever: open weights models have continued to make intelligence gains in line with proprietary models. DeepSeek’s R1 release in January was the first time an open weights model achieved the #2 position, and DeepSeek’s R1 update today brings it back to that position
🇨🇳 China remains neck and neck with the US: models from China-based AI labs have all but completely caught up to their US counterparts, and this release continues that trend. As of today, DeepSeek leads US-based AI labs including Anthropic and Meta in the Artificial Analysis Intelligence Index
🔄 Improvements driven by reinforcement learning: DeepSeek has shown substantial intelligence improvements with the same architecture and pre-train as their original DeepSeek R1 release. This highlights the continually increasing importance of post-training, particularly for reasoning models trained with reinforcement learning (RL) techniques. OpenAI disclosed a 10x scaling of RL compute between o1 and o3 - DeepSeek have just demonstrated that so far, they can keep up with OpenAI’s RL compute scaling. Scaling RL demands less compute than scaling pre-training and offers an efficient way of achieving intelligence gains, supporting AI Labs with fewer GPUs
See further analysis below 👇
DeepSeek has maintained its status as amongst AI labs leading in frontier AI intelligence
Today’s DeepSeek R1 update is substantially more verbose in its responses (including reasoning tokens) than the January release. DeepSeek R1 May used 99M tokens to run the 7 evaluations in our Intelligence Index, 40% more than the prior release