We’re releasing the Artificial Analysis AI Adoption Survey Report for H1 2025, based on >1,000 responses from developers, product managers and executives adopting AI.
The Artificial Analysis AI Adoption Survey Report examines key trends in AI usage, analyzing adoption rates, primary use cases driving AI’s growth, and demand across chatbots, coding agents, LLM families, providers, and chip companies.
A highlights version of the report is available for download on our website for a limited time.
We unpack 6 trends defining the adoption of AI for organizations in the first half of 2025:
1) ⚡ AI has hit production: ~45% are using AI in production, while an additional 50% are prototyping or exploring AI use cases
2) 💡 Engineering and R&D is the clear frontrunner use case: 66% are considering AI for Engineering/R&D, well ahead of the next most popular use cases, Customer Support and Sales & Marketing
3) 📈 Google, xAI, DeepSeek gain share while Meta and Mistral lose share: ~80% are using/considering Google Gemini, 53% DeepSeek & 31% xAI Grok, marking a substantial increase in demand since 2024
4) 🔄 Companies are increasingly diversifying their AI use: Average number of LLMs used/considered has increased from ~2.8 in 2024 to ~4.7 in 2025, as organizations mature their AI use cases
5) 🏗️ Organizations are taking different approaches to Build vs. Buy: 32% of respondents favor building, 27% buying, and 25% a hybrid approach
6) 🇨🇳 Organizations are open to Chinese models, if hosted outside of China: 55% would be willing to use LLMs from China-based AI labs, if hosted outside of China
The survey was conducted between April and June 2025, collecting responses from 1,000+ individuals across 90+ countries.
Below we share excerpts covering select important takeaways:
ChatGPT dominates AI chat adoption, followed by Gemini and Claude. Other notable players include Perplexity, xAI Grok and Microsoft Copilot
GitHub Copilot and Cursor dominate the market as the most popular AI coding tools, ahead of Claude Code and Gemini Code Assist (Note: the survey was conducted before the release of OpenAI Codex)
Most orgs will deploy AI in Engineering & R&D before Customer Support, Sales & Marketing, or IT & Cybersecurity; it’s also the top area for AI agent use
Developers consider an average of 4.7 LLM families, with OpenAI GPT/o, Google Gemini and Anthropic Claude the most popular, and DeepSeek as the top open-weights choice
The highlights version of the report is available for download on our website for a limited time.
xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model.
We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 and Google Gemini 2.5 Pro at 70, DeepSeek R1 0528 at 68, and Anthropic Claude 4 Opus at 64. Full results breakdown below.
This is the first time that @elonmusk's @xai has taken the lead at the AI frontier. Grok 3 scored competitively with the latest models from OpenAI, Anthropic and Google - but Grok 4 is the first time that our Intelligence Index has shown xAI in first place.
We tested Grok 4 via the xAI API. The version of Grok 4 deployed for use on X/Twitter may be different to the model available via API. Consumer application versions of LLMs typically have instructions and logic around the models that can change style and behavior.
Grok 4 is a reasoning model, meaning it ‘thinks’ before answering. The xAI API does not share reasoning tokens generated by the model.
Grok 4’s pricing is equivalent to Grok 3 at $3/$15 per 1M input/output tokens ($0.75 per 1M cached input tokens). The per-token pricing is identical to Claude 4 Sonnet, but more expensive than Gemini 2.5 Pro ($1.25/$10, for <200K input tokens) and o3 ($2/$8, after recent price decrease). We expect Grok 4 to be available via the xAI API, via the Grok chatbot on X, and potentially via Microsoft Azure AI Foundry (Grok 3 and Grok 3 mini are currently available on Azure).
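For a rough sense of how these per-token prices translate into per-request cost, here is a minimal sketch using the list prices quoted above; the request size is an arbitrary illustration, not a benchmark measurement.

```python
# Rough per-request cost comparison at the per-1M-token list prices quoted above.
# The request size (10k input / 2k output tokens) is purely illustrative.
PRICES_PER_1M = {
    "Grok 4":          (3.00, 15.00),
    "Claude 4 Sonnet": (3.00, 15.00),
    "Gemini 2.5 Pro":  (1.25, 10.00),  # <200K-input-token pricing tier
    "o3":              (2.00, 8.00),   # after the recent price decrease
}

def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in USD at list prices (no caching discount)."""
    input_price, output_price = PRICES_PER_1M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

for name in PRICES_PER_1M:
    print(f"{name}: ${request_cost_usd(name, 10_000, 2_000):.3f}")
```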
Key benchmarking results:
➤ Grok 4 leads not only in our Artificial Analysis Intelligence Index but also in our Coding Index (LiveCodeBench & SciCode) and Math Index (AIME24 & MATH-500)
➤ All-time high score in GPQA Diamond of 88%, representing a leap from Gemini 2.5 Pro’s previous record of 84%
➤ All-time high score in Humanity’s Last Exam of 24%, beating Gemini 2.5 Pro’s previous all-time high score of 21%. Note that our benchmark suite uses the original HLE dataset (Jan '25) and runs the text-only subset with no tools
➤ Joint-highest scores on MMLU-Pro and AIME 2024, at 87% and 94% respectively
➤ Speed: 75 output tokens/s, slower than o3 (188 tokens/s), Gemini 2.5 Pro (142 tokens/s), Claude 4 Sonnet Thinking (85 tokens/s) but faster than Claude 4 Opus Thinking (66 tokens/s)
Other key information:
➤ 256k token context window. This is below Gemini 2.5 Pro’s context window of 1 million tokens, but ahead of Claude 4 Sonnet and Claude 4 Opus (200k tokens), o3 (200k tokens) and R1 0528 (128k tokens)
➤ Supports text and image input
➤ Supports function calling and structured outputs (see the minimal API sketch below)
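As an illustration of what calling Grok 4 looks like in practice, here is a minimal sketch using xAI's OpenAI-compatible chat completions API; the base URL and the "grok-4" model id are assumptions to verify against xAI's documentation.

```python
# Minimal sketch of a Grok 4 request via xAI's OpenAI-compatible API.
# The base_url and model id below are assumptions - confirm against xAI's docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_XAI_API_KEY",       # placeholder key
    base_url="https://api.x.ai/v1",   # xAI's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="grok-4",                   # assumed model id
    messages=[
        {"role": "user", "content": "Give three trade-offs of reasoning models."},
    ],
)

# Only the final answer is returned; the reasoning tokens the model generates
# before answering are not exposed in the response.
print(response.choices[0].message.content)
```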
See below for further analysis 👇
Grok 4 scores higher in Artificial Analysis Intelligence Index than any other model. Its pricing is higher than OpenAI’s o3, Google’s Gemini 2.5 Pro and Anthropic’s Claude 4 Sonnet - but lower than Anthropic’s Claude 4 Opus and OpenAI’s o3-pro.
Full set of intelligence benchmarks that we have run independently on xAI’s Grok 4 API:
Google is firing on all cylinders across AI - Gemini 2.5 Pro is equal #2 in intelligence, Veo 3 and Imagen 4 are amongst the leaders in media generation, and with TPUs they're the only vertically integrated player
🧠 Google is now equal #2 in the Artificial Analysis Intelligence Index with the recent release of the Gemini 2.5 Pro (June 2025) model, rivaling others including OpenAI, DeepSeek and Grok
📽️ Google Veo 3 now ranks second in the Artificial Analysis Video Arena Leaderboard only behind ByteDance’s new Seedance 1.0 model
🖼️ Google Imagen 4 now occupies 2 out of the top 5 positions on the Artificial Analysis Image Arena Leaderboard
👨‍🏭 Google has a full-stack AI offering, spanning the application layer, models, cloud inference and hardware (TPUs)
Google has consistently been shipping intelligence increases in its Gemini Pro series
Google Veo 3 now occupies second place in the Artificial Analysis Video Arena, after originally debuting in first place. Still a significant leap over Google Veo 2!
DeepSeek’s R1 leaps over xAI, Meta and Anthropic to be tied as the world’s #2 AI Lab and the undisputed open-weights leader
DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index, our index of 7 leading evaluations that we run independently across all leading models. That’s the same magnitude of increase as the difference between OpenAI’s o1 and o3 (62 to 70).
This positions DeepSeek R1 as higher intelligence than xAI’s Grok 3 mini (high), NVIDIA’s Llama Nemotron Ultra, Meta’s Llama 4 Maverick and Alibaba’s Qwen 3 235B, and equal to Google’s Gemini 2.5 Pro.
Breakdown of the model’s improvement:
🧠 Intelligence increases across the board: Biggest jumps seen in AIME 2024 (Competition Math, +21 points), LiveCodeBench (Code generation, +15 points), GPQA Diamond (Scientific Reasoning, +10 points) and Humanity’s Last Exam (Reasoning & Knowledge, +6 points)
🏠 No change to architecture: R1-0528 is a post-training update with no change to the V3/R1 architecture - it remains a large 671B model with 37B active parameters
🧑💻 Significant leap in coding skills: R1 is now matching Gemini 2.5 Pro in the Artificial Analysis Coding Index and is behind only o4-mini (high) and o3
🗯️ Increased token usage: R1-0528 used 99 million tokens to complete the evals in the Artificial Analysis Intelligence Index, 40% more than the original R1’s 71 million tokens - i.e. the new R1 thinks for longer than the original R1. This is still not the highest token usage we have seen: Gemini 2.5 Pro is using 30% more tokens than R1-0528
Takeaways for AI:
👐 The gap between open and closed models is smaller than ever: open weights models have continued to match the intelligence gains of proprietary models. DeepSeek’s R1 release in January was the first time an open-weights model achieved the #2 position, and DeepSeek’s R1 update today brings it back to the same position
🇨🇳 China remains neck and neck with the US: models from China-based AI Labs have all but completely caught up to their US counterparts, and this release continues that emerging trend. As of today, DeepSeek leads US-based AI labs including Anthropic and Meta in the Artificial Analysis Intelligence Index
🔄 Improvements driven by reinforcement learning: DeepSeek has shown substantial intelligence improvements with the same architecture and pre-train as their original DeepSeek R1 release. This highlights the continually increasing importance of post-training, particularly for reasoning models trained with reinforcement learning (RL) techniques. OpenAI disclosed a 10x scaling of RL compute between o1 and o3 - DeepSeek have just demonstrated that, so far, they can keep up with OpenAI’s RL compute scaling. Scaling RL demands less compute than scaling pre-training and offers an efficient way of achieving intelligence gains, supporting AI Labs with fewer GPUs
See further analysis below 👇
DeepSeek has maintained its status as amongst AI labs leading in frontier AI intelligence
Today’s DeepSeek R1 update is substantially more verbose in its responses (including reasoning tokens) than the January release. DeepSeek R1 May used 99M tokens to run the 7 evaluations in our Intelligence Index, 40% more than the prior release
Google’s Gemini 2.5 Flash costs 150x more than Gemini 2.0 Flash to run the Artificial Analysis Intelligence Index
The increase is driven by:
➤ 9x more expensive output tokens - $3.5 per million with reasoning on ($0.6 with reasoning off) vs $0.4 for Gemini 2.0 Flash
➤ 17x higher token usage across our evals due to adding reasoning - the greatest volume of tokens used in reasoning that we have observed for any model to date
This doesn’t mean Gemini 2.5 Flash is not a compelling value proposition - its 12-point bump in the Artificial Analysis Intelligence Index makes it suitable for a range of use cases where Gemini 2.0 Flash may not perform sufficiently well. With per-token pricing still slightly below OpenAI’s o4-mini, Gemini 2.5 Flash may still be a cost-effective option for certain use cases.
It does mean that Gemini 2.5 Flash with Reasoning may not be a clear upgrade for everyone - for many use cases, developers may want to stay with 2.0 Flash or use 2.5 Flash with reasoning off.
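A back-of-envelope reconstruction of the ~150x headline figure from the two drivers above (the prices and token multiples are those quoted in this post, not a new measurement):

```python
# ~9x pricier output tokens multiplied by ~17x more tokens generated
# reproduces the ~150x increase in cost to run the Intelligence Index.
price_ratio = 3.5 / 0.4   # $3.5 vs $0.4 per 1M output tokens -> ~8.75x
token_ratio = 17          # ~17x more tokens used across our evals with reasoning on

print(f"~{price_ratio * token_ratio:.0f}x")  # ~149x, i.e. roughly 150x
```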
Breakdown of token usage, pricing and end-to-end latency.
Today’s GPT-4o update is actually big - it leapfrogs Claude 3.7 Sonnet (non-reasoning) and Gemini 2.0 Flash in our Intelligence Index and is now the leading non-reasoning model for coding
This makes GPT-4o the second highest scoring non-reasoning model (excluding reasoning models such as o3-mini and Gemini 2.5 Pro), coming in just behind DeepSeek’s V3 0324 release earlier this week.
Key benchmarking results:
➤ Significant jump in the Artificial Analysis Intelligence Index from 41 to 50, putting GPT-4o (March 2025) ahead of Claude 3.7 Sonnet
➤ Now the leading non-reasoning model for coding: 🥇#1 in the Artificial Analysis Coding Index and in LiveCodeBench, surpassing DeepSeek V3 (March 2025) and Claude 3.7 Sonnet
@OpenAI has committed an all-new AI model naming sin of simply refusing to name the model at all, so we will be referring to it as GPT-4o (March 2025).
This update has also been released in a fairly confusing way - the March 2025 version of GPT-4o is currently available:
➤ In ChatGPT, when users select GPT-4o in the model selector
➤ Via API on the chatgpt-4o-latest endpoint - a non-dated endpoint that OpenAI described at launch as intended for research use only, with developers encouraged to use the dated snapshot versions of GPT-4o for most API use cases
As of today, this means that the chatgpt-4o-latest endpoint is serving a significantly better model than the proper API versions of GPT-4o (i.e. the August 2024 and November 2024 snapshots).
We recommend some caution for developers considering moving workloads to the chatgpt-4o-latest endpoint given OpenAI’s previous guidance, and note that OpenAI will likely release a dated API snapshot soon. We also note that OpenAI prices the chatgpt-4o-latest endpoint at $5/$15 per million input/output tokens, whereas the API snapshots are priced at $2.5/$10.
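For developers weighing this up, the difference comes down to which model id you pass. A minimal sketch with the OpenAI Python SDK is below; the dated snapshot id shown is the existing November 2024 version, since no dated id for the March 2025 update had been published at the time of writing.

```python
# Sketch: pinning a dated GPT-4o snapshot vs calling the non-dated
# chatgpt-4o-latest endpoint. Prices per 1M input/output tokens in comments.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = [{"role": "user", "content": "Write a binary search in Python."}]

# Dated snapshot - stable behaviour, $2.5/$10 pricing, recommended for production.
pinned = client.chat.completions.create(model="gpt-4o-2024-11-20", messages=prompt)

# Non-dated endpoint - serves the latest ChatGPT model (currently the March 2025
# update), $5/$15 pricing, positioned by OpenAI for research/evaluation use.
latest = client.chat.completions.create(model="chatgpt-4o-latest", messages=prompt)

print(pinned.choices[0].message.content)
print(latest.choices[0].message.content)
```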
See below for further analysis 👇
GPT-4o (March 2025) is now the leading non-reasoning coding model, surpassing DeepSeek V3 (March 2025) and Claude 3.7 Sonnet in the Artificial Analysis Coding Index (made up of LiveCodeBench and SciCode) and is #1 in LiveCodeBench
GPT-4o (March 2025) still lags behind reasoning models, though these can be considered separately given their higher latency and typically higher cost
DeepSeek takes the lead: DeepSeek V3-0324 is now the highest scoring non-reasoning model
This is the first time an open weights model is the leading non-reasoning model, a milestone for open source.
DeepSeek V3-0324 has jumped forward 7 points in Artificial Analysis Intelligence Index, now sitting ahead of all other non-reasoning models. It sits behind DeepSeek’s own R1 in Intelligence Index, as well as other reasoning models from OpenAI, Anthropic and Alibaba, but this does not take away from the impressiveness of this accomplishment. Non-reasoning models answer immediately without taking time to ‘think’, making them useful in latency-sensitive use cases.
Three months ago, DeepSeek released V3 and we wrote that there was a new leader in open source AI - noting that V3 came close to leading proprietary models from Anthropic and Google but did not surpass them.
Today, DeepSeek are not just releasing the best open source model - DeepSeek are now driving the frontier of non-reasoning open weights models, eclipsing all proprietary non-reasoning models, including Gemini 2.0 Pro and Claude 3.7 Sonnet, as well as open-weights Llama 3.3 70B. This release is arguably even more impressive than R1 - and potentially indicates that R2 is going to be another significant leap forward.
Most other details are identical to the December 2024 version of DeepSeek V3, including:
➤ Context window: 128k (limited to 64k on DeepSeek’s first-party API)
➤ Total parameters: 671B (requires >700GB of GPU memory to run in native FP8 precision - still not something you can run at home! See the rough arithmetic below)
➤ Active parameters: 37B
➤ Native FP8 precision
➤ Text only - no multimodal inputs or outputs
➤ MIT License
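The ">700GB" figure follows from simple arithmetic on the parameter count; here is a rough sketch (the serving overhead described in the comments is an assumption for illustration, not a measurement):

```python
# FP8 stores one byte per parameter, so the weights alone are ~671 GB;
# serving overhead pushes a real deployment past 700 GB of GPU memory.
total_params = 671e9                  # total parameters
weights_gb = total_params * 1 / 1e9   # 1 byte per parameter in FP8 -> ~671 GB

print(f"Weights alone: ~{weights_gb:.0f} GB")
# Any realistic serving overhead (KV cache at 64k-128k context, activations,
# framework buffers) comfortably takes this past 700 GB - far beyond a home GPU.
```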
DeepSeek V3-0324 marks the first time an open weights model has been the leading non-reasoning model.
Compared to leading reasoning models, including DeepSeek’s own R1, DeepSeek V3-0324 remains behind - but for many uses, the increased latency associated with letting reasoning models ‘think’ before answering makes them unusable.