Artificial Analysis
Independent analysis of AI models and hosting providers - choose the best model and API provider for your use-case
Nov 25
Anthropic’s new Claude Opus 4.5 is the #2 most intelligent model in the Artificial Analysis Intelligence Index, narrowly behind Google’s Gemini 3 Pro and tying OpenAI’s GPT-5.1 (high)

Claude Opus 4.5 delivers a substantial intelligence uplift over Claude Sonnet 4.5 (+7 points on the Artificial Analysis Intelligence Index) and Claude Opus 4.1 (+11 points), establishing it as @AnthropicAI's new leading model. Anthropic has dramatically cut per-token pricing for Claude Opus 4.5 to $5/$25 per million input/output tokens. However, compared to the prior Claude Opus 4.1, it used 60% more tokens to complete our Intelligence Index evaluations (48M vs. 30M). The result is a substantial reduction in the cost to run our Intelligence Index evaluations, from $3.1k to $1.5k, but not as large as the headline price cut implies. Even with the price cut, Claude Opus 4.5 still cost significantly more to run than other models including Gemini 3 Pro (high), GPT-5.1 (high), and Claude Sonnet 4.5 (Thinking); among the models we benchmark, only Grok 4 (Reasoning) cost more.
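To see why a roughly 3x per-token price cut translates into only a ~2x reduction in the cost to run the Index, the back-of-the-envelope below combines the price change with the 60% increase in token usage. It is a simplified sketch using only the output-token figures quoted above; the reported Index costs also include input tokens, so the exact ratio differs slightly.

```python
# Rough illustration: price cut vs. extra token usage for Claude Opus 4.5.
# Output-token prices (USD per 1M tokens) and Intelligence Index token usage
# are taken from the post; input-token costs are ignored for simplicity.
opus_41_price_per_m = 75    # $ per 1M output tokens, Claude Opus 4.1
opus_45_price_per_m = 25    # $ per 1M output tokens, Claude Opus 4.5
opus_41_tokens_m = 30       # ~30M output tokens to run the Index
opus_45_tokens_m = 48       # ~48M output tokens (+60%)

cost_41 = opus_41_price_per_m * opus_41_tokens_m   # ~$2,250 (output-only estimate)
cost_45 = opus_45_price_per_m * opus_45_tokens_m   # ~$1,200 (output-only estimate)

print(f"Price ratio: {opus_45_price_per_m / opus_41_price_per_m:.2f}x")  # 0.33x
print(f"Token ratio: {opus_45_tokens_m / opus_41_tokens_m:.2f}x")        # 1.60x
print(f"Cost ratio:  {cost_45 / cost_41:.2f}x")                          # ~0.53x, i.e. ~2x cheaper, not 3x
```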

Key benchmarking takeaways:

➤ 🧠 Anthropic’s most intelligent model: In reasoning mode, Claude Opus 4.5 scores 70 on the Artificial Analysis Intelligence Index. This is a jump of +7 points from Claude Sonnet 4.5 (Thinking), which was released in September 2025, and +11 points from Claude Opus 4.1 (Thinking). Claude Opus 4.5 is now the second most intelligent model. It places ahead of Grok 4 (65) and Kimi K2 Thinking (67), ties GPT-5.1 (high, 70), and trails only Gemini 3 Pro (73). Claude Opus 4.5 (Thinking) scores 5% on CritPt, a frontier physics eval reflective of research assistant capabilities. It sits only behind Gemini 3 Pro (9%) and ties GPT-5.1 (high, 5%)

➤ 📈 Largest increases in coding and agentic tasks: Compared to Claude Sonnet 4.5 (Thinking), the biggest uplifts appear across coding, agentic tasks, and long-context reasoning, including LiveCodeBench (+16 p.p.), Terminal-Bench Hard (+11 p.p.), 𝜏²-Bench Telecom (+12 p.p.), AA-LCR (+8 p.p.), and Humanity's Last Exam (+11 p.p.). Claude Opus 4.5 achieves Anthropic's best scores yet across all 10 benchmarks in the Artificial Analysis Intelligence Index. It also earns the highest Terminal-Bench Hard score (44%) of any model and ties Gemini 3 Pro on MMLU-Pro (90%)

➤ 📚 Knowledge and Hallucination: In our recently launched AA-Omniscience Index, which measures the embedded knowledge and hallucination of language models, Claude Opus 4.5 places 2nd with a score of 10. It sits behind only Gemini 3 Pro Preview (13) and ahead of Claude Opus 4.1 (Thinking, 5) and GPT-5.1 (high, 2). Claude Opus 4.5 (Thinking) scores the second-highest accuracy (43%) and has the 4th-lowest hallucination rate (58%), trailing only Claude Haiku 4.5 (Thinking, 26%), Claude Sonnet 4.5 (Thinking, 48%), and GPT-5.1 (high). Claude Opus 4.5 continues to demonstrate Anthropic's leadership in AI safety with a lower hallucination rate than other frontier models such as Grok 4 and Gemini 3 Pro

➤ ⚡ Non-reasoning performance: In non-reasoning mode, Claude Opus 4.5 scores 60 on the Artificial Analysis Intelligence Index and is the most intelligent non-reasoning model. It places ahead of Qwen3 Max (55), Kimi K2 0905 (50), and Claude Sonnet 4.5 (50)

➤ ⚙️ Token efficiency: Anthropic continues to demonstrate impressive token efficiency. Claude Opus 4.5 improves intelligence without a significant increase in token usage compared to Claude Sonnet 4.5 (both evaluated with a maximum reasoning budget of 64k tokens). Claude Opus 4.5 uses 48M output tokens to run the Artificial Analysis Intelligence Index, fewer than other frontier models such as Gemini 3 Pro (high, 92M), GPT-5.1 (high, 81M), and Grok 4 (Reasoning, 120M)

➤ 💲 Pricing: Anthropic has reduced the per-token pricing of Claude Opus 4.5 compared to Claude Opus 4.1. Claude Opus 4.5 is priced at $5/$25 per 1M input/output tokens (vs. $15/$75 for Claude Opus 4.1). This positions it much closer to Claude Sonnet 4.5 ($3/$15 per 1M tokens) while offering higher intelligence in thinking mode

Key model details:

➤ 📏 Context window: 200K tokens

➤ 🪙 Max output tokens: 64K tokens

➤ 🌐 Availability: Claude Opus 4.5 is available via Anthropic's API, Google Vertex AI, Amazon Bedrock and Microsoft Azure. It is also available via the Claude app and Claude Code.

A key differentiator for the Claude models remains that they are substantially more token-efficient than all other reasoning models. Claude Opus 4.5 has significantly increased intelligence without a large increase in output tokens, differing substantially from other model families that rely on greater reasoning at inference time (i.e., more output tokens). On our chart of Output Tokens Used in the Artificial Analysis Intelligence Index vs. Intelligence Index, Claude Opus 4.5 (Thinking) sits on the Pareto frontier.
Nov 18
Gemini 3 Pro is the new leader in AI. Google has the leading language model for the first time, with Gemini 3 Pro debuting +3 points above GPT-5.1 in our Artificial Analysis Intelligence Index

@GoogleDeepMind gave us pre-release access to Gemini 3 Pro Preview. The model outperforms all other models in the Artificial Analysis Intelligence Index. It demonstrates strength across the board, coming in first in 5 of the 10 evaluations that make up the Intelligence Index. Despite these intelligence gains, Gemini 3 Pro Preview shows improved token efficiency over Gemini 2.5 Pro, using significantly fewer tokens on the Intelligence Index than other leading models such as Kimi K2 Thinking and Grok 4. However, given its premium pricing ($2/$12 per million input/output tokens for <200K context), Gemini 3 Pro is among the most expensive models to run through our Intelligence Index evaluations.

Key takeaways:

📖 Leading intelligence: Gemini 3 Pro Preview is the leading model in 5 of 10 evals in the Artificial Analysis Intelligence Index, including GPQA Diamond, MMLU-Pro, HLE, LiveCodeBench and SciCode. Its score of 37% on Humanity’s Last Exam is particularly impressive, improving on the previous best model by more than 10 percentage points. It also leads AA-Omniscience, Artificial Analysis’ new knowledge and hallucination evaluation, coming first in both Omniscience Index (our lead metric, which deducts points for incorrect answers) and Omniscience Accuracy (percentage correct). Given that factual recall correlates closely with model size, this may point to Gemini 3 Pro being a much larger model than its competitors

💻 Advanced coding and agentic capabilities: Gemini 3 Pro Preview leads two of the three coding evaluations in the Artificial Analysis Intelligence Index, including an impressive 56% in SciCode, an improvement of over 10 percentage points from the previous highest score. It is also strong in agentic contexts, achieving the second highest score in Terminal-Bench Hard and Tau2-Bench Telecom

🖼️ Multimodal capabilities: Gemini 3 Pro Preview is a multimodal model, with the ability to take text, images, video and audio as input. It scores the highest of any model on MMMU-Pro, a benchmark that tests reasoning abilities with image inputs. Google now occupies the first, third and fourth positions on our MMMU-Pro leaderboard (with GPT-5.1 taking second place just last week)

💲 Premium pricing: To measure cost, we report Cost to Run the Artificial Analysis Intelligence Index, which combines input and output token prices with token efficiency to reflect true usage cost. Despite the improvement in token efficiency over Gemini 2.5 Pro, Gemini 3 Pro Preview costs more to run. Its higher token pricing of $2/$12 USD per million input/output tokens (≤200k token context) results in a 12% increase in the cost to run the Artificial Analysis Intelligence Index compared to its predecessor, and the model is among the most expensive to run on our Intelligence Index. Google also continues to price long-context workloads higher than shorter-context workloads, charging $4/$18 per million input/output tokens for >200k token context.

⚡ Speed: Gemini 3 Pro Preview has comparable speeds to Gemini 2.5 Pro, with 128 output tokens per second. This places it ahead of other frontier models including GPT-5.1 (high), Kimi K2 Thinking and Grok 4. This is potentially supported by Google’s first-party TPU accelerators

Other details: Gemini 3 Pro Preview has a 1 million token context window, and includes support for tool calling, structured outputs, and JSON mode

See below for further analysis.

For the first time, Google has the most intelligent model, with Gemini 3 Pro Preview improving on the previous most intelligent model, OpenAI’s GPT-5.1 (high), by 3 points.
Nov 17
Announcing AA-Omniscience, our new benchmark for knowledge and hallucination across >40 topics, where all but three models are more likely to hallucinate than give a correct answer

Embedded knowledge in language models is important for many real world use cases. Without knowledge, models make incorrect assumptions and are limited in their ability to operate in real world contexts. Tools like web search can help, but models need to know what to search for (e.g. models should not search for ‘Multi Client Persistence’ for an MCP query when it clearly refers to ‘Model Context Protocol’).

Hallucination of factual information is a barrier to being able to rely on models and has been perpetuated by every major evaluation dataset. Grading correct answers with no penalty for incorrect answers creates an incentive for models (and the labs training them) to attempt every question. This problem is clearest when it comes to knowledge: factual information should never be made up, while in other contexts attempts that might not work are useful (e.g. coding new features).

Omniscience Index is the key metric we report for AA-Omniscience; it punishes hallucinations by deducting points when models guess rather than admit they do not know the answer. AA-Omniscience shows that all but three models are more likely to hallucinate than provide a correct answer when given a difficult question. AA-Omniscience will complement the Artificial Analysis Intelligence Index, adding measurement of embedded knowledge and the probability of hallucination.

Details below, and more charts in the thread.

AA-Omniscience details:

- 🔢6,000 questions across 42 topics within 6 domains (’Business’, ‘Humanities & Social Sciences’, ‘Health’, ‘Law’, ‘Software Engineering’, and ‘Science, Engineering & Mathematics’)
- 🔍 89 sub-topics including Python data libraries, Public Policy, Taxation, and more, giving a sharper view of where models excel and where they fall short across nuanced domains
- 🔄 Incorrect answers are penalized in our Knowledge Reliability Index metrics to punish hallucinations
- 📊 3 Metrics: Accuracy (% correct), Hallucination Rate (incorrect answers as a share of all questions the model got wrong or abstained on, i.e. incorrect / (incorrect + abstentions)), and Omniscience Index (+1 for a correct answer, -1 for an incorrect answer, 0 where the model abstained) - see the sketch after this list
- 🤗 Open source test dataset: We’re open sourcing 600 questions (10%) to help labs develop factual and reliable models. Topic distribution and model performance follow the full set (@huggingface link below)
- 📃 Paper: See below for a link to the research paper
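As referenced in the metrics item above, here is a minimal sketch of how the three AA-Omniscience metrics can be computed from per-question outcomes. The function name and the three-way outcome labels are illustrative assumptions, not Artificial Analysis's actual evaluation code; the scoring rule itself (+1 / -1 / 0, and incorrect ÷ (incorrect + abstentions)) follows the definitions above.

```python
# Illustrative scoring for AA-Omniscience-style metrics.
# Each outcome is one of "correct", "incorrect", or "abstain" (hypothetical labels).

def omniscience_metrics(outcomes: list[str]) -> dict[str, float]:
    n = len(outcomes)
    correct = outcomes.count("correct")
    incorrect = outcomes.count("incorrect")
    abstain = outcomes.count("abstain")

    accuracy = correct / n  # share of all questions answered correctly
    # Hallucination rate: of the questions the model did not get right,
    # how often it guessed wrong instead of abstaining.
    hallucination_rate = incorrect / (incorrect + abstain) if (incorrect + abstain) else 0.0
    # Omniscience Index: +1 per correct, -1 per incorrect, 0 per abstention,
    # normalised by the number of questions (the published scaling is an assumption).
    omniscience_index = (correct - incorrect) / n

    return {
        "accuracy": accuracy,
        "hallucination_rate": hallucination_rate,
        "omniscience_index": omniscience_index,
    }

# Example: 40% correct, 45% incorrect, 15% abstentions on 100 questions.
outcomes = ["correct"] * 40 + ["incorrect"] * 45 + ["abstain"] * 15
print(omniscience_metrics(outcomes))
# {'accuracy': 0.4, 'hallucination_rate': 0.75, 'omniscience_index': -0.05}
```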

Key findings:

- 🥇 Claude 4.1 Opus takes first place in Omniscience Index, followed by last week’s GPT-5.1 and Grok 4: Even the best frontier models score only slightly above 0, meaning they produce correct answers on the difficult questions that make up AA-Omniscience only marginally more often than incorrect ones. @AnthropicAI’s leadership is driven by low hallucination rate, whereas OpenAI and xAI’s positions are primarily driven by higher accuracy (percentage correct).

- 🥇 xAI’s Grok 4 takes first place in Omniscience Accuracy (our simple ‘percentage correct’ metric), followed by GPT-5 and Gemini 2.5 Pro: @xai's win may be enabled by scaling total parameters and pre-training compute: @elonmusk revealed last week that Grok 4 has 3 trillion total parameters, which may be larger than GPT-5 and other proprietary models

- 🥇 Claude sweeps the hallucination leaderboard: Anthropic takes the top three spots for lowest hallucination rate, with Claude 4.5 Haiku leading at 28%, less than a third of the rate of GPT-5 (high) and Gemini 2.5 Pro. Claude 4.5 Sonnet and Claude 4.1 Opus follow in second and third at 48%

- 💭 High knowledge does not guarantee low hallucination: Hallucination rate measures how often a model guesses when it lacks the required knowledge. Models with the highest accuracy, including the GPT-5 models and Gemini 2.5 Pro, do not lead the Omniscience Index due to their tendency to guess rather than abstain. Anthropic models tend to manage uncertainty better, with Claude 4.5 Haiku achieving the lowest hallucination rate at 26%, ahead of Claude 4.5 Sonnet and Claude 4.1 Opus (48%)

- 📊 Models vary by domain: Models differ in their performance across the six domains of AA-Omniscience - no model dominates across all. While Anthropic’s Claude 4.1 Opus leads in Law, Software Engineering, and Humanities & Social Sciences, GPT-5.1 from @OpenAI achieves the highest reliability on Business questions, and xAI’s Grok 4 performs best in Health and in Science, Engineering & Mathematics. Model choice should align with the use case rather than defaulting to the overall leader

- 📈 Larger models score higher on accuracy, but not always reliability: Larger models tend to have higher levels of embedded knowledge, with Kimi K2 Thinking and DeepSeek R1 (0528) topping accuracy charts over smaller models. This advantage does not always hold on the Omniscience Index. For example, Llama 3.1 405B from @AIatMeta beats larger Kimi K2 variants due to having one of the lowest hallucination rates among models (51%).

Grok 4 by @xai, GPT-5 by @OpenAI and Gemini 2.5 Pro by @GoogleDeepMind achieve the highest accuracy in AA-Omniscience. They do not, however, achieve the highest Omniscience Index, owing to the lower hallucination rates of @AnthropicAI’s Claude models.
Nov 6
Inworld TTS 1 Max is the new leader on the Artificial Analysis Speech Arena Leaderboard, surpassing MiniMax’s Speech-02 series and OpenAI’s TTS-1 series

The Artificial Analysis Speech Arena ranks leading Text to Speech models based on human preferences. In the arena, users compare two pieces of generated speech side by side and select their preferred output without knowing which models created them. The arena includes prompts across four real-world categories: Customer Service, Knowledge Sharing, Digital Assistants, and Entertainment.
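This thread does not describe how the pairwise votes are turned into a leaderboard; preference arenas of this kind are commonly scored with an Elo-style (or Bradley-Terry) rating system, and the sketch below shows that general idea only, not the Speech Arena's actual implementation. The model names and K-factor are illustrative.

```python
# Generic Elo-style update for pairwise preference votes (illustrative only).
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - exp_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - exp_a))
    return rating_a, rating_b

# Hypothetical example: a listener prefers "tts_x" over "tts_y".
ratings = {"tts_x": 1000.0, "tts_y": 1000.0}
ratings["tts_x"], ratings["tts_y"] = update(ratings["tts_x"], ratings["tts_y"], a_won=True)
print(ratings)  # tts_x moves above 1000, tts_y below
```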

Inworld TTS 1 Max and Inworld TTS 1 both support 12 languages including English, Spanish, French, Korean, and Chinese, and voice cloning from 2-15 seconds of audio. Inworld TTS 1 processes ~153 characters per second of generation time on average, with the larger model, Inworld TTS 1 Max, processing ~69 characters per second on average. Both models also support voice tags, allowing users to add emotion, delivery style, and non-verbal sounds, such as “whispering”, “cough”, and “surprised”.

Both TTS-1 and TTS-1-Max are transformer-based, autoregressive models employing LLaMA-3.2-1B and LLaMA-3.1-8B respectively as their SpeechLM backbones.

See the leading models in the Speech Arena, and listen to sample clips below 🎧

Sample prompt on Inworld TTS 1 Max: “Your gut microbiome contains trillions of bacteria that influence digestion, immunity, and even mental health through the gut-brain axis.”
Oct 2
IBM has launched Granite 4.0 - a new family of open weights language models ranging in size from 3B to 32B. Artificial Analysis was provided pre-release access, and our benchmarking shows Granite 4.0 H Small (32B/9B total/active parameters) scoring an Intelligence Index of 23, with a particular strength in token efficiency

Today IBM released four new models: Granite 4.0 H Small (32B/9B total/active parameters), Granite 4.0 H Tiny (7B/1B), Granite 4.0 H Micro (3B/3B) and Granite 4.0 Micro (3B/3B). We evaluated Granite 4.0 H Small (in non-reasoning mode) and Granite 4.0 Micro using the Artificial Analysis Intelligence Index. The Granite 4.0 models combine a small number of standard transformer-style attention layers with a majority of Mamba layers, which IBM claims reduces memory requirements without impacting performance

Key benchmarking takeaways:
➤🧠 Granite 4.0 H Small Intelligence: In non-reasoning mode, Granite 4.0 H Small scores 23 on the Artificial Analysis Intelligence Index - a jump of +8 points compared to IBM Granite 3.3 8B (Non Reasoning). Granite 4.0 H Small places ahead of Gemma 3 27B (22) but behind Mistral Small 3.2 (29), EXAONE 4.0 32B (Non-Reasoning, 30) and Qwen3 30B A3B 2507 (Non-Reasoning, 37) in intelligence
➤⚡ Granite 4.0 Micro Intelligence: On the Artificial Analysis Intelligence Index, Granite 4.0 Micro scores 16. It places ahead of Gemma 3 4B (15) and LFM 2 2.6B (12).
➤⚙️ Token efficiency: Granite 4.0 H Small and Micro demonstrate impressive token efficiency - Granite 4.0 H Small uses 5.2M tokens, while Granite 4.0 Micro uses 6.7M tokens, to run the Artificial Analysis Intelligence Index. Both models use fewer tokens than Granite 3.3 8B (Non-Reasoning) and most other open weights non-reasoning models smaller than 40B total parameters (except Qwen3 0.6B, which uses 1.9M output tokens)

Key model details:
➤🌐 Availability: All four models are available on Hugging Face. Granite 4.0 H Small is available on Replicate and is priced at $0.06/$0.25 per 1M input/output tokens
➤📏 Context Window: 128K tokens
➤©️ Licensing: The Granite 4.0 models are available under the Apache 2.0 license
Granite 4.0 H Small’s (Non-Reasoning) output token efficiency and per-token pricing offer a compelling tradeoff between intelligence and Cost to Run the Artificial Analysis Intelligence Index
Oct 1
Reve V1 debuts at #3 in the Artificial Analysis Image Editing Leaderboard, trailing only Gemini 2.5 Flash (Nano-Banana) and Seedream 4.0!

Reve V1 is the first image editing model from Reve AI, and is built on their latest text to image model. The Reve V1 model supports both single and multi-image edits, with the ability to combine multiple reference images into a single output image.

The model is available via the Reve web app, which offers free access with a daily usage limit, or expanded usage through their Pro plan at $20/month.

Reve V1 is also accessible via the Reve API Beta priced at $40/1k images, similar to competitors like Gemini 2.5 Flash ($39/1k) and Seedream 4.0 ($30/1k).

See the Reve V1 Image Editing model for yourself in the thread below 🧵

[Prompt 1/5] Change the sign to state “SCHOOL Zone Ahead”
Aug 7
OpenAI gave us early access to GPT-5: our independent benchmarks verify a new high for AI intelligence. We have tested all four GPT-5 reasoning effort levels, revealing 23x differences in token usage and cost between the ‘high’ and ‘minimal’ options and substantial differences in intelligence

We have run our full suite of eight evaluations independently across all reasoning effort configurations of GPT-5 and are reporting benchmark results for intelligence, token usage, and end-to-end latency.

What @OpenAI released: OpenAI has released a single endpoint for GPT-5, but different reasoning efforts offer vastly different intelligence. GPT-5 with reasoning effort “High” reaches a new intelligence frontier, while “Minimal” is near GPT-4.1 level (but more token efficient).

Takeaways from our independent benchmarks:
⚙️ Reasoning effort configuration: GPT-5 offers four reasoning effort configurations: high, medium, low, and minimal. Reasoning effort options steer the model to “think” more or less hard for each query, driving large differences in intelligence, token usage, speed, and cost.
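For reference, reasoning effort is selected per request via the API. The minimal sketch below assumes the OpenAI Python SDK's Chat Completions interface and its reasoning_effort parameter; treat the exact parameter name and accepted values as assumptions to check against OpenAI's current documentation rather than verified usage.

```python
# Sketch: requesting GPT-5 at different reasoning efforts (parameter name assumed
# from OpenAI's Chat Completions API for reasoning models; verify against current docs).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

for effort in ("minimal", "low", "medium", "high"):
    response = client.chat.completions.create(
        model="gpt-5",
        reasoning_effort=effort,  # steers how hard the model "thinks" per query
        messages=[{"role": "user", "content": "Summarise the Monty Hall problem in two sentences."}],
    )
    # Higher efforts generally consume more completion (reasoning) tokens.
    print(effort, response.usage.completion_tokens, response.choices[0].message.content[:80])
```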

🧠 Intelligence achieved ranges from frontier to GPT-4.1 level: GPT-5 sets a new standard with a score of 68 on our Artificial Analysis Intelligence Index (MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME, IFBench & AA-LCR) at High reasoning effort. Medium (67) is close to o3, Low (64) sits between DeepSeek R1 and o3, and Minimal (44) is close to GPT-4.1. While High sets a new standard, the increase over o3 is not comparable to the jump from GPT-3 to GPT-4 or GPT-4o to o1.

💬 Token usage varies 23x between reasoning efforts: GPT-5 with High reasoning effort used more tokens than o3 (82M vs. 50M) to complete our Index, but still fewer than Gemini 2.5 Pro (98M) and DeepSeek R1 0528 (99M). However, Minimal reasoning effort used only 3.5M tokens which is substantially less than GPT-4.1, making GPT-5 Minimal significantly more token-efficient for similar intelligence.

📖 Long Context Reasoning: We released our own Long Context Reasoning (AA-LCR) benchmark earlier this week to test the reasoning capabilities of models across long sequence lengths (sets of documents ~100k tokens in total). GPT-5 stands out for its performance in AA-LCR, with GPT-5 in both High and Medium reasoning efforts topping the benchmark.

🤖 Agentic Capabilities: OpenAI also commented on improvements across capabilities increasingly important to how AI models are used, including agents (long horizon tool calling). We recently added IFBench to our Intelligence Index to cover instruction following and will be adding further evals to cover agentic tool calling to independently test these capabilities.

📡 Vibe checks: We’re testing the personality of the model through MicroEvals on our website, which supports running the same prompt across models and comparing results. It’s free to use; we’ll provide an update with our perspective shortly, but feel free to share your own!

See below for further analysis:

Token usage (verbosity): GPT-5 with reasoning effort high uses 23x more tokens than with reasoning effort minimal, though in doing so it achieves substantial intelligence gains; between medium and high there is less of an uplift.
Aug 6
Independent benchmarks of OpenAI’s gpt-oss models: gpt-oss-120b is the most intelligent American open weights model; it comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits

OpenAI has released two versions of gpt-oss:
➤ gpt-oss-120b (116.8B total parameters, 5.1B active parameters): Intelligence Index score of 58
➤ gpt-oss-20b (20.9B total parameters, 3.6B active parameters): Intelligence Index score of 48

Size & deployment: OpenAI has released both models in MXFP4 precision: gpt-oss-120b comes in at just 60.8GB and gpt-oss-20b just 12.8GB. This means that the 120B can be run in its native precision on a single NVIDIA H100, and the 20B can be run easily on a consumer GPU or laptop with >16GB of RAM. Additionally, the relatively small proportion of active parameters will contribute to their efficiency and speed for inference: the 5.1B active parameters of the 120B model can be contrasted with Llama 4 Scout’s 109B total parameters and 17B active (a lot less sparse). This makes it possible to get dozens of output tokens/s for the 20B on recent MacBooks.
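As a rough sanity check on those file sizes, MXFP4 stores weights in roughly 4 bits per parameter, plus a small overhead for block scales and any higher-precision tensors. The figures below are approximate and use only the parameter counts quoted above.

```python
# Approximate memory footprint of the gpt-oss models at ~4-bit (MXFP4) precision.
# Real checkpoints add block scales and some higher-precision tensors, so the
# published sizes (60.8GB and 12.8GB) land slightly above this lower bound.
BITS_PER_PARAM = 4

def approx_size_gb(total_params_billions: float, bits_per_param: float = BITS_PER_PARAM) -> float:
    bytes_total = total_params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

print(f"gpt-oss-120b: ~{approx_size_gb(116.8):.1f} GB")  # ~58.4 GB vs 60.8 GB published
print(f"gpt-oss-20b:  ~{approx_size_gb(20.9):.1f} GB")   # ~10.5 GB vs 12.8 GB published
```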

Intelligence: Both models score extremely well for their size and sparsity. We’re seeing the 120B beat o3-mini but come in behind o4-mini and o3. The 120B is the most intelligent model that can be run on a single H100 and the 20B is the most intelligent model that can be run on a consumer GPU. Both models appear to place similarly across most of our evals, indicating no particular areas of weakness.

Comparison to other open weights models: While the larger gpt-oss-120b does not come in above DeepSeek R1 0528’s score of 59 or Qwen3 235B 2507’s score of 64, it is notable that it is significantly smaller in both total and active parameters than both of those models. DeepSeek R1 has 671B total parameters and 37B active parameters, and is released natively in FP8 precision, making its total file size (and memory requirements) over 10x larger than gpt-oss-120b. Both gpt-oss-120b and 20b are text-only models (similar to competing models from DeepSeek, Alibaba and others).

Architecture: The MoE architecture appears fairly standard. The MoE router selects the top 4 experts for each token generation. The 120B has 36 layers and the 20B has 24 layers. Each layer uses Grouped Query Attention with 64 query heads and 8 KV heads. Rotary embeddings and YaRN are used to extend the context window to 128k. The 120B model activates 4.4% of total parameters per forward pass, whereas the 20B model activates 17.2% of total parameters. This may indicate that OpenAI’s perspective is that a higher degree of sparsity is optimal for larger models. It has been widely speculated that most top models from frontier labs have been sparse MoEs for most releases since GPT-4.
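The sparsity figures quoted above follow directly from the parameter counts; a quick check, using only numbers that appear in this thread:

```python
# Active-parameter share per forward pass, from the counts quoted above.
models = {
    "gpt-oss-120b": (5.1, 116.8),    # (active B, total B)
    "gpt-oss-20b": (3.6, 20.9),
    "Llama 4 Scout": (17.0, 109.0),  # for comparison: far less sparse
}
for name, (active, total) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active")
# gpt-oss-120b: 4.4%, gpt-oss-20b: 17.2%, Llama 4 Scout: 15.6%
```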

API Providers: A number of inference providers have been quick to launch endpoints. We are currently benchmarking @GroqInc, @CerebrasSystems, @FireworksAI_HQ and @togethercompute on Artificial Analysis and will add more providers as they launch endpoints.

Pricing: We’re tracking median pricing across API providers of $0.15/$0.69 per million input/output tokens for the 120B and $0.08/$0.35 for the 20B. These prices put the 120B close to 10x cheaper than OpenAI’s proprietary APIs for o4-mini ($1.1/$4.4) and o3 ($2/$8).

License: Apache 2.0 license - very permissive!

See below for further analysis:
Intelligence vs. Total Parameters: gpt-oss-120b is the most intelligent model that can fit on a single H100 GPU in its native precision.
Jul 17
🇰🇷 South Korean AI Lab Upstage AI has just launched their first reasoning model - Solar Pro 2! The 31B parameter model demonstrates impressive performance for its size, with intelligence approaching Claude 4 Sonnet in 'Thinking' mode, and it is priced very competitively

Key details:
➤ Hybrid reasoning: The model offers optionality between 'reasoning' mode and standard non-reasoning mode
➤ Korean-language ability & Sovereign AI: Based in Korea, Upstage announced superior performance in Korean language evaluations. This release aligns with countries' interests to develop sovereign AI capabilities
➤ Pricing: Competitively priced at $0.5/1M tokens (input & output), significantly cheaper than comparable models including Claude 4 Sonnet Thinking ($3/$15/M input/output tokens) and Magistral Small ($0.5/$1.5/M input/output tokens)
➤ Proprietary: @upstageai has not released the model weights, though they have open-sourced previous Solar Pro models. Whether they will release Solar Pro 2's weights remains unclear, as it wasn't mentioned in their announcement.

Full suite of our independent intelligence evaluations:
Jul 15
We’re releasing the Artificial Analysis AI Adoption Survey Report for H1 2025 based on >1,000 responses from developers, product managers and executives adopting AI

The Artificial Analysis AI Adoption Survey Report examines key trends in AI usage, analyzing adoption rates, primary use cases driving AI’s growth, and demand across chatbots, coding agents, LLM model families, providers, and chip companies.

A highlights version of the report is available for download on our website for a limited time.

We unpack 6 trends defining the adoption of AI for organizations in the first half of 2025:

1)⚡ AI has hit production: ~45% are using AI in production, while an additional 50% are prototyping or exploring uses with AI

2)💡 Engineering and R&D is the clear frontrunner use case: 66% are considering AI for Engineering/R&D, well ahead of the next most popular use cases in Customer Support and Sales & Marketing

3) 📈 Google, xAI, DeepSeek gain share while Meta and Mistral lose share: ~80% are using/considering Google Gemini, 53% DeepSeek & 31% xAI Grok, marking a substantial increase in demand since 2024

4) 🔄 Companies are increasingly diversifying their AI use: Average number of LLMs used/considered has increased from ~2.8 in 2024 to ~4.7 in 2025, as organizations mature their AI use cases

5) 🏗️ Organizations are taking different approaches to Build vs. Buy: 32% of respondents favor building; 27% buying and 25% a hybrid approach

6) 🇨🇳 Organizations are open to Chinese models, if hosted outside of China: 55% would be willing to use LLMs from China-based AI labs, if hosted outside of China

The survey was conducted between April and June 2025, collecting responses from 1,000+ individuals across 90+ countries.

Below we share excerpts covering select important takeaways:

ChatGPT dominates AI chat adoption, followed by Gemini and Claude. Other notable players include Perplexity, xAI Grok and Microsoft Copilot.
Jul 10
xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model.

We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude 4 Opus at 64 and DeepSeek R1 0528 at 68. Full results breakdown below.

This is the first time that @elonmusk's @xai has taken the lead at the AI frontier. Grok 3 scored competitively with the latest models from OpenAI, Anthropic and Google - but Grok 4 is the first time that our Intelligence Index has shown xAI in first place.

We tested Grok 4 via the xAI API. The version of Grok 4 deployed for use on X/Twitter may be different to the model available via API. Consumer application versions of LLMs typically have instructions and logic around the models that can change style and behavior.

Grok 4 is a reasoning model, meaning it ‘thinks’ before answering. The xAI API does not share reasoning tokens generated by the model.

Grok 4’s pricing is equivalent to Grok 3 at $3/$15 per 1M input/output tokens ($0.75 per 1M cached input tokens). The per-token pricing is identical to Claude 4 Sonnet, but more expensive than Gemini 2.5 Pro ($1.25/$10, for <200K input tokens) and o3 ($2/$8, after recent price decrease). We expect Grok 4 to be available via the xAI API, via the Grok chatbot on X, and potentially via Microsoft Azure AI Foundry (Grok 3 and Grok 3 mini are currently available on Azure).

Key benchmarking results:
➤ Grok 4 leads in not only our Artificial Analysis Intelligence Index but also our Coding Index (LiveCodeBench & SciCode) and Math Index (AIME24 & MATH-500)
➤ All-time high score in GPQA Diamond of 88%, representing a leap from Gemini 2.5 Pro’s previous record of 84%
➤ All-time high score in Humanity’s Last Exam of 24%, beating Gemini 2.5 Pro’s previous all-time high score of 21%. Note that our benchmark suite uses the original HLE dataset (Jan '25) and runs the text-only subset with no tools
➤ Joint highest score for MMLU-Pro and AIME 2024 of 87% and 94% respectively
➤ Speed: 75 output tokens/s, slower than o3 (188 tokens/s), Gemini 2.5 Pro (142 tokens/s), Claude 4 Sonnet Thinking (85 tokens/s) but faster than Claude 4 Opus Thinking (66 tokens/s)

Other key information:
➤ 256k token context window. This is below Gemini 2.5 Pro’s context window of 1 million tokens, but ahead of Claude 4 Sonnet and Claude 4 Opus (200k tokens), o3 (200k tokens) and R1 0528 (128k tokens)
➤ Supports text and image input
➤ Supports function calling and structured outputs

See below for further analysis 👇

Grok 4 scores higher in the Artificial Analysis Intelligence Index than any other model. Its pricing is higher than OpenAI’s o3, Google’s Gemini 2.5 Pro and Anthropic’s Claude 4 Sonnet - but lower than Anthropic’s Claude 4 Opus and OpenAI’s o3-pro.
Jun 12
Google is firing on all cylinders across AI - Gemini 2.5 Pro is equal #2 in intelligence, Veo 3 and Imagen 4 are amongst the leaders in media generation, and with TPUs they're the only vertically integrated player

🧠 Google is now equal #2 in the Artificial Analysis Intelligence Index with the recent release of Gemini 2.5 Pro (June 2025), rivaling models from OpenAI, DeepSeek and xAI

📽️ Google Veo 3 now ranks second in the Artificial Analysis Video Arena Leaderboard only behind ByteDance’s new Seedance 1.0 model

🖼️ Google Imagen 4 now occupies 2 out of the top 5 positions on the Artificial Analysis Image Arena Leaderboard

👨‍🏭 Google has a full-stack AI offering spanning the application layer, models, cloud inference and hardware (TPUs)
Google has consistently been shipping intelligence increases in its Gemini Pro series
May 29
DeepSeek’s R1 leaps over xAI, Meta and Anthropic to be tied as the world’s #2 AI Lab and the undisputed open-weights leader

DeepSeek R1 0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index, our index of 7 leading evaluations that we run independently across all leading models. That’s the same magnitude of increase as the difference between OpenAI’s o1 and o3 (62 to 70).

This positions DeepSeek R1 ahead of xAI’s Grok 3 mini (high), NVIDIA’s Llama Nemotron Ultra, Meta’s Llama 4 Maverick and Alibaba’s Qwen3 235B in intelligence, and equal to Google’s Gemini 2.5 Pro.

Breakdown of the model’s improvement:
🧠 Intelligence increases across the board: Biggest jumps seen in AIME 2024 (Competition Math, +21 points), LiveCodeBench (Code generation, +15 points), GPQA Diamond (Scientific Reasoning, +10 points) and Humanity’s Last Exam (Reasoning & Knowledge, +6 points)

🏠 No change to architecture: R1-0528 is a post-training update with no change to the V3/R1 architecture - it remains a large 671B model with 37B active parameters

🧑‍💻 Significant leap in coding skills: R1 is now matching Gemini 2.5 Pro in the Artificial Analysis Coding Index and is behind only o4-mini (high) and o3

🗯️ Increased token usage: R1-0528 used 99 million tokens to complete the evals in the Artificial Analysis Intelligence Index, 40% more than the original R1’s 71 million tokens - i.e. the new R1 thinks for longer than the original R1. This is still not the highest token usage we have seen: Gemini 2.5 Pro uses 30% more tokens than R1-0528

Takeaways for AI:
👐 The gap between open and closed models is smaller than ever: open weights models have continued to maintain intelligence gains in-line with proprietary models. DeepSeek’s R1 release in January was the first time an open-weights model achieved the #2 position and DeepSeek’s R1 update today brings it back to the same position

🇨🇳 China remains neck and neck with the US: models from China-based AI labs have all but completely caught up to their US counterparts, and this release continues the emerging trend. As of today, DeepSeek leads US-based AI labs including Anthropic and Meta in the Artificial Analysis Intelligence Index

🔄 Improvements driven by reinforcement learning: DeepSeek has shown substantial intelligence improvements with the same architecture and pre-train as their original DeepSeek R1 release. This highlights the continually increasing importance of post-training, particularly for reasoning models trained with reinforcement learning (RL) techniques. OpenAI disclosed a 10x scaling of RL compute between o1 and o3 - DeepSeek have just demonstrated that so far, they can keep up with OpenAI’s RL compute scaling. Scaling RL demands less compute than scaling pre-training and offers an efficient way of achieving intelligence gains, supporting AI Labs with fewer GPUs

See further analysis below 👇

DeepSeek has maintained its status amongst the AI labs leading in frontier AI intelligence.
May 8
Google’s Gemini 2.5 Flash costs 150x more than Gemini 2.0 Flash to run Artificial Analysis Intelligence Index

The increase is driven by:
➤ 9x more expensive output tokens - $3.5 per million with reasoning on ($0.6 with reasoning off) vs $0.4 for Gemini 2.0 Flash
➤ 17x higher token usage across our evals due to adding reasoning - the greatest volume of tokens used in reasoning that we have observed for any model to date
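Those two factors multiply: roughly 9x on output price compounded with roughly 17x on token usage gives the ~150x headline increase. A quick check using only the figures above (output tokens dominate the cost here, so input pricing is ignored in this simplification):

```python
# Why ~9x price and ~17x token usage compound to ~150x cost (output tokens only).
price_ratio = 3.5 / 0.4    # 2.5 Flash reasoning output price vs 2.0 Flash -> 8.75x
token_ratio = 17           # ~17x more tokens used across our evals
cost_ratio = price_ratio * token_ratio
print(f"{price_ratio:.1f}x price * {token_ratio}x tokens = ~{cost_ratio:.0f}x cost")
# ~8.8x * 17x = ~149x, i.e. the ~150x increase in Cost to Run the Intelligence Index
```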

This doesn’t mean Gemini 2.5 Flash is not a compelling value proposition - its 12 point bump in Artificial Analysis Intelligence Index makes it suitable for a range of use cases that may not perform sufficiently well on Gemini 2.0 Flash. With per-token pricing still slightly below OpenAI’s o4-mini, Gemini 2.5 Flash may still be a cost-effective option for certain use cases.

It does mean that Gemini 2.5 Flash with Reasoning may not be a clear upgrade for everyone - for many use cases, developers may want to stay with 2.0 Flash or use 2.5 Flash with reasoning off.

Breakdown of token usage, pricing and end-to-end latency.
Mar 28
Today’s GPT-4o update is actually big - it leapfrogs Claude 3.7 Sonnet (non-reasoning) and Gemini 2.0 Flash in our Intelligence Index and is now the leading non-reasoning model for coding

This makes GPT-4o the second highest scoring non-reasoning model (excludes o3-mini, Gemini 2.5 Pro, etc), coming in just behind DeepSeek’s V3 0324 release earlier this week.

Key benchmarking results:
➤ Significant jump in the Artificial Analysis Intelligence Index from 41 to 50, putting GPT-4o (March 2025) ahead of Claude 3.7 Sonnet
➤ Now the leading non-reasoning model for coding: 🥇#1 in the Artificial Analysis Coding Index and in LiveCodeBench, surpassing DeepSeek V3 (March 2025) and Claude 3.7 Sonnet

@OpenAI has committed an all-new AI model naming sin of simply refusing to name the model at all, so we will be referring to it as GPT-4o (March 2025).

This update has also been released in a fairly confusing way - the March 2025 version of GPT-4o is currently available:
➤ In ChatGPT, when users select GPT-4o in the model selector
➤ Via API on the chatgpt-4o-latest endpoint - a non-dated endpoint that OpenAI described at launch as intended for research use only, with developers encouraged to use the dated snapshot versions of GPT-4o for most API use cases

As of today, this means that the chatgpt-4o-latest endpoint is serving a significantly better model than the proper API versions of GPT-4o (i.e. the August 2024 and November 2024 snapshots).

We recommend some caution for developers considering moving workloads to the chatgpt-4o-latest endpoint given OpenAI’s previous guidance, and note that OpenAI will likely release a dated API snapshot soon. We also note that OpenAI prices the chatgpt-4o-latest endpoint at $5/$15 per million input/output tokens, whereas the API snapshots are priced at $2.5/$10.

See below for further analysis 👇

GPT-4o (March 2025) is now the leading non-reasoning coding model, surpassing DeepSeek V3 (March 2025) and Claude 3.7 Sonnet in the Artificial Analysis Coding Index (made up of LiveCodeBench and SciCode), and is #1 in LiveCodeBench.
Mar 25
DeepSeek takes the lead: DeepSeek V3-0324 is now the highest scoring non-reasoning model

This is the first time an open weights model is the leading non-reasoning model, a milestone for open source.

DeepSeek V3-0324 has jumped forward 7 points in Artificial Analysis Intelligence Index, now sitting ahead of all other non-reasoning models. It sits behind DeepSeek’s own R1 in Intelligence Index, as well as other reasoning models from OpenAI, Anthropic and Alibaba, but this does not take away from the impressiveness of this accomplishment. Non-reasoning models answer immediately without taking time to ‘think’, making them useful in latency-sensitive use cases.

Three months ago, DeepSeek released V3 and we wrote that there was a new leader in open source AI - noting that V3 came close to leading proprietary models from Anthropic and Google but did not surpass them.

Today, DeepSeek are not just releasing the best open source model - DeepSeek are now driving the frontier of non-reasoning open weights models, eclipsing proprietary non-reasoning models such as Gemini 2.0 Pro and Claude 3.7 Sonnet, as well as Llama 3.3 70B. This release is arguably even more impressive than R1 - and potentially indicates that R2 is going to be another significant leap forward.

Most other details are identical to the December 2024 version of DeepSeek V3, including:
➤ Context window: 128k (limited to 64k on DeepSeek’s first-party API)
➤ Total parameters: 671B (requires >700GB of GPU memory to run in native FP8 precision - still not something you can run at home!)
➤ Active parameters: 37B
➤ Native FP8 precision
➤ Text only - no multimodal inputs or outputs
➤ MIT License

DeepSeek V3-0324 marks the first time an open weights model has been the leading non-reasoning model.
Feb 13
Announcing Artificial Analysis Intelligence Index V2 - the biggest upgrade to our eval suite yet

Summary of Intelligence Index V2:
➤ Harder evals: MMLU-Pro, HLE (Humanity's Last Exam), GPQA Diamond, MATH-500, AIME 2024, SciCode, and LiveCodeBench - see below for a description of each evaluation.
➤ Independent: As always, Artificial Analysis has independently run every eval on every model - no inconsistent lab-claim results anywhere to be seen
➤ Standardized: We evaluate models under identical conditions with consistent prompting, temperature settings and answer extraction techniques
➤ Extensive sensitivity testing: We’ve run every eval in Index V2 dozens of times in our pre-launch assessment phase to understand variability, and set the number of repeats we use to achieve our target confidence intervals
➤ More robust software stack: This one is a little inside baseball but is actually a pretty big deal - we’re running tens of thousands of queries on hundreds of models so our entire benchmarking stack has to be extremely robust, and allow our team to monitor evals for errors and anomalies so we can have confidence in every number published

Artificial Analysis has independently run thousands of evals across hundreds of models to support this launch - today, we already have Intelligence Index scores for all leading models published on our updated website.

For further information regarding how models perform, the evals we have chosen to include and our methodology, see below.

Deep-dive into the evals included in Intelligence Index V2

On the Artificial Analysis website we report all eval scores individually, allowing you to see the individual components of the index and understand model strengths and weaknesses.

Reasoning and Knowledge (50% weighting):
➤ MMLU Pro: Comprehensive evaluation of advanced knowledge across domains, adapted from the original MMLU but focusing on harder questions and using a 10-option multiple-choice format
➤ Humanity's Last Exam: Recent frontier academic benchmark from the Center for AI Safety (led by Dan Hendrycks, @ai_risks)
➤ GPQA Diamond: Scientific knowledge and reasoning benchmark

Mathematical Reasoning (25% weighting):
➤ MATH-500: Mathematical problem-solving across various difficulty levels; a subset of 500 questions from Hendrycks' 2021 MATH dataset, created by OpenAI after they trained on ~90% of the original 5,000 MATH questions for reinforcement learning on o1-series models
➤ AIME 2024: Advanced mathematical problem-solving dataset from the 2024 American Invitational Mathematics Examination

Code Generation and Comprehension (25% weighting):
➤ SciCode: Python programming to solve scientific computing tasks; we test with scientist-annotated background information included in the prompt and report the sub-problem score
➤ LiveCodeBench: Python programming to solve programming scenarios derived from LeetCode, AtCoder, and Codeforces; we test 315 problems from the 1 July 2024 to 1 Jan 2025 subset from release_v5
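The category weightings above (50% reasoning and knowledge, 25% math, 25% code) determine how individual eval scores roll up into the headline Intelligence Index. The sketch below shows one straightforward way to compute such a weighted average; the equal weighting of evals within each category and the function name are illustrative assumptions, not Artificial Analysis's published methodology.

```python
# Illustrative roll-up of per-eval scores (0-100) into an Intelligence Index-style
# number using the V2 category weights. Within-category averaging is an assumption.
CATEGORY_WEIGHTS = {
    "reasoning_knowledge": 0.50,   # MMLU-Pro, Humanity's Last Exam, GPQA Diamond
    "math": 0.25,                  # MATH-500, AIME 2024
    "code": 0.25,                  # SciCode, LiveCodeBench
}

def intelligence_index(scores: dict[str, dict[str, float]]) -> float:
    index = 0.0
    for category, weight in CATEGORY_WEIGHTS.items():
        evals = scores[category]
        category_avg = sum(evals.values()) / len(evals)  # simple mean within category
        index += weight * category_avg
    return index

# Hypothetical per-eval scores for a single model:
example = {
    "reasoning_knowledge": {"MMLU-Pro": 84, "HLE": 20, "GPQA Diamond": 80},
    "math": {"MATH-500": 95, "AIME 2024": 88},
    "code": {"SciCode": 40, "LiveCodeBench": 70},
}
print(f"Index: {intelligence_index(example):.1f}")  # ~67.3 with these hypothetical scores
```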
Jan 23
DeepSeek’s first reasoning model has arrived - over 25x cheaper than OpenAI’s o1

Highlights from our initial benchmarking of DeepSeek R1:
➤ Trades blows with OpenAI’s o1 across our eval suite, scoring the second highest Artificial Analysis Quality Index ever
➤ Priced on DeepSeek’s own API at just $0.55/$2.19 per 1M input/output tokens - significantly cheaper than not just o1 but also o1-mini
➤ Served by DeepSeek at 71 output tokens/s (comparable to DeepSeek V3)
➤ Reasoning tokens are wrapped in tags, allowing developers to easily decide whether to show them to users
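The tag name does not appear in this thread (it looks to have been stripped as markup); DeepSeek's R1 API is commonly described as wrapping reasoning in <think>...</think> tags, and the sketch below assumes that convention - treat the tag name as an assumption to verify. It simply separates the reasoning from the final answer so an application can decide whether to display it.

```python
import re

# Assumes reasoning is wrapped in <think>...</think> tags (convention to verify).
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(completion: str) -> tuple[str, str]:
    """Return (reasoning, final_answer) from a raw model completion."""
    match = THINK_RE.search(completion)
    reasoning = match.group(1).strip() if match else ""
    answer = THINK_RE.sub("", completion).strip()
    return reasoning, answer

raw = "<think>The user asked for 2+2, which is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
print(answer)     # "The answer is 4."
print(reasoning)  # shown or hidden at the application's discretion
```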

Stay tuned for more detail coming next week - big upgrades to the Artificial Analysis eval suite launching soon.
DeepSeek’s first party API is impressive: both faster and cheaper than the initial offerings from other leading inference providers serving R1.

DeepSeek’s API also offers a 70% caching discount on repeated inputs (applied automatically).
Dec 27, 2024
There is a new leader in open source AI. Our independent benchmarks show China-based DeepSeek’s V3 model ahead of all open weights models released to date, beating OpenAI’s GPT-4o (Aug) and approaching Anthropic’s Claude 3.5 Sonnet (Oct).

DeepSeek V3 scores an Artificial Analysis Quality Index of 80, ahead of models like OpenAI’s GPT-4o and Meta’s Llama 3.3 70B. The only current models still ahead of DeepSeek are Google’s Gemini 2.0 Flash and OpenAI’s o1 series models. Landing ahead of Alibaba’s Qwen2.5 72B, DeepSeek is now 🇨🇳 China’s AI leader.

DeepSeek V3 uses an MoE architecture with 671B total parameters (37B active). The total parameter count is ~2.8x larger than DeepSeek V2.5.

Key benchmarking results:
➤ DeepSeek V3 outscores all leading open weights models in Artificial Analysis Quality Index, including Meta’s Llama 3.3 70B and Alibaba’s Qwen2.5 72B.
➤ DeepSeek V3 matches Anthropic’s Claude 3.5 Sonnet (Oct) and sits just below Google’s Gemini 2.0 Flash and OpenAI’s o1 series. Notably, DeepSeek V3 likely has particularly strong coding and mathematical reasoning capabilities with scores of 92% in HumanEval and 85% in MATH-500.
➤ DeepSeek’s first party API for V3 is fast, achieving an output speed of 89 tokens/sec — nearly 5x faster than DeepSeek V2.5 (18 tokens/sec). In their Technical Report, DeepSeek discloses extensive inference optimization work they have undertaken to increase speed and efficiency for serving DeepSeek V3 on their H800 cluster. DeepSeek achieves this speed increase on a ~2.8x larger model, with only a modest increase in price (pricing details below).

Key training details:
➤ DeepSeek V3 was trained on 14.8T tokens in just 2.788M NVIDIA H800 GPU hours - implying a cost of $5.6M (based on rental pricing of NVIDIA H800 at $2/hr). That’s just 57 days on DeepSeek’s 2048 H800 cluster.
➤ DeepSeek used their DeepSeek-R1 reasoning inference model for distillation. While reasoning models like OpenAI’s o1 series may not suit many use cases due to their cost and latency, this is less of a barrier for generating training data. DeepSeek’s approach of using R1 for this purpose likely has been and will be used by all major labs in 2025.
➤ DeepSeek V3 was trained on a cluster of 2048 NVIDIA H800 GPUs. As a Chinese company, DeepSeek is limited in their ability to use H100s and other NVIDIA chips by export controls. A key limitation of H800s is the reduced interconnect bandwidth (300 GB/s vs. 900 GB/s), which can impact training performance as node-to-node communication is a bottleneck. In their paper, DeepSeek discusses various ways they optimized training, including writing their own communication kernels rather than relying on tensor parallelism, and using mixed precision (FP8) training.
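The $5.6M and 57-day figures above follow directly from the disclosed GPU-hour count; a quick check, with the $2/hr H800 rental rate being the stated assumption:

```python
# Back-of-the-envelope check on DeepSeek V3's disclosed training compute.
gpu_hours = 2.788e6     # NVIDIA H800 GPU hours (from the technical report)
rental_rate = 2.0       # assumed $/hr per H800, as stated above
cluster_size = 2048     # H800 GPUs in DeepSeek's training cluster

cost = gpu_hours * rental_rate        # ~= $5.58M, i.e. ~$5.6M
days = gpu_hours / cluster_size / 24  # ~= 56.7 days, i.e. ~57 days
print(f"Implied cost: ${cost / 1e6:.1f}M, wall-clock: ~{days:.0f} days")
```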

We assess DeepSeek V3 to be a highly significant release. It reflects @deepseek_ai's significant contribution to the open source AI community, as well as the continuation of the trend of Chinese AI labs ascending to a clear global second place behind the US.

Further analysis below.

DeepSeek have continued to price their first party API aggressively. DeepSeek V3 is priced slightly higher than models like GPT-4o mini and Gemini 1.5 Flash but much cheaper than the frontier models of comparable intelligence.

Combined with a compelling cached input pricing policy (90% discount for cache hits, turned on automatically), DeepSeek V3 is by far the most cost efficient frontier-class model.

Compared to DeepSeek V2.5 (USD per million tokens):
➤ Input price increased 2x ($0.14 → $0.27)
➤ Output price increased 4x ($0.28 → $1.10)

Note that while the Artificial Analysis site shows the published standard price, DeepSeek are offering V3 at a promotional rate matching DeepSeek V2.5's pricing until early February.
Jan 16, 2024
Excited to launch - providing performance benchmarking and analysis of AI models and API hosting providers.

We’re here to help people/orgs choose the best model and provider for real AI use cases.

Should you be using GPT-4 on Azure or Mixtral on Perplexity? We’ll help you decide.

A few highlights: 🧵 ArtificialAnalysis.ai

There’s a trade-off between model quality and throughput (speed), with higher quality models typically having lower throughput (slower output).

If you need maximum quality, GPT-4 is still the only game in town (watch this space though!).

Comparing models: Quality vs. Throughput