Post

More from @ArtificialAnlys

Artificial Analysis

@ArtificialAnlys

May 8

Google’s Gemini 2.5 Flash costs 150x more than Gemini 2.0 Flash to run Artificial Analysis Intelligence Index

The increase is driven by:
➤ 9x more expensive output tokens - $3.5 per million with reasoning on ($0.6 with reasoning off) vs $0.4 for Gemini 2.0 Flash
➤ 17x higher token usage across our evals due to adding reasoning - the greatest volume of tokens used in reasoning that we have observed for any model to date

This doesn’t mean Gemini 2.5 Flash is not a compelling value proposition - its 12 point bump in Artificial Analysis Intelligence Index makes it suitable for a range of use cases that may not perform sufficiently well on Gemini 2.0 Flash. With per-token pricing still slightly below OpenAI’s o4-mini, Gemini 2.5 Flash may still be a cost-effective option for certain use cases.

It does mean that Gemini 2.5 Flash with Reasoning may not be a clear upgrade for everyone - for many use cases, developers may want to stay with 2.0 Flash or use 2.5 Flash with reasoning off.

Breakdown of token usage, pricing and end-to-end latency.

See further details and other comparisons: artificialanalysis.ai/models?models=…

Read 4 tweets

Artificial Analysis

@ArtificialAnlys

Mar 28

Today’s GPT-4o update is actually big - it leapfrogs Claude 3.7 Sonnet (non-reasoning) and Gemini 2.0 Flash in our Intelligence Index and is now the leading non-reasoning model for coding

This makes GPT-4o the second highest scoring non-reasoning model (excludes o3-mini, Gemini 2.5 Pro, etc), coming in just behind DeepSeek’s V3 0324 release earlier this week.

Key benchmarking results:
➤ Significant jump in the Artificial Analysis Intelligence Index from 41 to 50, putting GPT-4o (March 2025) ahead of Claude 3.7 Sonnet
➤ Now the the leading non-reasoning model for coding: 🥇#1 in the Artificial Analysis Coding Index and in LiveCodeBench, surpassing DeepSeek V3 (March 2025) and Claude 3.7 Sonnet

@OpenAI has committed an all-new AI model naming sin of simply refusing to name the model at all, so we will be referring to it as GPT-4o (March 2025).

This update has also been released in a fairly confusing way - the March 2025 version of GPT-4o is currently available:
➤ In ChatGPT, when users select GPT-4o in the model selector
➤ Via API on the chatgpt-4o-latest endpoint - a non-dated endpoint that OpenAI described at launch as intended for research use only, with developers encouraged to use the dated snapshot versions of GPT-4o for most API use cases

As of today, this means that the chatgpt-4o-latest endpoint is serving a significantly better model than the proper API versions GPT-4o (ie. the August 2024 and November 2024 snapshots).

We recommend some caution for developers considering moving workloads to the chatgpt-4o-latest endpoint given OpenAI’s previous guidance, and note that OpenAI will likely release a dated API snapshot soon. We also note that OpenAI prices the chatgpt-4o-latest endpoint at $5/$15 per million input/output tokens, whereas the API snapshots are priced at $2.5/$10.

See below for further analysis 👇

GPT-4o (March 2025) is now the leading non-reasoning coding model, surpassing DeepSeek V3 (March 2025) and Claude 3.7 Sonnet in the Artificial Analysis Coding Index (made up of LiveCodeBench and SciCode) and is #1 in LiveCodeBench

GPT-4o (March 2025) still lags behind reasoning models, though these can be considered separately considering their higher latency and typically higher cost

Read 5 tweets

Artificial Analysis

@ArtificialAnlys

Mar 25

DeepSeek takes the lead: DeepSeek V3-0324 is now the highest scoring non-reasoning model

This is the first time an open weights model is the leading non-reasoning model, a milestone for open source.

DeepSeek V3-0324 has jumped forward 7 points in Artificial Analysis Intelligence Index, now sitting ahead of all other non-reasoning models. It sits behind DeepSeek’s own R1 in Intelligence Index, as well as other reasoning models from OpenAI, Anthropic and Alibaba, but this does not take away from the impressiveness of this accomplishment. Non-reasoning models answer immediately without taking time to ‘think’, making them useful in latency-sensitive use cases.

Three months ago, DeepSeek released V3 and we we wrote that there is a new leader in open source AI - noting that V3 came close to leading proprietary models from Anthropic and Google but did not surpass them.

Today, DeepSeek are not just releasing the best open source model - DeepSeek are now driving the frontier of non-reasoning open weights models, eclipsing all proprietary non-reasoning models, including Gemini 2.0 Pro, Claude 3.7 Sonnet and Llama 3.3 70B. This release is arguably even more impressive than R1 - and potentially indicates that R2 is going to be another significant leap forward.

Most other details are identical to the December 2024 version of DeepSeek V3, including:
➤ Context window: 128k (limited to 64k on DeepSeek’s first-party API)
➤ Total parameters: 671B (requires >700GB of GPU memory to run in native FP8 precision - still not something you can run at home!)
➤ Active parameters: 37B
➤ Native FP8 precision
➤Text only - no multimodal inputs or outputs
➤ MIT License

DeepSeek V3-0324 marks the first time an open weights model has been the leading non-reasoning model.

Compared to leading reasoning models, including DeepSeek’s own R1, DeepSeek V3-0324 remains behind - but for many uses, the increased latency associated with letting reasoning models ‘think’ before answering makes them unusable.

Read 4 tweets

Artificial Analysis

@ArtificialAnlys

Feb 13

Announcing Artificial Analysis Intelligence Index V2 - the biggest upgrade to our eval suite yet

Summary of Intelligence Index V2:
➤ Harder evals: MMLU-Pro, HLE (Humanity's Last Exam), GPQA Diamond, MATH-500, AIME 2024, SciCode, and LiveCodeBench - see below for a description of each evaluation.
➤ Independent: As always, Artificial Analysis has independently run every eval on every model - no inconsistent lab-claim results anywhere to be seen
➤ Standardized: We evaluate models under identical conditions with consistent prompting, temperature settings and answer extraction techniques
➤ Extensive sensitivity testing: We’ve run every eval in Index V2 dozens of times in our pre-launch assessment phase to understand variability, and set the number of repeats we use to achieve our target confidence intervals
➤ More robust software stack: This one is a little inside baseball but is actually a pretty big deal - we’re running tens of thousands of queries on hundreds of models so our entire benchmarking stack has to be extremely robust, and allow our team to monitor evals for errors and anomalies so we can have confidence in every number published

Artificial Analysis has independently run thousands of evals across hundreds of models to support this launch - today, we already have Intelligence Index scores for all leading models published on our updated website.

For further information regarding how models perform, the evals we have chosen to include and our methodology, see below.

Deep-dive into the evals included in Intelligence Index V2

On the Artificial Analysis website we report all eval scores individually allowing you to understand the individual components of the index and understand model strengths and weaknesses.

Reasoning and Knowledge (50% weighting):
➤ MMLU Pro: Comprehensive evaluation of advanced knowledge across domains, adapted from original MMLU but focusing on harder questions and using a 10 option multi-choice format
➤ Humanity's Last Exam: Recent frontier academic benchmark from the Centre for AI Safety (led by Dan Hendrycks, @ai_risks)
➤ GPQA Diamond: Scientific knowledge and reasoning benchmark

Mathematical Reasoning (25% weighting):
➤ MATH-500: Mathematical problem-solving across various difficulty levels; a subset of 500 questions from Hendrycks' 2021 MATH dataset, created by OpenAI as a consequence of OpenAI training on ~90% of the original 5000 MATH questions for reinforcement learning on o1-series models
➤ AIME 2024: Advanced mathematical problem-solving dataset from the 2024 American Invitational Mathematics Examination

Code Generation and Comprehension (25% weighting):
➤ SciCode: Python programming to solve scientific computing tasks; we test with scientist-annotated background information included in the prompt and report the sub-problem score
➤ LiveCodeBench: Python programming to solve programming scenarios derived from LeetCode, AtCoder, and Codeforces; we test 315 problems from the 1 July 2024 to 1 Jan 2025 subset from release_v5

Artificial Analysis Intelligence Index runs on both reasoning and non-reasoning models. Our Intelligence Index clearly shows reasoning models outperforming non-reasoning and DeepSeek R1 rivaling OpenAI’s o1 and o3-mini.

Read 5 tweets

Artificial Analysis

@ArtificialAnlys

Jan 23

DeepSeek’s first reasoning model has arrived - over 25x cheaper than OpenAI’s o1

Highlights from our initial benchmarking of DeepSeek R1:
➤ Trades blows with OpenAI’s o1 across our eval suite to score the second highest in Artificial Analysis Quality Index ever
➤ Priced on DeepSeek’s own API at just $0.55/$2.19 input/output - significantly cheaper than not just o1 but o1-mini
➤ Served by DeepSeek at 71 output tokens/s (comparable to DeepSeek V3)
➤ Reasoning tokens are wrapped in tags, allowing developers to easily decide whether to show them to users

Stay tuned for more detail coming next week - big upgrades to the Artificial Analysis eval suite launching soon.

DeepSeek’s first party API is impressive: both faster and cheaper than the initial offerings from other leading inference providers serving R1.

DeepSeek’s API also offers a 70% off caching discount on repeated inputs (automatically applied).

Compared to non-reasoning models, DeepSeek R1 takes a long time to begin returning output tokens.

Read 4 tweets

Artificial Analysis

@ArtificialAnlys

Dec 27, 2024

There is a new leader in open source AI. Our independent benchmarks show China-based DeepSeek’s V3 model ahead of all open weights models released to date, beating OpenAI’s GPT-4o (Aug) and approaching Anthropic’s Claude 3.5 Sonnet (Oct).

DeepSeek V3 scores an Artificial Analysis Quality Index of 80, ahead of models like OpenAI’s GPT-4o and Meta’s Llama 3.3 70B. The only current models still ahead of DeepSeek are Google’s Gemini 2.0 Flash and OpenAI’s o1 series models. Landing ahead of Alibaba’s Qwen2.5 72B, DeepSeek is now 🇨🇳 China’s AI leader.

DeepSeek V3 uses an MoE architecture with 671B total parameters (37B active). The total parameter count is ~2.8x larger than DeepSeek V2.5.

Key benchmarking results:
➤ DeepSeek V3 outscores all leading open weights models in Artificial Analysis Quality Index, including Meta’s Llama 3.3 70B and Alibaba’s Qwen2.5 72B.
➤ DeepSeek V3 matches Anthropic’s Claude 3.5 Sonnet (Oct) and sits just below Google’s Gemini 2.0 Flash and OpenAI’s o1 series. Notably, DeepSeek V3 likely has particularly strong coding and mathematical reasoning capabilities with scores of 92% in HumanEval and 85% in MATH-500.
➤ DeepSeek’s first party API for V3 is fast, achieving an output speed of 89 tokens/sec — 4x faster than DeepSeek V2.5 (18 tokens/sec). In their Technical Report, DeepSeek discloses extensive inference optimization work they have undertaken to increase speed and efficiency for serving DeepSeek V3 on their H800 cluster. DeepSeek achieves this speed increase on a ~2.8x larger model, with only a modest increase in price (pricing details below).

Key training details:
➤ DeepSeek V3 was trained on 14.8T tokens in just 2.788M NVIDIA H800 GPU hours - implying a cost of $5.6M (based on rental pricing of NVIDIA H800 at $2/hr). That’s just 57 days on DeepSeek’s 2048 H800 cluster.
➤ DeepSeek used their DeepSeek-R1 reasoning inference model for distillation. While reasoning models like OpenAI’s o1 series may not suit many use cases due to their cost and latency, this is less of a barrier for generating training data. DeepSeek’s approach of using R1 for this purpose likely has been and will be used by all major labs in 2025.
➤ DeepSeek V3 was trained on a cluster of 2048 NVIDIA H800 GPUs. As a Chinese company, DeepSeek is limited in their ability to use H100s and other NVIDIA chips by export controls. A key limitation of H800s is the reduced interconnect bandwidth (300 GB/s vs. 900 GB/s) which can impact training performance as node-to-node communication is a bottleneck. DeepSeek in their paper discussed various ways of optimizing training including through writing their own communication kernels rather than using tensor parallelism and using mixed precision (FP8) training.

We assess DeepSeek V3 to be a highly significant release. It reflects @deepseek_ai's significant contribution to the open source AI community, as well as the continuation of the trend of Chinese AI labs ascending to a clear global second place behind the US.

Further analysis below.

DeepSeek have continued to price their first party API aggressively. DeepSeek V3 is priced slightly higher than models like GPT-4o mini and Gemini 1.5 Flash but much cheaper than the frontier models of comparable intelligence.

Combined with a compelling cached input pricing policy (90% discount for cache hits, turned on automatically), DeepSeek V3 is by far the most cost efficient frontier-class model.

Compared to DeepSeek V2.5 (USD per million tokens):
➤ Input price increased 2x ($0.14 → $0.27)
➤ Output price increased 4x ($0.28 → $1.10)

Note that while the Artificial Analysis site shows the published standard price, DeepSeek are offering V3 at a promotional rate at the same pricing as DeepSeek V2.5 until early February.

DeepSeek V3’s multilingual performance is also strong, consistently outscoring other open weights models across all languages we benchmark.

Read 4 tweets

Share this page!

Enter URL or ID to Unroll

Artificial Analysis

Try unrolling a thread yourself!

More from @ArtificialAnlys

Artificial Analysis

Artificial Analysis

Artificial Analysis

Artificial Analysis

Artificial Analysis

Artificial Analysis

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!