Artificial Analysis
Nov 17 · 8 tweets · 7 min read
Announcing AA-Omniscience, our new benchmark for knowledge and hallucination across >40 topics, where all but three models are more likely to hallucinate than give a correct answer

Embedded knowledge in language models is important for many real world use cases. Without knowledge, models make incorrect assumptions and are limited in their ability to operate in real world contexts. Tools like web search can help, but models need to know what to search for (e.g. models should not search for ‘Multi Client Persistence’ for an MCP query when it clearly refers to ‘Model Context Protocol’).

Hallucination of factual information is a barrier to being able to rely on models and has been perpetuated by every major evaluation dataset. Grading correct answers with no penalty for incorrect answers creates an incentive for models (and the labs training them) to attempt every question. This problem is clearest when it comes to knowledge: factual information should never be made up, while in other contexts attempts that might not work are useful (e.g. coding new features).

Omniscience Index is the key metric we report for AA-Omniscience; it punishes hallucinations by deducting points when models guess rather than admitting they do not know the answer. AA-Omniscience shows that all but three models are more likely to hallucinate than provide a correct answer when given a difficult question. AA-Omniscience will complement the Artificial Analysis Intelligence Index by incorporating measurement of knowledge and probability of hallucination.

Details below, and more charts in the thread.

AA-Omniscience details:

- 🔢 6,000 questions across 42 topics within 6 domains (‘Business’, ‘Humanities & Social Sciences’, ‘Health’, ‘Law’, ‘Software Engineering’, and ‘Science, Engineering & Mathematics’)
- 🔍 89 sub-topics including Python data libraries, Public Policy, Taxation, and more, giving a sharper view of where models excel and where they fall short across nuanced domains
- 🔄 Incorrect answers are penalized in our Omniscience Index metric to punish hallucinations
- 📊 3 metrics: Accuracy (% correct), Hallucination Rate (% of questions answered incorrectly out of those not answered correctly, i.e. incorrect / (incorrect + abstentions)), and Omniscience Index (+1 for a correct answer, -1 for an incorrect answer, 0 for an abstention where the model did not attempt an answer)
- 🤗 Open source test dataset: We’re open sourcing 600 questions (10%) to help labs develop factual and reliable models. Topic distribution and model performance follow the full set (@huggingface link below)
- 📃 Paper: See below for a link to the research paper
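The three metrics above can be sketched in a few lines. This is an illustrative sketch, not the benchmark's actual harness: grades are assumed to arrive as "correct", "incorrect", or "abstain", and the per-question normalization of the Index is our assumption.

```python
# Sketch of the three AA-Omniscience metrics described above.
# How answers are graded in the real benchmark is out of scope here.

def omniscience_metrics(grades: list[str]) -> tuple[float, float, float]:
    n = len(grades)
    correct = grades.count("correct")
    incorrect = grades.count("incorrect")
    abstain = grades.count("abstain")

    accuracy = correct / n
    # Hallucination rate: incorrect answers as a share of questions
    # not answered correctly (incorrect + abstentions).
    missed = incorrect + abstain
    hallucination_rate = incorrect / missed if missed else 0.0
    # Omniscience Index: +1 correct, -1 incorrect, 0 abstention,
    # averaged per question (normalization is our assumption).
    omniscience_index = (correct - incorrect) / n
    return accuracy, hallucination_rate, omniscience_index

# A wrong guess drags the Index down; an abstention does not.
acc, hall, idx = omniscience_metrics(["correct", "incorrect", "abstain", "correct"])
```

Note how abstaining on an unknown question leaves the Index untouched, while guessing wrong costs a full point: this is the incentive structure the benchmark is designed to create.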

Key findings:

- 🥇 Claude 4.1 Opus takes first place in Omniscience Index, followed by last week’s GPT-5.1 and Grok 4: Even the best frontier models score only slightly above 0, meaning they produce correct answers on the difficult questions that make up AA-Omniscience only marginally more often than incorrect ones. @AnthropicAI’s leadership is driven by low hallucination rate, whereas OpenAI and xAI’s positions are primarily driven by higher accuracy (percentage correct).

- 🥇 xAI’s Grok 4 takes first place in Omniscience Accuracy (our simple ‘percentage correct’ metric), followed by GPT-5 and Gemini 2.5 Pro: @xai's win may be enabled by scaling total parameters and pre-training compute: @elonmusk revealed last week that Grok 4 has 3 trillion total parameters, which may be larger than GPT-5 and other proprietary models

- 🥇 Claude sweeps the hallucination leaderboard: Anthropic takes the top three spots for lowest hallucination rate, with Claude 4.5 Haiku leading at 28%, less than a third of the rate of GPT-5 (high) and Gemini 2.5 Pro. Claude 4.5 Sonnet and Claude 4.1 Opus follow in second and third at 48%

- 💭 High knowledge does not guarantee low hallucination: Hallucination rate measures how often a model guesses when it lacks the required knowledge. Models with the highest accuracy, including the GPT-5 models and Gemini 2.5 Pro, do not lead the Omniscience Index due to their tendency to guess rather than abstain. Anthropic models tend to manage uncertainty better, with Claude 4.5 Haiku achieving the lowest hallucination rate at 28%, ahead of 4.5 Sonnet and 4.1 Opus (48%)

- 📊 Models vary by domain: Models differ in their performance across the six domains of AA-Omniscience - no model dominates across all. While Anthropic’s Claude 4.1 Opus leads in Law, Software Engineering, and Humanities & Social Sciences, GPT-5.1 from @OpenAI achieves the highest reliability on Business questions, and xAI’s Grok 4 performs best in Health and in Science, Engineering & Mathematics. Model choice should align with the use case rather than defaulting to the overall leader

- 📈 Larger models score higher on accuracy, but not always reliability: Larger models tend to have higher levels of embedded knowledge, with Kimi K2 Thinking and DeepSeek R1 (0528) topping accuracy charts over smaller models. This advantage does not always hold on the Omniscience Index. For example, Llama 3.1 405B from @AIatMeta beats larger Kimi K2 variants due to having one of the lowest hallucination rates among models (51%)
Grok 4 by @xai, GPT-5 by @OpenAI and Gemini 2.5 Pro by @GoogleDeepMind achieve the highest accuracy in AA-Omniscience. They do not achieve the highest Omniscience Index, however, because of the low hallucination rates of @AnthropicAI’s Claude models
Models with the highest accuracy, including Grok 4, GPT-5.1 and Gemini 2.5 Pro, do not lead the Omniscience Index due to their tendency to guess rather than abstain. Claude 4.1 Opus has the best balance of accuracy (31%) and hallucination (48%), giving it the highest score in the Omniscience Index
Read more about the evaluation and methodology in our AA-Omniscience paper:
huggingface.co/datasets/Artif…

Explore sample questions and evaluate your model on the public set of AA-Omniscience with our HuggingFace dataset:
huggingface.co/datasets/Artif…

See detailed AA-Omniscience results on Artificial Analysis:
artificialanalysis.ai/evaluations/om…
The AA-Omniscience paper is now live on arXiv: arxiv.org/abs/2511.13029


More from @ArtificialAnlys

Nov 18
Gemini 3 Pro is the new leader in AI. Google has the leading language model for the first time, with Gemini 3 Pro debuting +3 points above GPT-5.1 in our Artificial Analysis Intelligence Index

@GoogleDeepMind gave us pre-release access to Gemini 3 Pro Preview. The model outperforms all other models in the Artificial Analysis Intelligence Index. It demonstrates strength across the board, coming in first in 5 of the 10 evaluations that make up Intelligence Index. Despite these intelligence gains, Gemini 3 Pro Preview shows improved token efficiency over Gemini 2.5 Pro, using significantly fewer tokens on the Intelligence Index than other leading models such as Kimi K2 Thinking and Grok 4. However, given its premium pricing ($2/$12 per million input/output tokens for ≤200K context), Gemini 3 Pro is among the most expensive models to run our Intelligence Index evaluations.

Key takeaways:

📖 Leading intelligence: Gemini 3 Pro Preview is the leading model in 5 of 10 evals in the Artificial Analysis Intelligence Index, including GPQA Diamond, MMLU-Pro, HLE, LiveCodeBench and SciCode. Its score of 37% on Humanity’s Last Exam is particularly impressive, improving on the previous best model by more than 10 percentage points. It also leads AA-Omniscience, Artificial Analysis’ new knowledge and hallucination evaluation, coming first in both Omniscience Index (our lead metric that deducts points for incorrect answers) and Omniscience Accuracy (percentage correct). Given that factual recall correlates closely with model size, this may point to Gemini 3 Pro being a much larger model than its competitors

💻 Advanced coding and agentic capabilities: Gemini 3 Pro Preview leads two of the three coding evaluations in the Artificial Analysis Intelligence Index, including an impressive 56% in SciCode, an improvement of over 10 percentage points from the previous highest score. It is also strong in agentic contexts, achieving the second highest score in Terminal-Bench Hard and Tau2-Bench Telecom

🖼️ Multimodal capabilities: Gemini 3 Pro Preview is a multi-modal model, with the ability to take text, images, video and audio as input. It scores the highest of any model on MMMU-Pro, a benchmark that tests reasoning abilities with image inputs. Google now occupies the first, third and fourth positions in our MMMU-Pro leaderboard (with GPT-5.1 taking second place just last week)

💲Premium Pricing: To measure cost, we report Cost to Run the Artificial Analysis Intelligence Index, which combines input and output token prices with token efficiency to reflect true usage cost. Despite the improvement in token efficiency over Gemini 2.5 Pro, Gemini 3 Pro Preview costs more to run. Its higher token pricing of $2/$12 USD per million input/output tokens (≤200K token context) results in a 12% increase in the cost to run the Artificial Analysis Intelligence Index compared to its predecessor, and the model is among the most expensive to run on our Intelligence Index. Google also continues to price long context workloads higher than shorter context workloads, charging $4/$18 per million input/output tokens for >200K token context.

⚡ Speed: Gemini 3 Pro Preview has comparable speeds to Gemini 2.5 Pro, with 128 output tokens per second. This places it ahead of other frontier models including GPT-5.1 (high), Kimi K2 Thinking and Grok 4. This is potentially supported by Google’s first-party TPU accelerators

Other details: Gemini 3 Pro Preview has a 1 million token context window, and includes support for tool calling, structured outputs, and JSON mode
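The tiered pricing above reduces to a small cost helper. This is a sketch using the rates quoted in this thread; the assumption that a whole request is billed at the higher tier once the input exceeds 200K tokens is ours for illustration.

```python
# Sketch of Gemini 3 Pro Preview's tiered per-token pricing as reported above
# ($2/$12 per 1M input/output tokens at <=200K context, $4/$18 beyond).
# Assumption (ours): the entire request is billed at the higher tier once
# the input exceeds 200K tokens.

def gemini3_cost_usd(input_tokens: int, output_tokens: int) -> float:
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.00, 12.00   # USD per 1M tokens
    else:
        in_rate, out_rate = 4.00, 18.00
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 100K-token prompt producing 5K output tokens:
cost = gemini3_cost_usd(100_000, 5_000)  # $0.20 input + $0.06 output = $0.26
```

The long-context surcharge means cost per request roughly doubles on the input side the moment a prompt crosses the 200K boundary, which matters for document-heavy workloads.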

See below for further analysis.
For the first time, Google has the most intelligent model, with Gemini 3 Pro Preview improving on the previous most intelligent model, OpenAI’s GPT-5.1 (high), by 3 points
Gemini 3 Pro Preview takes the top spot on the Artificial Analysis Omniscience Index, our new benchmark for measuring knowledge and hallucination across domains. Gemini 3 Pro Preview comes in first for both Omniscience Index (our lead metric that takes off points for incorrect answers) and Omniscience Accuracy (percentage correct).

Its win in Accuracy is actually much larger than its overall Index win - the gap is driven by a higher Hallucination Rate than other models (88%).

We have previously shown that Omniscience Accuracy is closely correlated with model size (total parameter count). Gemini 3 Pro’s significant lead in this metric may point to it being a much larger model than its competitors.
Nov 6
Inworld TTS 1 Max is the new leader on the Artificial Analysis Speech Arena Leaderboard, surpassing MiniMax’s Speech-02 series and OpenAI’s TTS-1 series

The Artificial Analysis Speech Arena ranks leading Text to Speech models based on human preferences. In the arena, users compare two pieces of generated speech side by side and select their preferred output without knowing which models created them. The speech arena includes prompts across four real-world categories: Customer Service, Knowledge Sharing, Digital Assistants, and Entertainment.

Inworld TTS 1 Max and Inworld TTS 1 both support 12 languages including English, Spanish, French, Korean, and Chinese, and voice cloning from 2-15 seconds of audio. Inworld TTS 1 processes ~153 characters per second of generation time on average, with the larger model, Inworld TTS 1 Max, processing ~69 characters per second on average. Both models also support voice tags, allowing users to add emotion, delivery style, and non-verbal sounds, such as “whispering”, “cough”, and “surprised”.

Both TTS-1 and TTS-1-Max are transformer-based, autoregressive models employing LLaMA-3.2-1B and LLaMA-3.1-8B respectively as their SpeechLM backbones.

See the leading models in the Speech Arena, and listen to sample clips below 🎧
Sample prompt on Inworld TTS 1 Max: “Your gut microbiome contains trillions of bacteria that influence digestion, immunity, and even mental health through the gut-brain axis.”
Oct 2
IBM has launched Granite 4.0 - a new family of open weights language models ranging in size from 3B to 32B. Artificial Analysis was provided pre-release access, and our benchmarking shows Granite 4.0 H Small (32B/9B total/active parameters) scoring an Intelligence Index of 23, with a particular strength in token efficiency

Today IBM released four new models: Granite 4.0 H Small (32B/9B total/active parameters), Granite 4.0 H Tiny (7B/1B), Granite 4.0 H Micro (3B/3B) and Granite 4.0 Micro (3B/3B). We evaluated Granite 4.0 H Small (in non-reasoning mode) and Granite 4.0 Micro using the Artificial Analysis Intelligence Index. Granite 4.0 models combine a small number of standard transformer-style attention layers with a majority of Mamba layers, an approach IBM claims reduces memory requirements without impacting performance

Key benchmarking takeaways:
➤🧠 Granite 4.0 H Small Intelligence: In non-reasoning mode, Granite 4.0 H Small scores 23 on the Artificial Analysis Intelligence Index - a jump of +8 points on the Index compared to IBM Granite 3.3 8B (Non-Reasoning). Granite 4.0 H Small places ahead of Gemma 3 27B (22) but behind Mistral Small 3.2 (29), EXAONE 4.0 32B (Non-Reasoning, 30) and Qwen3 30B A3B 2507 (Non-Reasoning, 37) in intelligence
➤⚡ Granite 4.0 Micro Intelligence: On the Artificial Analysis Intelligence Index, Granite 4.0 Micro scores 16. It places ahead of Gemma 3 4B (15) and LFM 2 2.6B (12).
➤⚙️ Token efficiency: Granite 4.0 H Small and Micro demonstrate impressive token efficiency - Granite 4.0 H Small uses 5.2M tokens, while Granite 4.0 Micro uses 6.7M tokens to run the Artificial Analysis Intelligence Index. Both models use fewer tokens than Granite 3.3 8B (Non-Reasoning) and most other open weights non-reasoning models smaller than 40B total parameters (except Qwen3 0.6B, which uses 1.9M output tokens)

Key model details:
➤🌐 Availability: All four models are available on Hugging Face. Granite 4.0 H Small is available on Replicate and is priced at $0.06/$0.25 per 1M input/output tokens
➤📏 Context Window: 128K tokens
➤©️ Licensing: The Granite 4.0 models are available under the Apache 2.0 license
Granite 4.0 H Small’s (Non-Reasoning) output token efficiency and per-token pricing offer a compelling tradeoff between intelligence and Cost to Run Artificial Analysis Intelligence Index
In the category of Open Weights Non-Reasoning models smaller than 40B total parameters, Granite 4.0 H Small sits on the frontier of the tradeoff between intelligence and Output Tokens Used in Artificial Analysis Intelligence Index
Oct 1
Reve V1 debuts at #3 in the Artificial Analysis Image Editing Leaderboard, trailing only Gemini 2.5 Flash (Nano-Banana) and Seedream 4.0!

Reve V1 is the first image editing model from Reve AI, and is built on their latest text to image model. The Reve V1 model supports both single and multi-image edits, with the ability to combine multiple reference images into a single output image.

The model is available via the Reve web app, which offers free access with a daily usage limit, or expanded usage through their Pro plan at $20/month.

Reve V1 is also accessible via the Reve API Beta priced at $40/1k images, similar to competitors like Gemini 2.5 Flash ($39/1k) and Seedream 4.0 ($30/1k).

See the Reve V1 Image Editing model for yourself in the thread below 🧵
[Prompt 1/5] Change the sign to state “SCHOOL Zone Ahead”
[Prompt 2/5] Change the stroke to freestyle
Aug 7
OpenAI gave us early access to GPT-5: our independent benchmarks verify a new high for AI intelligence. We have tested all four GPT-5 reasoning effort levels, revealing 23x differences in token usage and cost between the ‘high’ and ‘minimal’ options and substantial differences in intelligence

We have run our full suite of eight evaluations independently across all reasoning effort configurations of GPT-5 and are reporting benchmark results for intelligence, token usage, and end-to-end latency.

What @OpenAI released: OpenAI has released a single endpoint for GPT-5, but different reasoning efforts offer vastly different intelligence. GPT-5 with reasoning effort “High” reaches a new intelligence frontier, while “Minimal” is near GPT-4.1 level (but more token efficient).

Takeaways from our independent benchmarks:
⚙️ Reasoning effort configuration: GPT-5 offers four reasoning effort configurations: high, medium, low, and minimal. Reasoning effort options steer the model to “think” more or less hard for each query, driving large differences in intelligence, token usage, speed, and cost.

🧠 Intelligence achieved ranges from frontier to GPT-4.1 level: GPT-5 sets a new standard with a score of 68 on our Artificial Analysis Intelligence Index (MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME, IFBench & AA-LCR) at High reasoning effort. Medium (67) is close to o3, Low (64) sits between DeepSeek R1 and o3, and Minimal (44) is close to GPT-4.1. While High sets a new standard, the increase over o3 is not comparable to the jump from GPT-3 to GPT-4 or GPT-4o to o1.

💬 Token usage varies 23x between reasoning efforts: GPT-5 with High reasoning effort used more tokens than o3 (82M vs. 50M) to complete our Index, but still fewer than Gemini 2.5 Pro (98M) and DeepSeek R1 0528 (99M). However, Minimal reasoning effort used only 3.5M tokens which is substantially less than GPT-4.1, making GPT-5 Minimal significantly more token-efficient for similar intelligence.

📖 Long Context Reasoning: We released our own Long Context Reasoning (AA-LCR) benchmark earlier this week to test the reasoning capabilities of models across long sequence lengths (sets of documents ~100k tokens in total). GPT-5 stands out for its performance in AA-LCR, with GPT-5 in both High and Medium reasoning efforts topping the benchmark.

🤖 Agentic Capabilities: OpenAI also commented on improvements across capabilities increasingly important to how AI models are used, including agents (long horizon tool calling). We recently added IFBench to our Intelligence Index to cover instruction following and will be adding further evals to cover agentic tool calling to independently test these capabilities.

📡 Vibe checks: We’re testing the personality of the model through MicroEvals on our website which supports running the same prompt across models and comparing results. It’s free to use, we’ll provide an update with our perspective shortly but feel free to share your own!
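The headline 23x figure above follows directly from the token totals reported for completing the Intelligence Index, and can be checked with trivial arithmetic:

```python
# Tokens used to complete the Artificial Analysis Intelligence Index,
# as reported above for GPT-5's highest and lowest reasoning efforts.
TOKENS_BY_EFFORT = {"high": 82_000_000, "minimal": 3_500_000}

# Spread between the extremes: roughly 23x, matching the headline figure.
ratio = TOKENS_BY_EFFORT["high"] / TOKENS_BY_EFFORT["minimal"]
```

Since output tokens dominate API cost for reasoning models, this spread translates almost directly into a similar spread in cost per query between the two configurations.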

See below for further analysis:
Token usage (verbosity): GPT-5 with reasoning effort high uses 23x more tokens than with reasoning effort minimal. In doing so it achieves substantial intelligence gains, though between medium and high the uplift is smaller.
Individual intelligence benchmark results: GPT-5 performs well across our intelligence evaluations.
Aug 6
Independent benchmarks of OpenAI’s gpt-oss models: gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits

OpenAI has released two versions of gpt-oss:
➤ gpt-oss-120b (116.8B total parameters, 5.1B active parameters): Intelligence Index score of 58
➤ gpt-oss-20b (20.9B total parameters, 3.6B active parameters): Intelligence Index score of 48

Size & deployment: OpenAI has released both models in MXFP4 precision: gpt-oss-120b comes in at just 60.8GB and gpt-oss-20b just 12.8GB. This means that the 120B can be run in its native precision on a single NVIDIA H100, and the 20B can be run easily on a consumer GPU or laptop with >16GB of RAM. Additionally, the relatively small proportion of active parameters will contribute to their efficiency and speed for inference: the 5.1B active parameters of the 120B model can be contrasted with Llama 4 Scout’s 109B total parameters and 17B active (a lot less sparse). This makes it possible to get dozens of output tokens/s for the 20B on recent MacBooks.
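As a rough sanity check on those file sizes, weight memory scales with parameter count times bits per weight. The ~4.25 bits/weight figure for MXFP4 (4-bit values plus shared block scales) is an approximation we assume here, and non-quantized layers (embeddings, norms) are ignored, which is why the estimates land near but not exactly on the reported sizes.

```python
# Rough estimate of weight memory for a model stored at a given precision.
# MXFP4 ~= 4.25 bits per weight is an assumption for this sketch.

def approx_weights_gb(total_params_billion: float, bits_per_weight: float = 4.25) -> float:
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9

gb_120b = approx_weights_gb(116.8)  # ~62 GB, near the reported 60.8GB; fits an 80GB H100
gb_20b = approx_weights_gb(20.9)    # ~11 GB, near the reported 12.8GB
```

The same arithmetic shows why an FP8 model like DeepSeek R1 (671B params at ~8 bits/weight, i.e. ~671GB) needs a multi-GPU node while gpt-oss-120b does not.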

Intelligence: Both models score extremely well for their size and sparsity. We’re seeing the 120B beat o3-mini but come in behind o4-mini and o3. The 120B is the most intelligent model that can be run on a single H100 and the 20B is the most intelligent model that can be run on a consumer GPU. Both models appear to place similarly across most of our evals, indicating no particular areas of weakness.

Comparison to other open weights models: While the larger gpt-oss-120b does not come in above DeepSeek R1 0528’s score of 59 or Qwen3 235B 2507’s score of 64, it is notable that it is significantly smaller in both total and active parameters than both of those models. DeepSeek R1 has 671B total parameters and 37B active parameters, and is released natively in FP8 precision, making its total file size (and memory requirements) over 10x larger than gpt-oss-120b. Both gpt-oss-120b and 20b are text-only models (similar to competing models from DeepSeek, Alibaba and others).

Architecture: The MoE architecture appears fairly standard. The MoE router selects the top 4 experts for each token generation. The 120B has 36 layers and the 20B has 24 layers. Each layer uses Grouped Query Attention with 64 query heads and 8 KV heads. Rotary embeddings and YaRN are used to extend the context window to 128k. The 120B model activates 4.4% of total parameters per forward pass, whereas the 20B model activates 17.2% of total parameters. This may indicate that OpenAI’s perspective is that a higher degree of sparsity is optimal for larger models. It has been widely speculated that most top models from frontier labs have been sparse MoEs for most releases since GPT-4.
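The sparsity percentages quoted follow directly from the total/active parameter counts reported earlier in this thread:

```python
# Active-parameter fraction: share of weights used per forward pass.
def active_fraction(active_billion: float, total_billion: float) -> float:
    return active_billion / total_billion

f_120b = active_fraction(5.1, 116.8)  # ~4.4% of parameters per forward pass
f_20b = active_fraction(3.6, 20.9)    # ~17.2%
```

The 4x difference in active fraction between the two models is the basis for the observation that OpenAI appears to favor higher sparsity at larger scale.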

API Providers: A number of inference providers have been quick to launch endpoints. We are currently benchmarking @GroqInc, @CerebrasSystems, @FireworksAI_HQ and @togethercompute on Artificial Analysis and will add more providers as they launch endpoints.

Pricing: We’re tracking median pricing across API providers of $0.15/$0.69 per million input/output tokens for the 120B and $0.08/$0.35 for the 20B. These prices put the 120B close to 10x cheaper than OpenAI’s proprietary APIs for o4-mini ($1.1/$4.4) and o3 ($2/$8).

License: Apache 2.0 license - very permissive!

See below for further analysis:
Intelligence vs. Total Parameters: gpt-oss-120b is the most intelligent model that can fit on a single H100 GPU in its native precision.
Pricing: Across the API providers who have launched day one API coverage, we’re seeing median prices of $0.15/$0.69 per million input/output tokens for the 120B and $0.08/$0.35 for the 20B. This makes both gpt-oss models highly cost efficient options for developers.
