@GoogleDeepMind's new Gemini 1.5 Pro (Experimental 0801) has been tested in Arena for the past week, gathering over 12K community votes.
For the first time, Google Gemini has claimed the #1 spot, surpassing GPT-4o/Claude-3.5 with an impressive score of 1300 (!), and also achieving #1 on our Vision Leaderboard.
Gemini 1.5 Pro (0801) excels in multi-lingual tasks and delivers robust performance in technical areas like Math, Hard Prompts, and Coding.
Huge congrats to @GoogleDeepMind on this remarkable milestone!
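For context on where these Arena scores come from: ratings are fit from pairwise human preference votes. Below is a minimal illustrative sketch using simple Elo-style updates as a stand-in for the leaderboard's actual statistical pipeline (the real one uses a Bradley-Terry fit with confidence intervals); model names and votes are made up.

```python
# Minimal sketch: derive Arena-style ratings from pairwise preference votes
# via Elo updates. Illustrative only -- the real leaderboard uses a
# Bradley-Terry maximum-likelihood fit over millions of votes.

def expected(r_a, r_b):
    """Probability that A beats B under the Elo/Bradley-Terry model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser, k=32, base=1000.0):
    """Apply one pairwise vote: `winner` beat `loser`."""
    r_w = ratings.get(winner, base)
    r_l = ratings.get(loser, base)
    p_w = expected(r_w, r_l)
    ratings[winner] = r_w + k * (1 - p_w)
    ratings[loser] = r_l - k * (1 - p_w)

ratings = {}
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
for winner, loser in votes:
    update(ratings, winner, loser)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
# model-a, with two wins, ends up ranked first
```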
News: the latest ChatGPT-4o (2025-03-26) jumps to #2 on Arena, surpassing GPT-4.5!
Highlights
- Significant improvement over the January version (+30 pts, #5->#2)
- Tied #1 in Coding and Hard Prompts; top-2 across ALL categories
- Matches or surpasses GPT-4.5 at roughly 10x lower cost
Congrats @OpenAI for the new milestone! View below for more insights ⬇️
We saw clear leaps in performance over the previous ChatGPT-4o release:
- Math #14 -> #2
- Hard Prompts #7 -> #1
- Coding #5 -> #1
Check out the ChatGPT-4o-latest and all frontier models at: lmarena.ai
BREAKING: Gemini 2.5 Pro is now #1 on the Arena leaderboard - the largest score jump ever (+40 pts vs Grok-3/GPT-4.5)! 🏆
Tested under codename "nebula"🌌, Gemini 2.5 Pro ranked #1🥇 across ALL categories and UNIQUELY #1 in Math, Creative Writing, Instruction Following, Longer Query, and Multi-Turn!
Massive congrats to @GoogleDeepMind for this incredible Arena milestone! 🙌
More highlights in thread👇
Gemini 2.5 Pro is #1 across ALL categories: tied #1 with Grok-3/GPT-4.5 in Hard Prompts and Coding, and ahead of the field in all the rest 🏇🏆
Gemini 2.5 Pro ranked #1 on the Vision Arena 🖼️ leaderboard!
Introducing Prompt-to-leaderboard (P2L): a real-time LLM leaderboard tailored exactly to your use case!
P2L trains an LLM to produce "prompt-specific" leaderboards: input any prompt and get a ranking tailored to it.
The model is trained on 2M human preference votes from Chatbot Arena.
P2L Highlights:
🔹Instant leaderboard for any prompt 🗿
🔹Optimal model routing (hit #1 on Chatbot Arena in Jan 2025 with 1395 score 🧏)
🔹Fine-grained model strength & weakness analysis 🤓
Check out our demo and thread below for more details!
Use case 1: Optimal Routing
If we know which models are best per-prompt, that makes optimal routing easy!
- Performance: P2L-router (experimental-router-0112) is #1 on Chatbot Arena in Jan 2025 with a score of 1395 (+20 pts over the best candidate model)
- We also developed a cost-constrained P2L router that traces the cost-performance Pareto frontier
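The routing idea above can be sketched in a few lines: given the prompt-specific scores that P2L would predict for each candidate model, pick the top scorer, optionally filtering by a per-query cost budget. The scores, prices, and the `route` helper below are hypothetical, for illustration only.

```python
# Hypothetical P2L-style router: choose the best model for a prompt,
# optionally subject to a cost budget. All numbers are made up.

def route(prompt_scores, prices=None, budget=None):
    """Return the highest-scoring model, filtered by a per-query budget."""
    candidates = prompt_scores
    if prices is not None and budget is not None:
        candidates = {m: s for m, s in prompt_scores.items()
                      if prices[m] <= budget}
    if not candidates:
        raise ValueError("no model fits the budget")
    return max(candidates, key=candidates.get)

# Prompt-specific scores (as P2L would predict) and per-query prices.
scores = {"big-model": 1395.0, "mid-model": 1340.0, "small-model": 1290.0}
prices = {"big-model": 10.0, "mid-model": 2.0, "small-model": 0.3}

best = route(scores)                       # unconstrained: "big-model"
cheap = route(scores, prices, budget=5.0)  # cost-capped: "mid-model"
```

Sweeping the budget over a range of values is what yields the cost-performance Pareto frontier.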
Use case 2: Domain-Specific Leaderboards
P2L can aggregate rankings of prompts within a category to produce an adaptive category ranking →
e.g., Find the best models for SQL queries instantly!
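One simple way to picture this aggregation: average each model's prompt-specific scores over all prompts in a category to get an adaptive category leaderboard. This is a hedged simplification of what P2L actually does (which combines prompt-conditioned Bradley-Terry coefficients); the SQL prompts and scores below are illustrative.

```python
# Illustrative category aggregation: average prompt-specific scores
# across a category's prompts. Simplified stand-in for P2L's aggregation.
from collections import defaultdict

def category_leaderboard(per_prompt_scores):
    """per_prompt_scores: list of {model: score} dicts, one per prompt."""
    totals, counts = defaultdict(float), defaultdict(int)
    for board in per_prompt_scores:
        for model, score in board.items():
            totals[model] += score
            counts[model] += 1
    avg = {m: totals[m] / counts[m] for m in totals}
    return sorted(avg, key=avg.get, reverse=True)

# Hypothetical prompt-specific scores for three SQL prompts.
sql_prompts = [
    {"model-a": 1350, "model-b": 1320},
    {"model-a": 1310, "model-b": 1360},
    {"model-a": 1340, "model-b": 1300},
]
ranking = category_leaderboard(sql_prompts)
# model-a averages ~1333 vs ~1327 for model-b, so it ranks first
```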
BREAKING: @xAI early version of Grok-3 (codename "chocolate") is now #1 in Arena! 🏆
Grok-3 is:
- First-ever model to break 1400 score!
- #1 across all categories, a milestone that keeps getting harder to achieve
Huge congratulations to @xAI on this milestone! View thread 🧵 for more insights into Grok-3's performance after ~8K votes in the Arena.
Here you can see @xAI Grok-3’s performance across all the top categories:
🔹 Overall w/ Style Control
🔹 Hard Prompts & Hard Prompts w/ Style Control
🔹 Coding
🔹 Math
🔹 Creative Writing
🔹 Instruction Following
🔹 Longer Query
🔹 Multi-Turn
In the Coding category, Grok-3 surpassed top reasoning models like o1 and Gemini-thinking.
An interactive plot of price vs. performance trade-offs for LLMs.
Frontier efficiency models:
🔹 Gemini-2.0-Flash/Lite by @GoogleDeepMind
🔹 DeepSeek-R1 by @deepseek_ai
🔹 GPT-4o by @OpenAI
🔹 Yi-Lightning by @01AI_Yi
🔹 Ministral 8B by @MistralAI
LLM efficiency is accelerating—kudos to the labs driving the frontier!
You can select organizations to highlight their models in Arena-Price Plot.
You can select a specific category (e.g., Coding, Math, Creative Writing, …) for your tasks.
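The "frontier efficiency" idea behind the plot is straightforward to compute: a model sits on the price-performance Pareto frontier if no other model is both cheaper and higher-scoring. A small sketch, with placeholder prices and scores (not the real plot data):

```python
# Sketch: compute the price-vs-performance Pareto frontier.
# A model is on the frontier if nothing cheaper scores higher.

def pareto_frontier(models):
    """models: {name: (price, score)}; return frontier names, cheapest first."""
    frontier = []
    best_score = float("-inf")
    for name, (price, score) in sorted(models.items(), key=lambda kv: kv[1][0]):
        if score > best_score:  # strictly better than everything cheaper
            frontier.append(name)
            best_score = score
    return frontier

# Placeholder (price per query, Arena score) pairs.
models = {
    "flash-lite": (0.1, 1310),
    "flash":      (0.2, 1355),
    "mid":        (1.0, 1340),  # dominated: pricier and weaker than "flash"
    "frontier":   (5.0, 1400),
}
frontier_models = pareto_frontier(models)
# "mid" is dominated and drops out; the other three form the frontier
```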