lmarena.ai (formerly lmsys.org)

@lmarena_ai

Mar 27

News: the latest ChatGPT-4o (2025-03-26) jumps to #2 on Arena, surpassing GPT-4.5!

Highlights
- Significant improvement over the January version (+30 pts, #5->#2)
- Tied for #1 in Coding and Hard Prompts; top 2 across ALL categories
- Matches or surpasses GPT-4.5 at a 10x lower price

Congrats @OpenAI for the new milestone! View below for more insights ⬇️

We saw clear leaps over the previous ChatGPT-4o release:

- Math #14 -> #2
- Hard Prompts #7 -> #1
- Coding #5 -> #1

Check out the ChatGPT-4o-latest and all frontier models at: lmarena.ai

Mar 25

BREAKING: Gemini 2.5 Pro is now #1 on the Arena leaderboard - the largest score jump ever (+40 pts vs Grok-3/GPT-4.5)! 🏆
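For a rough sense of what a score gap like this means head-to-head, here is a minimal sketch. It assumes Arena scores sit on an Elo-style logistic scale (base 10, 400-point scale), which this post does not state; the function name and defaults are illustrative only.

```python
def expected_win_rate(score_gap: float, scale: float = 400.0, base: float = 10.0) -> float:
    """Expected win rate of the higher-scored model under an Elo-style logistic curve.

    Assumption: the base-10 / 400-point convention applies to these scores.
    """
    return 1.0 / (1.0 + base ** (-score_gap / scale))


# Score gaps of the size quoted in these announcements.
for gap in (20, 30, 40):
    print(f"+{gap} pts -> {expected_win_rate(gap):.1%} expected win rate")
```

Under that assumption, a +40 gap works out to roughly a 56% expected win rate against the runner-up, so even a record jump still leaves plenty of close battles.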

Tested under codename "nebula"🌌, Gemini 2.5 Pro ranked #1🥇 across ALL categories and UNIQUELY #1 in Math, Creative Writing, Instruction Following, Longer Query, and Multi-Turn!

Massive congrats to @GoogleDeepMind for this incredible Arena milestone! 🙌

More highlights in thread👇

Gemini 2.5 Pro is #1 across ALL categories: tied for #1 with Grok-3/GPT-4.5 in Hard Prompts and Coding, and edging them out in all others to take the lead 🏇🏆

Gemini 2.5 Pro ranked #1 on the Vision Arena 🖼️ leaderboard!

Feb 26

Introducing Prompt-to-leaderboard (P2L): a real-time LLM leaderboard tailored exactly to your use case!

P2L trains an LLM to generate "prompt-specific" leaderboards, so you can input a prompt and get a leaderboard specifically for that prompt.

The model is trained on the 2M human preference votes from Chatbot Arena.
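To make "prompt-specific leaderboard" concrete, here is a minimal sketch. It assumes the trained model emits one Bradley-Terry-style strength score per candidate model for a given prompt, which this post does not spell out; the model names and numbers are made up for illustration.

```python
import math


def leaderboard(coefs: dict[str, float]) -> list[tuple[str, float]]:
    """Rank models by their per-prompt strength score, highest first."""
    return sorted(coefs.items(), key=lambda kv: kv[1], reverse=True)


def win_prob(coefs: dict[str, float], a: str, b: str) -> float:
    """Bradley-Terry probability that model `a` beats model `b` on this prompt."""
    return 1.0 / (1.0 + math.exp(coefs[b] - coefs[a]))


# Hypothetical scores a P2L-style model might return for ONE prompt.
coefs = {"model-a": 0.8, "model-b": 0.1, "model-c": -0.4}

board = leaderboard(coefs)
for rank, (name, score) in enumerate(board, start=1):
    print(rank, name, score)
```

A different prompt would yield a different `coefs` dictionary, and therefore a different ranking.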

P2L Highlights:
🔹Instant leaderboard for any prompt 🗿
🔹Optimal model routing (hit #1 on Chatbot Arena in Jan 2025 with 1395 score 🧏)
🔹Fine-grained model strength & weakness analysis 🤓

Check out our demo and thread below for more details!

Use case 1: Optimal Routing

If we know which models are best per-prompt, that makes optimal routing easy!

- Performance: P2L-router (experimental-router-0112) was #1 on Chatbot Arena in Jan 2025 with a score of 1395 (+20 over the best candidate model)
- We also developed a cost-constrained P2L that reaches the Pareto frontier
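The routing idea above can be sketched as a per-prompt argmax. This is an illustrative simplification, not the P2L router itself: the cost cap is a plain filter, the actual cost-constrained method is not described in this post, and all names and numbers are hypothetical.

```python
from typing import Optional


def route(
    scores: dict[str, float],
    costs: Optional[dict[str, float]] = None,
    budget: Optional[float] = None,
) -> str:
    """Pick the model with the highest predicted score for one prompt.

    If `budget` is given, only models whose cost is within the budget are eligible.
    """
    if budget is None:
        eligible = scores
    else:
        if costs is None:
            raise ValueError("costs are required when a budget is set")
        eligible = {m: s for m, s in scores.items() if costs[m] <= budget}
    if not eligible:
        raise ValueError("no model fits the budget")
    return max(eligible, key=eligible.get)


# Hypothetical per-prompt scores and per-request costs.
scores = {"model-a": 0.8, "model-b": 0.5, "model-c": 0.1}
costs = {"model-a": 10.0, "model-b": 2.0, "model-c": 0.5}

print(route(scores))                     # unconstrained pick
print(route(scores, costs, budget=3.0))  # best model within the cost cap
```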

Use case 2: Domain-Specific Leaderboards

P2L can aggregate rankings of prompts within a category to produce an adaptive category ranking →

e.g., Find the best models for SQL queries instantly!
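One simple way to picture the aggregation step is to average per-prompt scores over every prompt in a category. This is a sketch under that assumption; the post does not say how P2L aggregates, and the prompts, model names, and scores below are hypothetical.

```python
from collections import defaultdict


def category_leaderboard(per_prompt: list[dict[str, float]]) -> list[tuple[str, float]]:
    """Average each model's per-prompt score over a category, then rank.

    Assumes every prompt's dictionary covers the same set of models.
    """
    totals: dict[str, float] = defaultdict(float)
    for coefs in per_prompt:
        for model, score in coefs.items():
            totals[model] += score
    n = len(per_prompt)
    means = ((model, total / n) for model, total in totals.items())
    return sorted(means, key=lambda kv: kv[1], reverse=True)


# Hypothetical per-prompt scores for two SQL-style prompts.
sql_prompts = [
    {"model-a": 1.0, "model-b": 0.0},
    {"model-a": 0.0, "model-b": 0.5},
]
print(category_leaderboard(sql_prompts))
```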

Feb 18

BREAKING: @xAI's early version of Grok-3 (codename "chocolate") is now #1 in Arena! 🏆

Grok-3 is:
- First-ever model to break 1400 score!
- #1 across all categories, a milestone that keeps getting harder to achieve

Huge congratulations to @xAI on this milestone! View thread 🧵 for more insights into Grok-3's performance after ~8K votes in the Arena.

Here you can see @xai Grok-3’s performance across all the top categories:
🔹 Overall w/ Style Control
🔹 Hard Prompts & Hard Prompts w/ Style Control
🔹 Coding
🔹 Math
🔹 Creative Writing
🔹 Instruction Following
🔹 Longer Query
🔹 Multi-Turn

In the Coding category, Grok-3 surpassed top reasoning models like o1 and Gemini-thinking.

Feb 7

Introducing Arena-Price Plot! 💰📊

An interactive plot of price vs. performance trade-offs for LLMs.

Frontier efficiency models:
🔹 Gemini-2.0-Flash/Lite by @GoogleDeepMind
🔹 DeepSeek-R1 by @deepseek_ai
🔹 GPT-4o by @OpenAI
🔹 Yi-Lightning by @01AI_Yi
🔹 Ministral 8B by @MistralAI

LLM efficiency is accelerating—kudos to the labs driving the frontier!
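The "frontier" in a price vs. performance plot is the set of models that no other model beats on both axes. Below is a minimal sketch of that computation; the model names, prices, and scores are placeholders, not Arena data.

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return the models not dominated on (lower price, higher score).

    `models` maps a name to (price, score). Scanning from cheapest to most
    expensive, a model is kept only if it scores strictly higher than every
    cheaper (or equally priced) model seen so far.
    """
    frontier = []
    best_score = float("-inf")
    by_price = sorted(models.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    for name, (_price, score) in by_price:
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Placeholder (price, score) pairs for illustration only.
models = {
    "model-a": (10.0, 1300.0),
    "model-b": (1.0, 1250.0),
    "model-c": (5.0, 1200.0),  # costs more than model-b but scores lower
    "model-d": (0.2, 1100.0),
}
print(pareto_frontier(models))
```

Here `model-c` drops out because `model-b` is both cheaper and higher scoring; the other three each offer a trade-off that nothing else strictly beats.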

You can select organizations to highlight their models in Arena-Price Plot.

You can select a specific category (e.g., Coding, Math, Creative Writing, …) for your tasks.

Jan 24

Breaking News: DeepSeek-R1 surges into the top 3 in Arena🐳!

Now ranked #3 Overall, matching the top reasoning model, o1, while being 20x cheaper and open-weight!

Highlights:
- #1 in technical domains: Hard Prompts, Coding, Math
- Joint #1 under Style Control
- MIT-licensed

A massive congrats to @deepseek_ai for this incredible milestone and gift to the community! More analysis below 👇

In Hard Prompts with Style Control, DeepSeek-R1 ranked joint #1 with o1.

Early results show DeepSeek-R1 is strong across all domains! More votes are being collected for stable rankings.
