Latest Twitter Threads by @arena on Thread Reader App

Aug 7, 2025 • 6 tweets • 2 min read

GPT-5 is here - and it’s #1 across the board.

🥇#1 in Text, WebDev, and Vision Arena
🥇#1 in Hard Prompts, Coding, Math, Creativity, Long Queries, and more

Tested under the codename “summit”, GPT-5 now holds the highest Arena score to date.

Huge congrats to @OpenAI on this record-breaking achievement!

In WebDev Arena, GPT-5 sets a new record:

+75 pts over Gemini 2.5 Pro
+100 pts over Claude Opus 4

The best model to date for real-world coding.
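For a rough sense of what these point gaps mean head-to-head, here is a minimal sketch. It assumes Arena scores sit on the standard Elo-style logistic scale (base 10, 400 points per factor of ten in odds). That scale is an assumption made for illustration, not something stated in the thread, and `expected_win_rate` is a hypothetical helper that ignores ties.

```python
def expected_win_rate(score_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated model under an
    Elo-style logistic curve (assumed scale, ties ignored)."""
    return 1.0 / (1.0 + 10.0 ** (-score_gap / scale))


# Gaps quoted in the thread for WebDev Arena.
for rival, gap in [("Gemini 2.5 Pro", 75), ("Claude Opus 4", 100)]:
    rate = expected_win_rate(gap)
    print(f"+{gap} pts over {rival}: ~{rate:.0%} expected win rate")
```

Under this assumption, a gap of zero maps to an even 50% split, and the win rate grows smoothly with the gap rather than linearly.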

Jun 5, 2025 • 5 tweets • 2 min read

🚨Breaking: New Gemini-2.5-Pro (06-05) takes the #1 spot across all Arenas again!

🥇 #1 in Text, Vision, WebDev
🥇 #1 in Hard, Coding, Math, Creative, Multi-turn, Instruction Following, and Long Queries categories

Huge congrats @GoogleDeepMind!

https://twitter.com/GoogleDeepMind/status/1930656243346976925

New Gemini-2.5-Pro (06-05) ranks #1 in WebDev Arena (+35 pts from previous 2.5 Pro)

May 21, 2025 • 4 tweets • 2 min read

📢We’re excited to share that we’ve raised $100M in seed funding to support LMArena and continue our research on reliable AI. Led by @a16z and UC Investments (@UofCalifornia), we're proud to have the support of those who believe in both the science and the mission.

We’re focused on building a neutral, open, community-driven platform that helps the world understand and improve the performance of AI models on real queries from real users.

Also, big news is coming next week!👀

We're relaunching LMArena with a whole new look built directly with community feedback from the ground up 🧱 Link in thread.

Check out the new UI at beta.lmarena.ai !

This beta addresses a lot of the feedback that we’ve received for over two years from our community: it has fewer errors, a better user experience, better performance, and chat history.

More improvements and new features for the new LMArena UI are coming!

But this funding isn’t just about scaling infrastructure; it’s about supporting the community that makes LMArena possible. We’re doubling down on what matters: improving the diversity of our voters; more methodological research like style control and prompt-to-leaderboard; more modalities; more open data. Our mission remains the same: to bring the best AI models to everyone, and to improve them through real-world community evaluations.

May 20, 2025 • 4 tweets • 2 min read

🚨Breaking from Arena: @GoogleDeepMind's new Gemini-2.5-Flash climbs to #2 overall in chat, a major jump from its April release (#5 → #2)!

Highlights:
- Top-2 across major categories (Hard, Coding, Math)
- #3 in WebDev Arena, #2 in Vision Arena
- New model at the cost-performance Pareto frontier

AI progress continues to accelerate🔥Congrats again @GoogleDeepMind!

New Gemini-2.5-Flash ranks #2 across major categories (Hard, Coding, Math)

Mar 27, 2025 • 4 tweets • 2 min read

News: the latest ChatGPT-4o (2025-03-26) jumps to #2 on Arena, surpassing GPT-4.5!

Highlights
- Significant improvement over the January version (+30 pts, #5->#2)
- Tied #1 in Coding, Hard Prompts. Top-2 across ALL categories
- Matches or surpasses GPT-4.5 at a 10x lower price.

Congrats to @OpenAI on the new milestone! View below for more insights ⬇️

We saw clear leaps in improvements over the previous ChatGPT-4o release:

- Math #14 -> #2
- Hard Prompts #7 -> #1
- Coding #5 -> #1

Mar 25, 2025 • 5 tweets • 2 min read

BREAKING: Gemini 2.5 Pro is now #1 on the Arena leaderboard - the largest score jump ever (+40 pts vs Grok-3/GPT-4.5)! 🏆

Tested under codename "nebula"🌌, Gemini 2.5 Pro ranked #1🥇 across ALL categories and UNIQUELY #1 in Math, Creative Writing, Instruction Following, Longer Query, and Multi-Turn!

Massive congrats to @GoogleDeepMind for this incredible Arena milestone! 🙌

More highlights in thread👇

Gemini 2.5 Pro #1 across ALL categories, tied #1 with Grok-3/GPT-4.5 for Hard Prompts and Coding, and edged out across all others to take the lead 🏇🏆

Mar 12, 2025 • 5 tweets • 2 min read

🎉 Congrats to @GoogleDeepMind on Gemma-3-27B, the newest and one of the strongest open models in Arena!

💠 Top 10 overall - beating out many proprietary models with only 27B parameters
💠 2nd-best open model, behind only DeepSeek-R1
💠 128K context window

Check out their blog to learn more about Gemma 3. We can't wait to see where this goes next! 🔥👏

How does Gemma-3 stack up across different categories?
💠 Creative writing #6
💠 Math, Coding, Hard Prompts #13-#15

Feb 26, 2025 • 6 tweets • 4 min read

Introducing Prompt-to-leaderboard (P2L): a real-time LLM leaderboard tailored exactly to your use case!

P2L trains an LLM to generate "prompt-specific" leaderboards, so you can input a prompt and get a leaderboard specifically for that prompt.

The model is trained on the 2M human preference votes from Chatbot Arena.

P2L Highlights:
🔹Instant leaderboard for any prompt 🗿
🔹Optimal model routing (hit #1 on Chatbot Arena in Jan 2025 with 1395 score 🧏)
🔹Fine-grained model strength & weakness analysis 🤓

Check out our demo and thread below for more details!

Use case 1: Optimal Routing

If we know which models are best per-prompt, that makes optimal routing easy!

- Performance: P2L-router (experimental-router-0112) is #1 on Chatbot Arena in Jan 2025 with a score of 1395 (+20 over the best candidate model).
- We also develop a cost-constrained P2L that reaches the Pareto frontier.
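The routing idea can be sketched in a few lines. This is a simplified stand-in, not the P2L method itself: it assumes a per-prompt leaderboard is already available as a dictionary of predicted scores, and it handles cost with a plain price cap. The `route` function, the model names, and all numbers are hypothetical.

```python
from typing import Dict, Optional


def route(scores: Dict[str, float],
          cost: Optional[Dict[str, float]] = None,
          budget: Optional[float] = None) -> str:
    """Pick the model with the highest predicted score for this prompt,
    optionally restricted to models whose cost fits the budget."""
    candidates = list(scores)
    if cost is not None and budget is not None:
        candidates = [m for m in candidates if cost[m] <= budget]
    if not candidates:
        raise ValueError("no model fits the budget")
    return max(candidates, key=lambda m: scores[m])


# Hypothetical per-prompt leaderboard and prices (placeholders only).
prompt_scores = {"model-a": 1310.0, "model-b": 1285.0, "model-c": 1240.0}
price_per_call = {"model-a": 10.0, "model-b": 2.5, "model-c": 0.3}

print(route(prompt_scores))                             # unconstrained pick
print(route(prompt_scores, price_per_call, budget=3.0)) # pick under a price cap
```

A hard per-prompt cap is the simplest way to express a cost constraint; a router that trades cost against score across many prompts would need a budget over the whole workload instead.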

Feb 18, 2025 • 5 tweets • 2 min read

BREAKING: @xAI's early version of Grok-3 (codename "chocolate") is now #1 in Arena! 🏆

Grok-3 is:
- First-ever model to break 1400 score!
- #1 across all categories, a milestone that keeps getting harder to achieve

Huge congratulations to @xAI on this milestone! View thread 🧵 for more insights into Grok-3's performance after ~8K votes in the Arena.

Here you can see @xai Grok-3’s performance across all the top categories:
🔹 Overall w/ Style Control
🔹 Hard Prompts & Hard Prompts w/ Style Control
🔹 Coding
🔹 Math
🔹 Creative Writing
🔹 Instruction Following
🔹 Longer Query
🔹 Multi-Turn

Feb 7, 2025 • 6 tweets • 2 min read

Introducing Arena-Price Plot! 💰📊

An interactive plot of price vs. performance trade-offs for LLMs.

Frontier efficiency models:
🔹 Gemini-2.0-Flash/Lite by @GoogleDeepMind
🔹 DeepSeek-R1 by @deepseek_ai
🔹 GPT-4o by @OpenAI
🔹 Yi-Lightning by @01AI_Yi
🔹 Ministral 8B by @MistralAI

LLM efficiency is accelerating—kudos to the labs driving the frontier!

You can select organizations to highlight their models in Arena-Price Plot.
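To make "frontier" concrete, here is a minimal sketch of how a price-performance Pareto frontier can be computed: a model is on the frontier if no other model is both cheaper (or equally priced) and higher scoring. The `pareto_frontier` helper and the sample models, prices, and scores are hypothetical and are not taken from the Arena-Price Plot.

```python
from typing import List, Tuple

Model = Tuple[str, float, float]  # (name, price, score)


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Keep models that no other model beats on both price (lower is
    better) and score (higher is better)."""
    frontier: List[Model] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal price, best score first.
    for name, price, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, price, score))
            best_score = score
    return frontier


# Hypothetical (name, price, score) entries.
sample = [
    ("A", 0.1, 1200.0),
    ("B", 0.5, 1180.0),   # pricier and weaker than A: dominated
    ("C", 1.0, 1300.0),
    ("D", 5.0, 1290.0),   # pricier and weaker than C: dominated
    ("E", 10.0, 1350.0),
]
print([name for name, _, _ in pareto_frontier(sample)])
```

Sorting by price first means a single pass suffices: each model only needs to beat the best score seen among cheaper models.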

Jan 24, 2025 • 6 tweets • 2 min read

Breaking News: DeepSeek-R1 surges to the top-3 in Arena🐳!

Now ranked #3 Overall, matching the top reasoning model, o1, while being 20x cheaper and open-weight!

Highlights:
- #1 in technical domains: Hard Prompts, Coding, Math
- Joint #1 under Style Control
- MIT-licensed

A massive congrats to @deepseek_ai for this incredible milestone and gift to the community! More analysis below 👇

In Hard Prompt with Style Control, DeepSeek-R1 ranked joint #1 with o1.

Jan 6, 2025 • 6 tweets • 2 min read

🎨Text-to-Image Arena Leaderboard is now live with 40K+ community votes!

Top Models:
- #1. Recraft V3
- #2. Ideogram 2.0
- #3. FLUX1.1 [pro]
- #3. Luma Photon
- #5. DALL·E 3
- #5. FLUX.1 [dev]
- #7. Stable Diffusion 3.5 Large

Congrats to @recraftai @ideogram_ai @bfl_ml @LumaLabsAI for claiming the top spots! Which model would you like to see next? Comment to let us know.

More analysis and leaderboard link below👇

The Overall leaderboard combines both User Prompts and Pre-generated Prompts, and we also provide a breakdown by category.

In the User Prompts Only breakdown (74% of total votes), Ideogram 2.0 jumps to #1!

Nov 20, 2024 • 5 tweets • 2 min read

Exciting News from Chatbot Arena❤️‍🔥

Over the past week, the latest @OpenAI ChatGPT-4o (20241120) competed anonymously as "anonymous-chatbot", gathering 8,000+ community votes.

The result? OpenAI reclaims the #1 spot, surpassing Gemini-Exp-1114 with an impressive 1361 score!

Latest GPT-4o shows remarkable improvements – we see a leap in creative writing (1365 → 1402) as well as technical domains (e.g., coding, math).

Category Rankings:

- Overall: #2 → #1
- Overall (StyleCtrl): #2 → #1
- Creative Writing: #2 → #1
- Coding: #2 → #1
- Math: #4 → #3
- Hard: #2 → #1

Huge congrats @OpenAI! More analysis below👇

The latest ChatGPT-4o remains #1 with Style Control and improves across the board.

Nov 14, 2024 • 8 tweets • 3 min read

Massive News from Chatbot Arena🔥

@GoogleDeepMind's latest Gemini (Exp 1114), tested with 6K+ community votes over the past week, now ranks joint #1 overall with an impressive 40+ score leap, matching 4o-latest and surpassing o1-preview! It also claims #1 on the Vision leaderboard.

Gemini-Exp-1114 excels across technical and creative domains:

- Overall #3 -> #1
- Math: #3 -> #1
- Hard Prompts: #4 -> #1
- Creative Writing #2 -> #1
- Vision: #2 -> #1
- Coding: #5 -> #3
- Overall (StyleCtrl): #4 -> #4

Huge congrats to @GoogleDeepMind on this remarkable milestone!

Come try the new Gemini and share your feedback!

Gemini-Exp-1114 joint #1 in Math Arena, matching o1 performance!

Nov 12, 2024 • 7 tweets • 3 min read

Which model is best for coding? @CopilotArena leaderboard is out!

Our code completions leaderboard contains data collected over the last month, with >100K completions served and >10K votes!

Let’s discuss our findings so far🧵

Here are our main takeaways from the leaderboard:

- With our prompting method, Sonnet-3.5 is able to compete with code-specific models like Deepseek V2.5 on code completion.
- Within a tier, we still observe slight fluctuations as we obtain more votes.
- We find that GPT-4o-mini is much worse than all other models.

2/n

Oct 16, 2024 • 7 tweets • 3 min read

Introducing Copilot Arena - Interactive coding evaluation in the wild.

Our extension lets you test top models for free, right in VSCode. Let's vote and build the Copilot leaderboard!

Download here: marketplace.visualstudio.com/items?itemName…

Led by @iamwaynechi and @valeriechen_ at CMU. 1/🧵

2/ What is Copilot Arena?

Unlike traditional code completion, Copilot Arena suggests paired completions from different LLMs, including GPT-4o, Claude-3.5, Gemini, Llama-3.1 and more.

Together let's build an LLM leaderboard for code. The more votes, the better the leaderboard!

Sep 27, 2024 • 4 tweets • 2 min read

Exciting update from Vision Chatbot Arena!

We’ve gathered over 6K new votes for the latest open vision models (Qwen2, Llama 3.2, Pixtral) and ChatGPT-4o.

- ChatGPT-4o has taken the #1 spot, surpassing Gemini.
- Open models (Qwen, Llama 3.2, Pixtral) are rapidly improving, matching proprietary offerings

Competition in vision is heating up. Cast your vote and help decide the best vision model!

Full leaderboard link below👇

Link:

Vision leaderboard: confidence intervals plot lmarena.ai/leaderboard

Aug 23, 2024 • 6 tweets • 2 min read

Chatbot Arena update❤️‍🔥

Exciting news—@xAI's Grok-2 and Grok-mini are now officially on the leaderboard!

With over 6000 community votes, Grok-2 has claimed the #2 spot, surpassing GPT-4o (May) and tying with the latest Gemini! Grok-2-mini also impresses at #5.

Grok-2 excels in Math (#1) and ranks #2 across the board (Hard Prompts, Coding, Instruction-following). More plot analysis in the 2nd post👇

Huge congratulations to @xAI on this remarkable achievement!

We’re rolling out a new "Overview" feature for the leaderboard. @xAI's Grok-2 stands out, ranking at the top across all categories—Math, Hard Prompts, Coding, and Instruction-following.

Aug 1, 2024 • 5 tweets • 3 min read

Exciting News from Chatbot Arena!

@GoogleDeepMind's new Gemini 1.5 Pro (Experimental 0801) has been tested in Arena for the past week, gathering over 12K community votes.

For the first time, Google Gemini has claimed the #1 spot, surpassing GPT-4o/Claude-3.5 with an impressive score of 1300 (!), and also achieving #1 on our Vision Leaderboard.

Gemini 1.5 Pro (0801) excels in multi-lingual tasks and delivers robust performance in technical areas like Math, Hard Prompts, and Coding.

Huge congrats to @GoogleDeepMind on this remarkable milestone!

Gemini (0801) Category Rankings:
- Overall: #1
- Math: #1-3
- Instruction-Following: #1-2
- Coding: #3-5
- Hard Prompts (English): #2-5

Come try the model and let us know your feedback!
More analysis below👇

Gemini 1.5 Pro (Experimental 0801) #1 on Vision Leaderboard.

Jul 1, 2024 • 6 tweets • 4 min read

Not all questions need GPT-4!

We introduce RouteLLM – a routing framework based on human preference data that directs simple queries to a cheaper model.

With data augmentation techniques, RouteLLM achieves cost reductions of over 85% on MT Bench and 45% on MMLU while maintaining 95% of GPT-4's performance.

Compared against commercial offerings (Martian and Unify AI) on MT Bench, our routers achieve the same performance while being over 40% cheaper.

Our model, datasets, and code for serving and evaluating LLM routers are all open-sourced. We are excited to see what the community will build on top!

1/6 (blog post & links👇)

With public data from Chatbot Arena, we trained four different routers using data augmentation techniques to significantly improve router performance.

By routing between GPT-4 and Mixtral-8x7B, we demonstrate cost reductions of over 85% on MT Bench and 45% on MMLU while achieving 95% of GPT-4's performance.

2/6
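The basic routing decision can be sketched as a threshold on how likely the strong model is to be needed. This is an illustration under stated assumptions, not the RouteLLM code or API: `make_router` and `toy_win_prob` are hypothetical, and the word-count heuristic is a placeholder for a router learned from preference data.

```python
from typing import Callable, List


def make_router(win_prob: Callable[[str], float],
                threshold: float) -> Callable[[str], str]:
    """Build a router that sends a prompt to the strong model only when
    the predicted need for it reaches the threshold."""
    def route(prompt: str) -> str:
        return "strong" if win_prob(prompt) >= threshold else "weak"
    return route


def toy_win_prob(prompt: str) -> float:
    """Placeholder predictor: treats longer prompts as harder.
    A real router would be trained on human preference data."""
    return min(1.0, len(prompt.split()) / 20.0)


prompts: List[str] = [
    "What is 2 + 2?",
    "Prove that the sum of two even integers is even, then generalize "
    "the argument to multiples of any integer k and state the result precisely.",
]

router = make_router(toy_win_prob, threshold=0.5)
choices = [router(p) for p in prompts]
strong_share = choices.count("strong") / len(choices)
print(choices, f"strong-model share: {strong_share:.0%}")
```

Raising the threshold sends fewer prompts to the strong model and lowers cost; lowering it does the opposite, which is the knob behind the cost-versus-quality trade-off described above.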

May 9, 2024 • 8 tweets • 3 min read

Exciting new blog -- What’s up with Llama-3?

Since Llama 3’s release, it has quickly jumped to the top of the leaderboard. We dive into our data and answer the questions below:

- What are users asking? When do users prefer Llama 3?
- How challenging are the prompts?
- Are certain users or prompts over-represented?
- Does Llama 3 have qualitative differences that make users like it?

Key Insights:
1. Llama 3 beats top-tier models on open-ended writing and creative problems but loses a bit on close-ended math and coding problems.

2. As prompts get more challenging*, the gap between Llama 3 and top-tier models widens.

* We define challenging using several criteria like complexity, problem-solving, domain knowledge, and more.
