lmarena.ai (formerly lmsys.org)
Aug 1, 2024
Exciting News from Chatbot Arena!

@GoogleDeepMind's new Gemini 1.5 Pro (Experimental 0801) has been tested in Arena for the past week, gathering over 12K community votes.

For the first time, Google Gemini has claimed the #1 spot, surpassing GPT-4o/Claude-3.5 with an impressive score of 1300 (!), and also achieving #1 on our Vision Leaderboard.

Gemini 1.5 Pro (0801) excels in multi-lingual tasks and delivers robust performance in technical areas like Math, Hard Prompts, and Coding.

Huge congrats to @GoogleDeepMind on this remarkable milestone!

Gemini (0801) Category Rankings:
- Overall: #1
- Math: #1-3
- Instruction-Following: #1-2
- Coding: #3-5
- Hard Prompts (English): #2-5

Come try the model and let us know your feedback!
More analysis below👇
Gemini 1.5 Pro (Experimental 0801) #1 on Vision Leaderboard.
Gemini shows strong multilingual capability: #1 performance in Chinese, Japanese, German, Russian.
But in technical domains like Coding and Hard Prompts, Claude 3.5 Sonnet, GPT-4o, and Llama 405B are still leading the way.
Overall win-rate heatmap: Gemini 1.5 Pro (0801) wins 54% vs GPT-4o, 59% vs Claude-3.5-Sonnet.

Check out full data at leaderboard.lmsys.org and come chat with the model!
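As a rough aid for reading head-to-head win rates like these: if one assumes the standard Elo-style logistic link between rating gap and win probability (base 10, scale 400), a win rate can be converted into an implied rating gap and back. This is a sketch under that assumption, not the leaderboard's fitting code. The function names are illustrative, and published scores are fit over all model pairs, so implied gaps need not match leaderboard differences.

```python
import math

# Assumed Elo-style constants (not stated in the thread).
SCALE = 400.0
BASE = 10.0


def win_probability(rating_gap: float) -> float:
    """Expected win probability for the model that is `rating_gap` points ahead."""
    return 1.0 / (1.0 + BASE ** (-rating_gap / SCALE))


def implied_rating_gap(win_rate: float) -> float:
    """Inverse of win_probability: rating gap implied by a head-to-head win rate."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return SCALE * math.log(win_rate / (1.0 - win_rate), BASE)


if __name__ == "__main__":
    # Win rates quoted in the thread for Gemini 1.5 Pro (0801).
    for opponent, rate in [("GPT-4o", 0.54), ("Claude-3.5-Sonnet", 0.59)]:
        gap = implied_rating_gap(rate)
        print(f"{rate:.0%} vs {opponent}: implied gap of about {gap:.0f} points")
```

Under this assumed link, 54% and 59% correspond to gaps of roughly 28 and 63 points respectively.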

More from @lmarena_ai

Feb 26
Introducing Prompt-to-leaderboard (P2L): a real-time LLM leaderboard tailored exactly to your use case!

P2L trains an LLM to generate "prompt-specific" leaderboards, so you can input a prompt and get a leaderboard specifically for that prompt.

The model is trained on the 2M human preference votes from Chatbot Arena.

P2L Highlights:
🔹Instant leaderboard for any prompt 🗿
🔹Optimal model routing (hit #1 on Chatbot Arena in Jan 2025 with 1395 score 🧏)
🔹Fine-grained model strength & weakness analysis 🤓

Check out our demo and thread below for more details!
Use case 1: Optimal Routing

If we know which models are best per-prompt, that makes optimal routing easy!

- Performance: P2L-router (experimental-router-0112) is #1 on Chatbot Arena in Jan 2025 with a score of 1395 (+20 over the best candidate model).
- We also develop a cost-constrained P2L that reaches the Pareto frontier.
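To make the routing idea concrete, here is a minimal sketch of per-prompt routing, assuming a per-prompt leaderboard is already available as a mapping from model name to score. The `route` function, the model names, the scores, and the hard per-prompt cost cap are all hypothetical simplifications for illustration; this is not the P2L implementation, and the thread does not describe how its cost constraint is enforced.

```python
from typing import Dict, Optional


def route(
    scores: Dict[str, float],
    costs: Dict[str, float],
    budget: Optional[float] = None,
) -> str:
    """Pick the highest-scoring model for one prompt, optionally under a cost cap.

    scores: hypothetical per-prompt leaderboard (model -> predicted score).
    costs:  hypothetical per-query cost for each model.
    budget: if given, only models with cost <= budget are eligible.
    """
    eligible = [m for m in scores if budget is None or costs[m] <= budget]
    if not eligible:
        raise ValueError("no model fits within the budget")
    return max(eligible, key=lambda m: scores[m])


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    prompt_scores = {"model-a": 1300.0, "model-b": 1350.0, "model-c": 1280.0}
    model_costs = {"model-a": 1.0, "model-b": 5.0, "model-c": 0.2}
    print(route(prompt_scores, model_costs))              # unconstrained choice
    print(route(prompt_scores, model_costs, budget=1.0))  # choice under a cap
```

The unconstrained call simply takes the top model for the prompt; adding a budget trades some score for lower cost.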
Use case 2: Domain-Specific Leaderboards

P2L can aggregate rankings of prompts within a category to produce an adaptive category ranking →

e.g., Find the best models for SQL queries instantly!
Feb 18
BREAKING: @xAI's early version of Grok-3 (codename "chocolate") is now #1 in Arena! 🏆

Grok-3 is:
- First-ever model to break 1400 score!
- #1 across all categories, a milestone that keeps getting harder to achieve

Huge congratulations to @xAI on this milestone! View thread 🧵 for more insights into Grok-3's performance after ~8K votes in the Arena.
Here you can see @xai Grok-3’s performance across all the top categories:
🔹 Overall w/ Style Control
🔹 Hard Prompts & Hard Prompt w/ Style Control
🔹 Coding
🔹 Math
🔹 Creative Writing
🔹 Instruction Following
🔹 Longer Query
🔹 Multi-Turn
In the Coding category, Grok-3 surpassed top reasoning models like o1 and Gemini-thinking.
Feb 7
Introducing Arena-Price Plot! 💰📊

An interactive plot of price vs. performance trade-offs for LLMs.

Frontier efficiency models:
🔹 Gemini-2.0-Flash/Lite by @GoogleDeepMind
🔹 DeepSeek-R1 by @deepseek_ai
🔹 GPT-4o by @OpenAI
🔹 Yi-Lightning by @01AI_Yi
🔹 Ministral 8B by @MistralAI

LLM efficiency is accelerating. Kudos to the labs driving the frontier!
You can select organizations to highlight their models in Arena-Price Plot.
You can select a specific category (e.g., Coding, Math, Creative Writing, …) for your tasks.
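For readers who want to reproduce the notion of a price vs. performance frontier on their own data, here is a minimal sketch. It assumes each model is a (name, price, score) tuple and treats a model as on the frontier when no other model is both at least as cheap and at least as strong. The function name and all numbers are hypothetical and are not taken from the Arena-Price Plot.

```python
from typing import List, Tuple

Model = Tuple[str, float, float]  # (name, price, arena_score)


def pareto_frontier(models: List[Model]) -> List[str]:
    """Return names of models not dominated on (lower price, higher score).

    Sort by price ascending (ties: higher score first), then keep each model
    whose score strictly exceeds every cheaper-or-equal model seen so far.
    Exact duplicates of a frontier point are dropped.
    """
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier: List[str] = []
    best_score = float("-inf")
    for name, _price, score in ordered:
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


if __name__ == "__main__":
    # Made-up prices and scores, purely for illustration.
    sample = [
        ("model-a", 1.0, 1200.0),
        ("model-b", 2.0, 1150.0),  # dominated by model-a
        ("model-c", 3.0, 1300.0),
        ("model-d", 0.5, 1100.0),
    ]
    print(pareto_frontier(sample))
```

The result is ordered from cheapest to most expensive, which matches how such a frontier is usually read left to right on a price axis.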
Jan 24
Breaking News: DeepSeek-R1 surges to the top-3 in Arena🐳!

Now ranked #3 Overall, matching the top reasoning model, o1, while being 20x cheaper and open-weight!

Highlights:
- #1 in technical domains: Hard Prompts, Coding, Math
- Joint #1 under Style Control
- MIT-licensed

A massive congrats to @deepseek_ai for this incredible milestone and gift to the community! More analysis below 👇
In Hard Prompts with Style Control, DeepSeek-R1 ranked joint #1 with o1.
Early results show DeepSeek-R1 is strong across all domains! More votes are being collected for stable rankings.
Jan 6
🎨Text-to-Image Arena Leaderboard is now live with 40K+ community votes!

Top Models:
- #1. Recraft V3
- #2. Ideogram 2.0
- #3. FLUX1.1 [pro]
- #3. Luma Photon
- #5. DALL·E 3
- #5. FLUX.1 [dev]
- #7. Stable Diffusion 3.5 Large

Congrats to @recraftai @ideogram_ai @bfl_ml @LumaLabsAI for claiming the top spots! Which model would you like to see next? Comment to let us know.

More analysis and leaderboard link below👇
The Overall leaderboard combines both User Prompts and Pre-generated Prompts, and we also provide a breakdown by category.

In the User Prompts Only breakdown (74% of total votes), Ideogram 2.0 jumps to #1!
Interestingly, model rankings shift in the Pre-generated Prompts breakdown (26% of total votes): FLUX and Stable Diffusion improve significantly, while DALL·E and Ideogram drop.

Our preset prompts are taken from:
huggingface.co/datasets/data-…
Nov 20, 2024
Exciting News from Chatbot Arena❤️‍🔥

Over the past week, the latest @OpenAI ChatGPT-4o (20241120) competed anonymously as "anonymous-chatbot", gathering 8,000+ community votes.

The result? OpenAI reclaims the #1 spot, surpassing Gemini-Exp-1114 with an impressive 1361 score!

Latest GPT-4o shows remarkable improvements – we see a leap in creative writing (1365 → 1402) as well as technical domains (e.g., coding, math).

Category Rankings:

- Overall: #2 → #1
- Overall (StyleCtrl): #2 → #1
- Creative Writing: #2 → #1
- Coding: #2 → #1
- Math: #4 → #3
- Hard: #2 → #1

Huge congrats @OpenAI! More analysis below👇
Latest ChatGPT-4o remains #1 with Style Control and improves across the board.
A big leap in creative writing (1365 → 1402)
