Introducing Prompt-to-Leaderboard (P2L): a real-time LLM leaderboard tailored exactly to your use case!
P2L trains an LLM to generate prompt-specific leaderboards: input any prompt and get a model ranking tailored to that exact prompt.
The model is trained on 2M human preference votes from Chatbot Arena.
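Conceptually, this is a prompt-conditioned Bradley-Terry setup: given a prompt, the model emits a coefficient per candidate LLM, and the probability that model A beats model B is a sigmoid of their coefficient difference. A minimal sketch of that idea — the model names and coefficient values below are made up for illustration, not output of the real P2L model:

```python
import math

def win_prob(theta_a: float, theta_b: float) -> float:
    """Bradley-Terry: P(model a beats model b) = sigmoid(theta_a - theta_b)."""
    return 1.0 / (1.0 + math.exp(-(theta_a - theta_b)))

def prompt_leaderboard(theta: dict[str, float]) -> list[str]:
    """Rank models by their prompt-specific coefficient (higher = better)."""
    return sorted(theta, key=theta.get, reverse=True)

# Hypothetical per-prompt coefficients a trained P2L head might emit
# for a single user prompt.
theta = {"model-a": 1.2, "model-b": 0.4, "model-c": -0.3}
print(prompt_leaderboard(theta))  # → ['model-a', 'model-b', 'model-c']
print(round(win_prob(theta["model-a"], theta["model-b"]), 2))  # → 0.69
```

The same pairwise-preference likelihood underlies the regular Arena leaderboard; the difference is that here the coefficients are a function of the prompt rather than global constants.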
P2L Highlights:
🔹Instant leaderboard for any prompt 🗿
🔹Optimal model routing (hit #1 on Chatbot Arena in Jan 2025 with 1395 score 🧏)
🔹Fine-grained model strength & weakness analysis 🤓
Check out our demo and thread below for more details!
Use case 1: Optimal Routing
If we know which models are best per-prompt, that makes optimal routing easy!
- Performance: P2L-router (experimental-router-0112) is #1 on Chatbot Arena in Jan 2025 with a score of 1395 (+20 over the best single model candidate)
- We also develop cost-constrained P2L routers that trace the cost-performance Pareto frontier
Use case 2: Domain-Specific Leaderboards
P2L can aggregate rankings of prompts within a category to produce an adaptive category ranking →
e.g., Find the best models for SQL queries instantly!
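One straightforward way to aggregate — a sketch, not necessarily the exact aggregation P2L uses — is to average each model's per-prompt coefficients over the prompts in a category and rank by the mean:

```python
from statistics import mean

def category_leaderboard(per_prompt: list[dict[str, float]]) -> list[str]:
    """Average each model's per-prompt scores and rank by the mean."""
    models = per_prompt[0].keys()
    avg = {m: mean(scores[m] for scores in per_prompt) for m in models}
    return sorted(avg, key=avg.get, reverse=True)

# Illustrative per-prompt coefficient vectors for three SQL prompts
sql_prompts = [
    {"model-a": 0.9, "model-b": 1.3, "model-c": 0.1},
    {"model-a": 1.1, "model-b": 1.0, "model-c": 0.2},
    {"model-a": 0.8, "model-b": 1.2, "model-c": 0.0},
]
print(category_leaderboard(sql_prompts))
# → ['model-b', 'model-a', 'model-c']
```

Because the per-prompt scores come from the same model, any slice of prompts — SQL, poetry, a single user's history — yields a leaderboard with no extra voting required.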
Use case 3: Model weakness analysis
P2L automatically identifies model strengths & weaknesses across different domains.
Examples:
- o1-mini dominates in Arithmetic Operations & Calculations
- But struggles in Suspenseful Horror Story writing
BREAKING: @xAI early version of Grok-3 (codename "chocolate") is now #1 in Arena! 🏆
Grok-3 is:
- First model ever to break a 1400 score!
- #1 across all categories, a milestone that keeps getting harder to achieve
Huge congratulations to @xAI on this milestone! View thread 🧵 for more insights into Grok-3's performance after ~8K votes in the Arena.
Here you can see @xai Grok-3’s performance across all the top categories:
🔹 Overall w/ Style Control
🔹 Hard Prompts & Hard Prompt w/ Style Control
🔹 Coding
🔹 Math
🔹 Creative Writing
🔹 Instruction Following
🔹 Longer Query
🔹 Multi-Turn
In the Coding category, Grok-3 surpassed top reasoning models like o1 and Gemini-thinking.
An interactive plot of price vs. performance trade-offs for LLMs.
Frontier efficiency models:
🔹 Gemini-2.0-Flash/Lite by @GoogleDeepMind
🔹 DeepSeek-R1 by @deepseek_ai
🔹 GPT-4o by @OpenAI
🔹 Yi-Lightning by @01AI_Yi
🔹 Ministral 8B by @MistralAI
LLM efficiency is accelerating—kudos to the labs driving the frontier!
You can select organizations to highlight their models in the Arena-Price Plot.
You can also select a specific category (e.g., Coding, Math, Creative Writing, …) for your tasks.
Congrats to @recraftai @ideogram_ai @bfl_ml @LumaLabsAI for claiming the top spots! Which model would you like to see next? Comment to let us know.
More analysis and leaderboard link below👇
The Overall leaderboard combines both User Prompts and Pre-generated Prompts; we also provide a breakdown by category.
In the User Prompts-only breakdown (74% of total votes), Ideogram 2.0 jumps to #1!
Interestingly, model rankings shift in the Pre-generated Prompts breakdown (26% of total votes): FLUX/Stable Diffusion improve significantly, while DALL·E/Ideogram drop.
@GoogleDeepMind's latest Gemini (Exp 1114), tested with 6K+ community votes over the past week, now ranks joint #1 overall with an impressive 40+ score leap — matching 4o-latest and surpassing o1-preview! It also claims #1 on the Vision leaderboard.
Gemini-Exp-1114 excels across technical and creative domains: