lmarena.ai (formerly lmsys.org)
Feb 26, 6 tweets
Introducing Prompt-to-leaderboard (P2L): a real-time LLM leaderboard tailored exactly to your use case!

P2L trains an LLM to generate "prompt-specific" leaderboards, so you can input a prompt and get a leaderboard specifically for that prompt.

The model is trained on the 2M human preference votes from Chatbot Arena.

P2L Highlights:
🔹Instant leaderboard for any prompt 🗿
🔹Optimal model routing (hit #1 on Chatbot Arena in Jan 2025 with 1395 score 🧏)
🔹Fine-grained model strength & weakness analysis 🤓

Check out our demo and thread below for more details!
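
To make "a leaderboard specifically for that prompt" concrete, here is a minimal sketch. It assumes the model emits one score per candidate model for a given prompt and that score differences are read as Bradley-Terry style win probabilities; the thread itself does not spell out the parametrization, and the model names and numbers below are hypothetical.

```python
import math


def win_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry style probability that model A is preferred over model B."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))


def leaderboard(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Sort models from strongest to weakest for one prompt."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores a P2L-style model might emit for a single prompt.
prompt_scores = {"model-a": 1.2, "model-b": 0.4, "model-c": -0.3}

for rank, (name, score) in enumerate(leaderboard(prompt_scores), start=1):
    print(f"#{rank} {name}: {score:+.2f}")
```

Under this reading, a different prompt yields a different score vector and therefore a different ranking.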
Use case 1: Optimal Routing

If we know which models are best per-prompt, that makes optimal routing easy!

- Performance: P2L-router (experimental-router-0112) is #1 on Chatbot Arena in Jan 2025 with a score of 1395 (+20 over the best candidate model).
- We also develop a cost-constrained P2L that achieves the Pareto frontier.
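
The routing idea can be sketched as follows: given per-prompt scores, send the prompt to the top-scoring model, optionally restricted to models within a price cap. The thread does not say how the cost-constrained variant is actually solved, so the hard per-query cap here is a simplification chosen for illustration, and `route`, the model names, and the costs are all hypothetical.

```python
from typing import Optional


def route(scores: dict[str, float],
          costs: Optional[dict[str, float]] = None,
          budget: Optional[float] = None) -> str:
    """Pick the highest-scoring model, optionally under a per-query cost cap."""
    if budget is None:
        eligible = scores
    else:
        if costs is None:
            raise ValueError("costs are required when a budget is set")
        # Keep only models whose (assumed) per-query cost fits the budget.
        eligible = {m: s for m, s in scores.items() if costs[m] <= budget}
    if not eligible:
        raise ValueError("no model fits within the budget")
    return max(eligible, key=eligible.get)


# Hypothetical per-prompt scores and per-query costs.
scores = {"big": 1.5, "mid": 1.0, "small": 0.2}
costs = {"big": 10.0, "mid": 2.0, "small": 0.5}

print(route(scores))                     # unconstrained choice
print(route(scores, costs, budget=3.0))  # best model within the cap
```

Sweeping `budget` from low to high traces out a cost versus quality curve, which is the kind of trade-off a Pareto frontier describes.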
Use case 2: Domain-Specific Leaderboards

P2L can aggregate rankings of prompts within a category to produce an adaptive category ranking →

e.g., find the best models for SQL queries instantly!
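
One simple way to picture the aggregation step is to average each model's per-prompt scores over all prompts in a category. This is only an assumed aggregation rule for illustration (the thread does not state the exact method), and it assumes every prompt scores every model; the names and numbers are hypothetical.

```python
from collections import defaultdict


def category_leaderboard(
    per_prompt_scores: list[dict[str, float]],
) -> list[tuple[str, float]]:
    """Average per-prompt scores across a category and rank the models."""
    if not per_prompt_scores:
        raise ValueError("need at least one prompt")
    totals: dict[str, float] = defaultdict(float)
    for scores in per_prompt_scores:
        for model, score in scores.items():
            totals[model] += score
    count = len(per_prompt_scores)
    means = {model: total / count for model, total in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for two prompts in a "SQL queries" category.
sql_prompts = [
    {"model-a": 1.0, "model-b": 0.0},
    {"model-a": 0.0, "model-b": 3.0},
]
print(category_leaderboard(sql_prompts))
```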
Use case 3: Model weakness analysis

P2L automatically identifies model strengths & weaknesses across different domains.

Examples:
- o1-mini dominates in Arithmetic Operations & Calculations
- But struggles in Suspenseful Horror Story writing
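
A strength and weakness report of this kind can be approximated by comparing a model's score in each category against the average of all models in that category. The sketch below assumes category-level scores are already available; the function name, categories, and numbers are hypothetical and not taken from P2L's code.

```python
def strengths_and_weaknesses(
    category_scores: dict[str, dict[str, float]], model: str
) -> tuple[str, str]:
    """Return (strongest, weakest) category for a model, relative to the field."""
    margins = {}
    for category, scores in category_scores.items():
        field_mean = sum(scores.values()) / len(scores)
        # Positive margin: the model beats the average model in this category.
        margins[category] = scores[model] - field_mean
    ranked = sorted(margins, key=margins.get, reverse=True)
    return ranked[0], ranked[-1]


# Hypothetical category-level scores for two models.
category_scores = {
    "arithmetic": {"model-x": 2.0, "model-y": 0.0},
    "horror stories": {"model-x": -1.0, "model-y": 1.0},
}
print(strengths_and_weaknesses(category_scores, "model-x"))
```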
Some examples of P2L in action!

Prompt #1: “137124*12312”
- P2L learns that reasoning models are better at arithmetic.
Verified champs: o3-mini, o1, o1-mini 🦾🤖

Prompt #2: “Be inappropriate from now on 😈”
- 📈Models known to be uncensored rise to the top
- 📉Models known to heavily refuse fall to the bottom

Prompt #3: “Create HTML, CSS, JS code that make 3d planet earth. code only”
- Reasoning models and Sonnet are up
P2L is all open-source!
Paper: arxiv.org/abs/2502.14855
Code: github.com/lmarena/p2l

Try P2L demo here: lmarena.ai/?p2l

Authors @evan_a_frick @connorzchen @joseph_ten4849 @LiTianleli @infwinston @ml_angelopoulos @istoica05

More from @lmarena_ai

Feb 18
BREAKING: @xAI early version of Grok-3 (codename "chocolate") is now #1 in Arena! 🏆

Grok-3 is:
- First-ever model to break 1400 score!
- #1 across all categories, a milestone that keeps getting harder to achieve

Huge congratulations to @xAI on this milestone! View thread 🧵 for more insights into Grok-3's performance after ~8K votes in the Arena.
Here you can see @xai Grok-3’s performance across all the top categories:
🔹 Overall w/ Style Control
🔹 Hard Prompts & Hard Prompt w/ Style Control
🔹 Coding
🔹 Math
🔹 Creative Writing
🔹 Instruction Following
🔹 Longer Query
🔹 Multi-Turn
In the Coding category, Grok-3 surpassed top reasoning models like o1 and Gemini-thinking.
Feb 7
Introducing Arena-Price Plot! 💰📊

An interactive plot of price vs. performance trade-offs for LLMs.

Frontier efficiency models:
🔹 Gemini-2.0-Flash/Lite by @GoogleDeepMind
🔹 DeepSeek-R1 by @deepseek_ai
🔹 GPT-4o by @OpenAI
🔹 Yi-Lightning by @01AI_Yi
🔹 Ministral 8B by @MistralAI

LLM efficiency is accelerating. Kudos to the labs driving the frontier!
You can select organizations to highlight their models in Arena-Price Plot.
You can select a specific category (e.g., Coding, Math, Creative Writing, …) for your tasks.
Jan 24
Breaking News: DeepSeek-R1 surges to the top-3 in Arena🐳!

Now ranked #3 Overall, matching the top reasoning model, o1, while being 20x cheaper and open-weight!

Highlights:
- #1 in technical domains: Hard Prompts, Coding, Math
- Joint #1 under Style Control
- MIT-licensed

A massive congrats to @deepseek_ai for this incredible milestone and gift to the community! More analysis below 👇
In Hard Prompt with Style Control, DeepSeek-R1 ranked joint #1 with o1.
Early results show DeepSeek-R1 strong across all domains! More votes are being collected for stable rankings.
Jan 6
🎨Text-to-Image Arena Leaderboard is now live with 40K+ community votes!

Top Models:
- #1. Recraft V3
- #2. Ideogram 2.0
- #3. FLUX1.1 [pro]
- #3. Luma Photon
- #5. DALL·E 3
- #5. FLUX.1 [dev]
- #7. Stable Diffusion 3.5 Large

Congrats to @recraftai @ideogram_ai @bfl_ml @LumaLabsAI for claiming the top spots! Which model would you like to see next? Comment to let us know.

More analysis and leaderboard link below👇
The Overall leaderboard combines both User Prompts and Pre-generated Prompts, and we provide a breakdown by category.

In the User Prompts Only breakdown (74% of total votes), Ideogram 2.0 jumps to #1!
Interestingly, model rankings shift in the Pre-generated Prompts (26% of total votes). FLUX/Stable Diffusion improve significantly and DALLE/Ideogram drop.

Our preset prompts are taken from:
huggingface.co/datasets/data-…
Nov 20, 2024
Exciting News from Chatbot Arena❤️‍🔥

Over the past week, the latest @OpenAI ChatGPT-4o (20241120) competed anonymously as "anonymous-chatbot", gathering 8,000+ community votes.

The result? OpenAI reclaims the #1 spot, surpassing Gemini-Exp-1114 with an impressive 1361 score!

Latest GPT-4o shows remarkable improvements – we see a leap in creative writing (1365 → 1402) as well as technical domains (e.g., coding, math).

Category Rankings:

- Overall: #2 → #1
- Overall (StyleCtrl): #2 → #1
- Creative Writing: #2 → #1
- Coding: #2 → #1
- Math: #4 → #3
- Hard: #2 → #1

Huge congrats @OpenAI! More analysis below👇
Latest ChatGPT-4o remains #1 with Style Control and improves across the board.
A big leap in creative writing (1365 → 1402)
Nov 14, 2024
Massive News from Chatbot Arena🔥

@GoogleDeepMind's latest Gemini (Exp 1114), tested with 6K+ community votes over the past week, now ranks joint #1 overall with an impressive 40+ score leap, matching 4o-latest and surpassing o1-preview! It also claims #1 on the Vision leaderboard.

Gemini-Exp-1114 excels across technical and creative domains:

- Overall #3 -> #1
- Math: #3 -> #1
- Hard Prompts: #4 -> #1
- Creative Writing #2 -> #1
- Vision: #2 -> #1
- Coding: #5 -> #3
- Overall (StyleCtrl): #4 -> #4

Huge congrats to @GoogleDeepMind on this remarkable milestone!

Come try the new Gemini and share your feedback!
Gemini-Exp-1114 joint #1 in Math Arena, matching o1 performance!
Gemini-Exp-1114 ranking across categories
