Introducing Prompt-to-Leaderboard (P2L): a real-time LLM leaderboard tailored exactly to your use case!
P2L trains an LLM to generate prompt-specific leaderboards: input any prompt and get a model ranking tailored to that exact prompt.
The model is trained on 2M human preference votes from Chatbot Arena.
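Conceptually, this is a prompt-conditioned Bradley-Terry setup: given a prompt, the model emits a coefficient per candidate LLM, and the probability that model A beats model B is a sigmoid of their coefficient difference. A minimal sketch of that idea — the model names and coefficient values below are made up for illustration, not output of the real P2L model:

```python
import math

def win_prob(theta_a: float, theta_b: float) -> float:
    """Bradley-Terry: P(model a beats model b) = sigmoid(theta_a - theta_b)."""
    return 1.0 / (1.0 + math.exp(-(theta_a - theta_b)))

def prompt_leaderboard(theta: dict[str, float]) -> list[str]:
    """Rank models by their prompt-specific coefficient (higher = better)."""
    return sorted(theta, key=theta.get, reverse=True)

# Hypothetical per-prompt coefficients a trained P2L head might emit
# for a single user prompt.
theta = {"model-a": 1.2, "model-b": 0.4, "model-c": -0.3}
print(prompt_leaderboard(theta))  # → ['model-a', 'model-b', 'model-c']
print(round(win_prob(theta["model-a"], theta["model-b"]), 2))  # → 0.69
```

The same pairwise-preference likelihood underlies the regular Arena leaderboard; the difference is that here the coefficients are a function of the prompt rather than global constants.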
P2L Highlights:
🔹Instant leaderboard for any prompt 🗿
🔹Optimal model routing (hit #1 on Chatbot Arena in Jan 2025 with 1395 score 🧏)
🔹Fine-grained model strength & weakness analysis 🤓
Check out our demo and thread below for more details!
Use case 1: Optimal Routing
If we know which models are best per-prompt, that makes optimal routing easy!
- Performance: P2L-router (experimental-router-0112) is #1 on Chatbot Arena in Jan 2025 with a score of 1395 (+20 over the best single model candidate)
- We also develop cost-constrained P2L routers that trace the cost-performance Pareto frontier
Use case 2: Domain-Specific Leaderboards
P2L can aggregate rankings of prompts within a category to produce an adaptive category ranking →
e.g., Find the best models for SQL queries instantly!
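One straightforward way to aggregate — a sketch, not necessarily the exact aggregation P2L uses — is to average each model's per-prompt coefficients over the prompts in a category and rank by the mean:

```python
from statistics import mean

def category_leaderboard(per_prompt: list[dict[str, float]]) -> list[str]:
    """Average each model's per-prompt scores and rank by the mean."""
    models = per_prompt[0].keys()
    avg = {m: mean(scores[m] for scores in per_prompt) for m in models}
    return sorted(avg, key=avg.get, reverse=True)

# Illustrative per-prompt coefficient vectors for three SQL prompts
sql_prompts = [
    {"model-a": 0.9, "model-b": 1.3, "model-c": 0.1},
    {"model-a": 1.1, "model-b": 1.0, "model-c": 0.2},
    {"model-a": 0.8, "model-b": 1.2, "model-c": 0.0},
]
print(category_leaderboard(sql_prompts))
# → ['model-b', 'model-a', 'model-c']
```

Because the per-prompt scores come from the same model, any slice of prompts — SQL, poetry, a single user's history — yields a leaderboard with no extra voting required.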
Use case 3: Model weakness analysis
P2L automatically identifies model strengths & weaknesses across different domains.
Examples:
- o1-mini dominates in Arithmetic Operations & Calculations
- But struggles in Suspenseful Horror Story writing
BREAKING: @xAI early version of Grok-3 (codename "chocolate") is now #1 in Arena! 🏆
Grok-3 is:
- First model ever to break a 1400 score!
- #1 across all categories, a milestone that keeps getting harder to achieve
Huge congratulations to @xAI on this milestone! View thread 🧵 for more insights into Grok-3's performance after ~8K votes in the Arena.
Here you can see @xai Grok-3’s performance across all the top categories:
🔹 Overall w/ Style Control
🔹 Hard Prompts & Hard Prompt w/ Style Control
🔹 Coding
🔹 Math
🔹 Creative Writing
🔹 Instruction Following
🔹 Longer Query
🔹 Multi-Turn
In the Coding category, Grok-3 surpassed top reasoning models like o1 and Gemini-thinking.
An interactive plot of price vs. performance trade-offs for LLMs.
Frontier efficiency models:
🔹 Gemini-2.0-Flash/Lite by @GoogleDeepMind
🔹 DeepSeek-R1 by @deepseek_ai
🔹 GPT-4o by @OpenAI
🔹 Yi-Lightning by @01AI_Yi
🔹 Ministral 8B by @MistralAI
LLM efficiency is accelerating—kudos to the labs driving the frontier!
You can select organizations to highlight their models in the Arena-Price Plot.
You can also select a specific category (e.g., Coding, Math, Creative Writing, …) for your tasks.
Congrats to @recraftai @ideogram_ai @bfl_ml @LumaLabsAI for claiming the top spots! Which model would you like to see next? Comment to let us know.
More analysis and leaderboard link below👇
The Overall leaderboard combines both User Prompts and Pre-generated Prompts; we also provide a breakdown by category.
In the User Prompts-only breakdown (74% of total votes), Ideogram 2.0 jumps to #1!
Interestingly, model rankings shift in the Pre-generated Prompts breakdown (26% of total votes): FLUX/Stable Diffusion improve significantly, while DALL·E/Ideogram drop.
@GoogleDeepMind's latest Gemini (Exp 1114), tested with 6K+ community votes over the past week, now ranks joint #1 overall with an impressive 40+ score leap — matching 4o-latest and surpassing o1-preview! It also claims #1 on the Vision leaderboard.
Gemini-Exp-1114 excels across technical and creative domains: