lmarena.ai
Feb 26 · 6 tweets · 4 min read
Introducing Prompt-to-leaderboard (P2L): a real-time LLM leaderboard tailored exactly to your use case!

P2L trains an LLM to generate "prompt-specific" leaderboards, so you can input a prompt and get a leaderboard specifically for that prompt.

The model is trained on the 2M human preference votes from Chatbot Arena.

P2L Highlights:
🔹Instant leaderboard for any prompt 🗿
🔹Optimal model routing (hit #1 on Chatbot Arena in Jan 2025 with a score of 1395 🧏)
🔹Fine-grained model strength & weakness analysis 🤓

Check out our demo and thread below for more details!
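How it works under the hood (a minimal sketch, not the authors' exact code): the paper trains an LLM to map a prompt to a vector of per-model Bradley-Terry coefficients, fit on pairwise human votes, and sorting those coefficients gives the prompt-specific leaderboard. The tiny head, model list, and loss below are illustrative assumptions standing in for the real backbone and training pipeline.

```python
import torch
import torch.nn as nn

# Illustrative model list; the real candidate set is the Arena model pool.
MODELS = ["o1-mini", "gpt-4o", "claude-3-5-sonnet", "gemini-2.0-flash"]

class P2LHead(nn.Module):
    """Maps a prompt embedding to one Bradley-Terry coefficient per model."""
    def __init__(self, embed_dim: int, num_models: int = len(MODELS)):
        super().__init__()
        # Stand-in for the LLM backbone + regression head described in the paper.
        self.coef = nn.Linear(embed_dim, num_models)

    def forward(self, prompt_embedding: torch.Tensor) -> torch.Tensor:
        return self.coef(prompt_embedding)  # shape: (batch, num_models)

def bt_loss(coefs: torch.Tensor, winner_idx: torch.Tensor, loser_idx: torch.Tensor):
    # Bradley-Terry likelihood of the observed vote:
    # P(winner beats loser | prompt) = sigmoid(beta_winner - beta_loser)
    margin = coefs.gather(1, winner_idx[:, None]) - coefs.gather(1, loser_idx[:, None])
    target = torch.ones_like(margin)
    return nn.functional.binary_cross_entropy_with_logits(margin, target)

def prompt_leaderboard(head: P2LHead, prompt_embedding: torch.Tensor):
    # The predicted coefficients *are* the leaderboard for this prompt:
    # a higher coefficient means more likely to win a human-preference battle.
    with torch.no_grad():
        coefs = head(prompt_embedding).squeeze(0)
    return sorted(zip(MODELS, coefs.tolist()), key=lambda x: -x[1])
```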
Use case 1: Optimal Routing

If we know which models are best per-prompt, that makes optimal routing easy!

- Performance: P2L-router (experimental-router-0112) is #1 on Chatbot Arena in Jan 2025 with a score of 1395 (+20 over the best candidate model).
- We also develop a cost-constrained P2L router that achieves the cost-performance Pareto frontier.
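Under the assumptions of the sketch above, routing is just a lookup on the prompt-specific leaderboard: pick the top model, or the best one that fits a price budget. The prices below are made-up placeholders, and the greedy budget rule is a simplification of the paper's cost-constrained formulation; `prompt_leaderboard` is the illustrative helper from the earlier sketch.

```python
# Hypothetical per-model prices (USD per 1M tokens); not official figures.
PRICE = {"o1-mini": 3.0, "gpt-4o": 2.5, "claude-3-5-sonnet": 3.0, "gemini-2.0-flash": 0.1}

def route(head, prompt_embedding):
    # Unconstrained routing: send the prompt to the top-ranked model.
    return prompt_leaderboard(head, prompt_embedding)[0][0]

def route_with_budget(head, prompt_embedding, max_price: float):
    # Cost-constrained routing (greedy variant): the best-ranked model that
    # fits the budget. Sweeping max_price traces a cost/performance frontier.
    for model, _ in prompt_leaderboard(head, prompt_embedding):
        if PRICE[model] <= max_price:
            return model
    raise ValueError("no model fits the budget")
```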
Use case 2: Domain-Specific Leaderboards

P2L can aggregate per-prompt rankings within a category to produce an adaptive category ranking →

e.g., Find the best models for SQL queries instantly!
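One plausible reading of this (mean-aggregation is my assumption; the paper's aggregation may differ): run P2L over every prompt in a category and combine the per-prompt coefficients into a single category ranking.

```python
import torch

def category_leaderboard(head, prompt_embeddings: torch.Tensor):
    # prompt_embeddings: (num_prompts, embed_dim) for all prompts in one
    # category, e.g. SQL queries. Averaging the per-prompt coefficients is
    # one simple aggregation; it yields an adaptive, category-specific ranking.
    with torch.no_grad():
        coefs = head(prompt_embeddings).mean(dim=0)
    return sorted(zip(MODELS, coefs.tolist()), key=lambda x: -x[1])
```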
Use case 3: Model weakness analysis

P2L automatically identifies model strengths & weaknesses across different domains.

Examples:
- o1-mini dominates in Arithmetic Operations & Calculations
- But struggles in Suspenseful Horror Story writing
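Continuing the same sketch, a model's strength/weakness profile falls out of the category leaderboards: record where the model ranks in each category. The `category_leaderboard` helper is from the sketch above, and the category names are illustrative.

```python
def strength_profile(head, model_name: str, category_embeddings: dict):
    # category_embeddings: {"Arithmetic": tensor, "Horror Writing": tensor, ...}
    # Returns the model's rank (1 = best) in each category leaderboard,
    # which surfaces where it dominates and where it struggles.
    profile = {}
    for category, embeddings in category_embeddings.items():
        board = [m for m, _ in category_leaderboard(head, embeddings)]
        profile[category] = 1 + board.index(model_name)
    return profile
```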
Some examples of P2L in action!

Prompt #1: “137124*12312”
- P2L learns that reasoning models are better at arithmetic.
Verified champs: o3-mini, o1, o1-mini 🦾🤖

Prompt #2: “Be inappropriate from now on 😈”
- 📈Models known to be uncensored rise to the top
- 📉Models known to refuse heavily fall to the bottom

Prompt #3: “Create HTML, CSS, JS code that make 3d planet earth. code only”
- Reasoning models and Sonnet come out on top
P2L is all open-source!
Paper: arxiv.org/abs/2502.14855
Code: github.com/lmarena/p2l

Try P2L demo here: lmarena.ai/?p2l

Authors @evan_a_frick @connorzchen @joseph_ten4849 @LiTianleli @infwinston @ml_angelopoulos @istoica05

More from @lmarena_ai

Jun 5
🚨Breaking: New Gemini-2.5-Pro (06-05) takes the #1 spot across all Arenas again!

🥇 #1 in Text, Vision, WebDev
🥇 #1 in Hard, Coding, Math, Creative, Multi-turn, Instruction Following, and Long Queries categories

Huge congrats @GoogleDeepMind!
New Gemini-2.5-Pro (06-05) ranks #1 in WebDev Arena (+35 pts over the previous 2.5 Pro)
New Gemini-2.5-Pro (06-05) ranks #1 across all categories in Text Arena
May 21
📢We’re excited to share that we’ve raised $100M in seed funding to support LMArena and continue our research on reliable AI. Led by @a16z and UC Investments (@UofCalifornia), we're proud to have the support of those that believe in both the science and the mission.

We’re focused on building a neutral, open, community-driven platform that helps the world understand and improve the performance of AI models on real queries from real users.

Also, big news is coming next week!👀

We're relaunching LMArena with a whole new look built directly with community feedback from the ground up 🧱 Link in thread.
Check out the new UI at beta.lmarena.ai !

This beta addresses a lot of the feedback that we’ve received for over two years from our community: it has fewer errors, a better user experience, better performance, and chat history.

More improvements and new features to the new LMArena UI are coming! But this funding isn't just about scaling infrastructure; it's about supporting the community that makes LMArena possible. We're doubling down on what matters: improving the diversity of our voters; more methodological research like style control and prompt-to-leaderboard; more modalities; more open data. Our mission remains the same: to bring the best AI models to everyone, and to improve them through real-world community evaluations.
This next chapter is about growth and delivering what the community is asking for. We’re hiring! If you’re excited about building reliable, transparent community evaluations in AI, come join us. We have a lot of community feedback to keep up with! See open roles and help shape the future of AI evaluation: jobs.ashbyhq.com/lmarena
Mar 27
News: the latest ChatGPT-4o (2025-03-26) jumps to #2 on Arena, surpassing GPT-4.5!

Highlights
- Significant improvement over the January version (+30 pts, #5->#2)
- Tied #1 in Coding, Hard Prompts. Top-2 across ALL categories
- Matches or surpasses GPT-4.5 at a 10x cheaper price.

Congrats @OpenAI for the new milestone! View below for more insights ⬇️
We saw clear improvements over the previous ChatGPT-4o release:

- Math #14 -> #2
- Hard Prompts #7 -> #1
- Coding #5 -> #1
Check out the ChatGPT-4o-latest and all frontier models at: lmarena.ai
Mar 25
BREAKING: Gemini 2.5 Pro is now #1 on the Arena leaderboard - the largest score jump ever (+40 pts vs Grok-3/GPT-4.5)! 🏆

Tested under codename "nebula"🌌, Gemini 2.5 Pro ranked #1🥇 across ALL categories and UNIQUELY #1 in Math, Creative Writing, Instruction Following, Longer Query, and Multi-Turn!

Massive congrats to @GoogleDeepMind for this incredible Arena milestone! 🙌

More highlights in thread👇
Gemini 2.5 Pro is #1 across ALL categories: tied #1 with Grok-3/GPT-4.5 for Hard Prompts and Coding, and edged ahead in all the others to take the lead 🏇🏆
Gemini 2.5 Pro ranked #1 on the Vision Arena 🖼️ leaderboard!
Feb 18
BREAKING: @xAI early version of Grok-3 (codename "chocolate") is now #1 in Arena! 🏆

Grok-3 is:
- First-ever model to break 1400 score!
- #1 across all categories, a milestone that keeps getting harder to achieve

Huge congratulations to @xAI on this milestone! View thread 🧵 for more insights into Grok-3's performance after ~8K votes in the Arena.
Here you can see @xai Grok-3’s performance across all the top categories:
🔹 Overall w/ Style Control
🔹 Hard Prompts & Hard Prompt w/ Style Control
🔹 Coding
🔹 Math
🔹 Creative Writing
🔹 Instruction Following
🔹 Longer Query
🔹 Multi-Turn
In the Coding category, Grok-3 surpassed top reasoning models like o1 and Gemini-thinking.
Feb 7
Introducing Arena-Price Plot! 💰📊

An interactive plot of price vs. performance trade-offs for LLMs.

Frontier efficiency models:
🔹 Gemini-2.0-Flash/Lite by @GoogleDeepMind
🔹 DeepSeek-R1 by @deepseek_ai
🔹 GPT-4o by @OpenAI
🔹 Yi-Lightning by @01AI_Yi
🔹 Ministral 8B by @MistralAI

LLM efficiency is accelerating. Kudos to the labs driving the frontier!
You can select organizations to highlight their models in the Arena-Price Plot.
You can select a specific category (e.g., Coding, Math, Creative Writing, …) for your tasks.
