@GoogleDeepMind's new Gemini 1.5 Pro (Experimental 0801) has been tested in Arena for the past week, gathering over 12K community votes.
For the first time, Google Gemini has claimed the #1 spot, surpassing GPT-4o/Claude-3.5 with an impressive score of 1300 (!), and also achieving #1 on our Vision Leaderboard.
Gemini 1.5 Pro (0801) excels in multilingual tasks and delivers robust performance in technical areas like Math, Hard Prompts, and Coding.
Huge congrats to @GoogleDeepMind on this remarkable milestone!
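For context on where these scores come from: Arena ratings are Bradley-Terry-style coefficients fit to pairwise human votes. Below is a minimal sketch of that fit — illustrative only; the production leaderboard additionally handles ties, confidence intervals, and style control.

```python
import numpy as np

def fit_bradley_terry(votes, n_models, lr=0.1, epochs=500):
    """votes: list of (winner_idx, loser_idx) pairs. Returns Elo-like scores."""
    theta = np.zeros(n_models)          # log-strength per model
    for _ in range(epochs):
        grad = np.zeros(n_models)
        for w, l in votes:
            p_win = 1.0 / (1.0 + np.exp(theta[l] - theta[w]))  # P(w beats l)
            grad[w] += 1.0 - p_win      # gradient of the log-likelihood
            grad[l] -= 1.0 - p_win
        theta += lr * grad / len(votes)
        theta -= theta.mean()           # ratings identified only up to a shift
    return 1000 + 400 * theta / np.log(10)   # map to an Elo-like scale

# Example: ratings = fit_bradley_terry([(0, 1), (0, 1), (1, 2)], n_models=3)
```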
📢We’re excited to share that we’ve raised $100M in seed funding to support LMArena and continue our research on reliable AI. Led by @a16z and UC Investments (@UofCalifornia), we're proud to have the support of those who believe in both the science and the mission.
We’re focused on building a neutral, open, community-driven platform that helps the world understand and improve the performance of AI models on real queries from real users.
Also, big news is coming next week!👀
We're relaunching LMArena with a whole new look built directly with community feedback from the ground up 🧱 Link in thread.
This beta addresses much of the feedback we’ve received from our community over the past two years: fewer errors, a better user experience, better performance, and chat history.
More improvements and new features are coming to the new LMArena UI! But this funding isn’t just about scaling infrastructure; it’s about supporting the community that makes LMArena possible. We’re doubling down on what matters: improving the diversity of our voters; more methodological research like style control and prompt-to-leaderboard; more modalities; more open data. Our mission remains the same: to bring the best AI models to everyone, and to improve them through real-world community evaluations.
This next chapter is about growth and delivering what the community is asking for. We’re hiring! If you’re excited about building reliable, transparent community evaluations in AI, come join us. We have a lot of community feedback to keep up with! See open roles and help shape the future of AI evaluation: jobs.ashbyhq.com/lmarena
News: the latest ChatGPT-4o (2025-03-26) jumps to #2 on Arena, surpassing GPT-4.5!
Highlights
- Significant improvement over the January version (+30 pts, #5->#2)
- Tied for #1 in Coding and Hard Prompts; top-2 across ALL categories
- Matching or surpassing GPT-4.5 at a 10x lower price.
Congrats @OpenAI for the new milestone! View below for more insights ⬇️
We saw clear leaps in performance over the previous ChatGPT-4o release:
- Math #14 -> #2
- Hard Prompts #7 -> #1
- Coding #5 -> #1
Check out the ChatGPT-4o-latest and all frontier models at: lmarena.ai
BREAKING: Gemini 2.5 Pro is now #1 on the Arena leaderboard - the largest score jump ever (+40 pts vs Grok-3/GPT-4.5)! 🏆
Tested under codename "nebula"🌌, Gemini 2.5 Pro ranked #1🥇 across ALL categories and UNIQUELY #1 in Math, Creative Writing, Instruction Following, Longer Query, and Multi-Turn!
Massive congrats to @GoogleDeepMind for this incredible Arena milestone! 🙌
More highlights in thread👇
Gemini 2.5 Pro is #1 across ALL categories: tied for #1 with Grok-3/GPT-4.5 on Hard Prompts and Coding, and ahead of the field in every other category to take the lead 🏇🏆
Gemini 2.5 Pro ranked #1 on the Vision Arena 🖼️ leaderboard!
Introducing Prompt-to-leaderboard (P2L): a real-time LLM leaderboard tailored exactly to your use case!
P2L trains an LLM to generate prompt-specific leaderboards: input a prompt and get a ranking of models tailored to that exact prompt.
The model is trained on 2M human preference votes from Chatbot Arena.
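In outline, the idea is to condition the Bradley-Terry coefficients on the prompt: a regression head maps a prompt representation to one coefficient per candidate model, trained on (prompt, winner, loser) votes. A minimal sketch follows — the class and function names here are illustrative, not the released P2L code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class P2LHead(nn.Module):
    """Maps a prompt embedding to one Bradley-Terry coefficient per model."""
    def __init__(self, d_model: int, n_models: int):
        super().__init__()
        self.proj = nn.Linear(d_model, n_models)

    def forward(self, prompt_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(prompt_emb)          # shape: (batch, n_models)

def bt_loss(coefs: torch.Tensor, winner: torch.Tensor, loser: torch.Tensor):
    """Bradley-Terry loss on vote pairs: -log sigmoid(coef[w] - coef[l])."""
    margin = coefs.gather(1, winner[:, None]) - coefs.gather(1, loser[:, None])
    return F.softplus(-margin).mean()

def prompt_leaderboard(coefs: torch.Tensor, model_names: list) -> list:
    """Rank models for a single prompt by its predicted coefficients."""
    order = torch.argsort(coefs, descending=True)
    return [model_names[int(i)] for i in order]
```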
P2L Highlights:
🔹Instant leaderboard for any prompt 🗿
🔹Optimal model routing (hit #1 on Chatbot Arena in Jan 2025 with a score of 1395 🧏)
🔹Fine-grained model strength & weakness analysis 🤓
Check out our demo and thread below for more details!
Use case 1: Optimal Routing
If we know which models are best per prompt, optimal routing becomes easy!
- Performance: the P2L router (experimental-router-0112) is #1 on Chatbot Arena in Jan 2025 with a score of 1395 (+20 points over the best single model candidate)
- We also developed cost-constrained P2L routers that trace the cost-performance Pareto frontier (see the sketch below)
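A routing sketch under these assumptions — the helper p2l_coefficients is hypothetical, and the cost constraint is shown as a simple filter, whereas cost-constrained P2L solves a proper constrained optimization:

```python
def route(prompt, models, p2l_coefficients, costs=None, max_cost=None):
    """Pick the model with the highest predicted P2L coefficient for this
    prompt, optionally restricted to models within a cost budget."""
    coefs = p2l_coefficients(prompt)   # hypothetical: one coefficient per model
    candidates = range(len(models))
    if max_cost is not None:
        candidates = [i for i in candidates if costs[i] <= max_cost]
    return models[max(candidates, key=lambda i: coefs[i])]
```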
Use case 2: Domain-Specific Leaderboards
P2L can aggregate per-prompt rankings within a category to produce an adaptive category leaderboard →
e.g., find the best models for SQL queries instantly!
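One simple way to realize this — sketched under the assumption that per-prompt coefficients are comparable across prompts, with averaging as just one reasonable aggregator:

```python
import numpy as np

def category_leaderboard(category_prompts, models, p2l_coefficients):
    """Average per-prompt P2L coefficients over a category, then rank."""
    coefs = np.mean([p2l_coefficients(p) for p in category_prompts], axis=0)
    return sorted(zip(models, coefs), key=lambda mc: -mc[1])

# e.g., category_leaderboard(sql_prompts, model_names, p2l_coefficients)
```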
BREAKING: @xAI's early version of Grok-3 (codename "chocolate") is now #1 in Arena! 🏆
Grok-3 is:
- First-ever model to break the 1400 score barrier!
- #1 across all categories, a milestone that keeps getting harder to achieve
Huge congratulations to @xAI on this milestone! View thread 🧵 for more insights into Grok-3's performance after ~8K votes in the Arena.
Here you can see @xAI Grok-3’s performance across all the top categories:
🔹 Overall w/ Style Control
🔹 Hard Prompts & Hard Prompts w/ Style Control
🔹 Coding
🔹 Math
🔹 Creative Writing
🔹 Instruction Following
🔹 Longer Query
🔹 Multi-Turn
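For the "w/ Style Control" entries above: style control adds style covariates (e.g., response length, markdown density) to the pairwise regression, so ratings reflect substance rather than formatting. A minimal sketch, with illustrative feature choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def style_controlled_ratings(model_diff, style_diff, a_won):
    """model_diff: (n_votes, n_models), +1 for model A, -1 for model B;
    style_diff: (n_votes, n_feats), e.g. normalized length difference;
    a_won: (n_votes,), 1 if model A won. Returns per-model coefficients
    with style effects regressed out."""
    X = np.hstack([model_diff, style_diff])
    clf = LogisticRegression(fit_intercept=False).fit(X, a_won)
    return clf.coef_[0, :model_diff.shape[1]]  # style-controlled ratings
```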
In the Coding category, Grok-3 surpassed top reasoning models like o1 and Gemini-thinking.