lmarena.ai (formerly lmsys.org)
LMArena: Open Platform for Crowdsourced AI Benchmarking. Free Chat and Vote at https://t.co/azF9dwf43Y. Officially graduated from @lmsysorg!
Jan 6 6 tweets 2 min read
🎨Text-to-Image Arena Leaderboard is now live with 40K+ community votes!

Top Models:
- #1. Recraft V3
- #2. Ideogram 2.0
- #3. FLUX1.1 [pro]
- #3. Luma Photon
- #5. DALL·E 3
- #5. FLUX.1 [dev]
- #7. Stable Diffusion 3.5 Large

Congrats to @recraftai @ideogram_ai @bfl_ml @LumaLabsAI for claiming the top spots! Which model would you like to see next? Comment to let us know.

More analysis and leaderboard link below👇

The Overall leaderboard combines both User Prompts and Pre-generated Prompts, which we break down by category.

In the User Prompts Only breakdown (74% of total votes), Ideogram 2.0 jumps to #1!
Nov 20, 2024 5 tweets 2 min read
Exciting News from Chatbot Arena❤️‍🔥

Over the past week, the latest @OpenAI ChatGPT-4o (20241120) competed anonymously as "anonymous-chatbot", gathering 8,000+ community votes.

The result? OpenAI reclaims the #1 spot, surpassing Gemini-Exp-1114 with an impressive 1361 score!

Latest GPT-4o shows remarkable improvements – we see a leap in creative writing (1365 → 1402) as well as technical domains (e.g., coding, math).

Category Rankings:

- Overall: #2 → #1
- Overall (StyleCtrl): #2 → #1
- Creative Writing: #2 → #1
- Coding: #2 → #1
- Math: #4 → #3
- Hard: #2 → #1

Huge congrats @OpenAI! More analysis below👇

The latest ChatGPT-4o also remains #1 with Style Control, with improvements across the board.
Nov 14, 2024 8 tweets 3 min read
Massive News from Chatbot Arena🔥

@GoogleDeepMind's latest Gemini (Exp 1114), tested with 6K+ community votes over the past week, now ranks joint #1 overall with an impressive 40+ score leap, matching 4o-latest and surpassing o1-preview! It also claims #1 on the Vision leaderboard.

Gemini-Exp-1114 excels across technical and creative domains:

- Overall: #3 -> #1
- Math: #3 -> #1
- Hard Prompts: #4 -> #1
- Creative Writing: #2 -> #1
- Vision: #2 -> #1
- Coding: #5 -> #3
- Overall (StyleCtrl): #4 -> #4

Huge congrats to @GoogleDeepMind on this remarkable milestone!

Come try the new Gemini and share your feedback!

Gemini-Exp-1114 is joint #1 in the Math Arena, matching o1 performance!
Nov 12, 2024 7 tweets 3 min read
Which model is best for coding? @CopilotArena leaderboard is out!

Our code completions leaderboard contains data collected over the last month, with >100K completions served and >10K votes!

Let’s discuss our findings so far🧵

Here are our main takeaways from the leaderboard:

- With our prompting method, Sonnet-3.5 is able to compete with code-specific models like Deepseek V2.5 on code completion.
- Within a tier, we still observe slight fluctuations as we obtain more votes.
- We find that GPT-4o-mini is much worse than all other models.

2/n
Sep 27, 2024 4 tweets 2 min read
Exciting update from Vision Chatbot Arena!

We’ve gathered over 6K new votes for the latest open vision models (Qwen2, Llama 3.2, Pixtral) and ChatGPT-4o.

- ChatGPT-4o has taken the #1 spot, surpassing Gemini.
- Open models (Qwen, Llama 3.2, Pixtral) are rapidly improving, matching proprietary offerings

Competition in vision is heating up. Cast your vote and help decide the best vision model!

Full leaderboard link below👇

Vision leaderboard (confidence intervals plot): lmarena.ai/leaderboard
Aug 23, 2024 6 tweets 2 min read
Chatbot Arena update❤️‍🔥

Exciting news—@xAI's Grok-2 and Grok-mini are now officially on the leaderboard!

With over 6000 community votes, Grok-2 has claimed the #2 spot, surpassing GPT-4o (May) and tying with the latest Gemini! Grok-2-mini also impresses at #5.

Grok-2 excels in Math (#1) and ranks #2 across the board (Hard Prompts, Coding, Instruction-following). More plot analysis in the 2nd post👇

Huge congratulations to @xAI on this remarkable achievement!

We’re rolling out a new "Overview" feature for the leaderboard. @xAI's Grok-2 stands out, ranking at the top across all categories: Math, Hard Prompts, Coding, and Instruction-following.
Aug 1, 2024 5 tweets 3 min read
Exciting News from Chatbot Arena!

@GoogleDeepMind's new Gemini 1.5 Pro (Experimental 0801) has been tested in Arena for the past week, gathering over 12K community votes.

For the first time, Google Gemini has claimed the #1 spot, surpassing GPT-4o/Claude-3.5 with an impressive score of 1300 (!), and also achieving #1 on our Vision Leaderboard.

Gemini 1.5 Pro (0801) excels in multi-lingual tasks and delivers robust performance in technical areas like Math, Hard Prompts, and Coding.

Huge congrats to @GoogleDeepMind on this remarkable milestone!

Gemini (0801) Category Rankings:
- Overall: #1
- Math: #1-3
- Instruction-Following: #1-2
- Coding: #3-5
- Hard Prompts (English): #2-5

Come try the model and let us know your feedback!
More analysis below👇

Gemini 1.5 Pro (Experimental 0801) is #1 on the Vision Leaderboard.
Jul 1, 2024 6 tweets 4 min read
Not all questions need GPT-4!

We introduce RouteLLM – a routing framework based on human preference data that directs simple queries to a cheaper model.

With data augmentation techniques, RouteLLM achieves cost reductions of over 85% on MT Bench and 45% on MMLU while maintaining 95% GPT-4 performance.

Compared against commercial offerings (Martian and Unify AI) on MT Bench, RouteLLM achieves the same performance while being over 40% cheaper.

Our model, datasets, and code for serving and evaluating LLM routers are all open-sourced. We are excited to see what the community will build on top!

1/6 (blog post & links👇)

With public data from Chatbot Arena, we trained four different routers using data augmentation techniques to significantly improve router performance.

By routing between GPT-4 and Mixtral-8x7B, we demonstrate cost reductions of over 85% on MT Bench and 45% on MMLU while achieving 95% of GPT-4's performance.

2/6
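To make the routing idea concrete, here is a minimal, illustrative sketch of a threshold-based router: a scorer trained on preference data estimates whether the strong model is needed, and the query goes to the cheaper model otherwise. The function and model names below are assumptions for illustration, not the RouteLLM API.

```python
# Minimal sketch of preference-based routing (illustrative; not the RouteLLM API).
# `win_rate_model`, the model names, and the threshold value are assumptions.

def route_query(prompt: str, win_rate_model, threshold: float = 0.5) -> str:
    """Choose which model should answer `prompt`.

    `win_rate_model` stands in for a router trained on human preference data;
    it returns the estimated probability that the strong model is needed.
    """
    p_strong_needed = win_rate_model.predict(prompt)  # value in [0, 1]
    if p_strong_needed >= threshold:
        return "gpt-4"         # strong but expensive model
    return "mixtral-8x7b"      # cheaper model handles the simple query


# Raising the threshold routes more traffic to the cheap model,
# trading a little quality for a larger cost reduction.
```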
May 9, 2024 8 tweets 3 min read
Exciting new blog -- What’s up with Llama-3?

Since Llama 3’s release, it has quickly jumped to the top of the leaderboard. We dive into our data and answer the questions below:

- What are users asking? When do users prefer Llama 3?
- How challenging are the prompts?
- Are certain users or prompts over-represented?
- Does Llama 3 have qualitative differences that make users like it?

Key Insights:
1. Llama 3 beats top-tier models on open-ended writing and creative problems but loses a bit on closed-ended math and coding problems.

2. As prompts get more challenging*, the gap between Llama 3 and top-tier models becomes larger.

* We define challenging using several criteria like complexity, problem-solving, domain knowledge, and more.
Apr 21, 2024 10 tweets 4 min read
Introducing Arena-Hard – a pipeline to build our next generation benchmarks with live Arena data.

Highlights:
- Significantly better separability than MT-bench (22.6% -> 87.4%)
- Highest agreement to Chatbot Arena ranking (89.1%)
- Fast & cheap to run ($25)
- Frequent updates with live data

We propose to use Confidence Intervals via Bootstrapping to calculate the two metrics below:

- Agreement with human: does benchmark have high agreement to human preference?
- Separability: can benchmark confidently separate models?

Arena-Hard achieves the highest score on both, serving as a fast proxy for the Chatbot Arena ranking.
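For intuition, here is a rough sketch of how bootstrapped confidence intervals can support a separability measure: resample each model's per-prompt outcomes against a fixed baseline, take percentile intervals of the win rate, and count model pairs whose intervals do not overlap. Data shapes and helper names are assumptions; this is not the Arena-Hard implementation.

```python
# Illustrative sketch of bootstrap CIs and separability (not Arena-Hard's code).
import numpy as np

def bootstrap_ci(battles: np.ndarray, n_boot: int = 1000, alpha: float = 0.05):
    """battles: 1 if the model beat the baseline on a prompt, else 0.

    Returns the (lower, upper) percentile confidence interval of its win rate.
    """
    rng = np.random.default_rng(0)
    n = len(battles)
    boot_means = [rng.choice(battles, size=n, replace=True).mean() for _ in range(n_boot)]
    return (np.percentile(boot_means, 100 * alpha / 2),
            np.percentile(boot_means, 100 * (1 - alpha / 2)))

def separability(model_battles: dict[str, np.ndarray]) -> float:
    """Fraction of model pairs whose bootstrap intervals do not overlap."""
    cis = {m: bootstrap_ci(b) for m, b in model_battles.items()}
    models = list(cis)
    pairs = [(a, b) for i, a in enumerate(models) for b in models[i + 1:]]
    separated = sum(cis[a][0] > cis[b][1] or cis[b][0] > cis[a][1] for a, b in pairs)
    return separated / len(pairs) if pairs else 0.0
```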
Mar 30, 2023 4 tweets 4 min read
Introducing Vicuna, an open-source chatbot impressing GPT-4!

🚀 Vicuna reaches 90%* quality of ChatGPT/Bard while significantly outperforming other baselines, according to GPT-4's assessment.

Blog: vicuna.lmsys.org
Demo: chat.lmsys.org

Here is a summary of the relative performance of five notable models, including Alpaca and ChatGPT. We use GPT-4 to generate a set of challenging questions and then ask it to assess chatbots’ responses.

*DISCLAIMER: This is a fun and non-scientific experiment with GPT-4.
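As a rough illustration of the GPT-4-as-judge setup described above (not the original evaluation code), a single pairwise judgment could look like the sketch below; the prompt wording and rating scale are assumptions.

```python
# Rough sketch of GPT-4-as-judge for comparing two chatbot answers
# (illustrative only; prompt wording and the 1-10 scale are assumptions).
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 to compare two answers to the same challenging question."""
    prompt = (
        f"Question: {question}\n\n"
        f"Assistant A: {answer_a}\n\n"
        f"Assistant B: {answer_b}\n\n"
        "Rate each answer from 1 to 10 for helpfulness, relevance, accuracy, "
        "and detail, then briefly explain which assistant answered better."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```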