Congrats to @recraftai @ideogram_ai @bfl_ml @LumaLabsAI for claiming the top spots! Which model would you like to see next? Comment to let us know.
More analysis and leaderboard link below👇
The Overall leaderboard combines both User Prompts and Pre-generated Prompts; we provide a breakdown by category.
In the User Prompts Only breakdown (74% of total votes), Ideogram 2.0 jumps to #1!
Nov 20, 2024 • 5 tweets • 2 min read
Exciting News from Chatbot Arena❤️🔥
Over the past week, the latest @OpenAI ChatGPT-4o (20241120) competed anonymously as "anonymous-chatbot", gathering 8,000+ community votes.
The result? OpenAI reclaims the #1 spot, surpassing Gemini-Exp-1114 with an impressive 1361 score!
Latest GPT-4o shows remarkable improvements – we see a leap in creative writing (1365 → 1402) as well as technical domains (e.g., coding, math).
Huge congrats @OpenAI! More analysis below👇
Latest ChatGPT-4o remains #1 under Style Control, with improvements across the board.
Nov 14, 2024 • 8 tweets • 3 min read
Massive News from Chatbot Arena🔥
@GoogleDeepMind's latest Gemini (Exp 1114), tested with 6K+ community votes over the past week, now ranks joint #1 overall with an impressive 40+ score leap, matching ChatGPT-4o-latest and surpassing o1-preview! It also claims #1 on the Vision leaderboard.
Gemini-Exp-1114 excels across technical and creative domains:
Huge congrats to @GoogleDeepMind on this remarkable milestone!
Come try the new Gemini and share your feedback!
Gemini-Exp-1114 is joint #1 in the Math Arena, matching o1's performance!
Nov 12, 2024 • 7 tweets • 3 min read
Which model is best for coding? @CopilotArena leaderboard is out!
Our code completions leaderboard contains data collected over the last month, with >100K completions served and >10K votes!
Let’s discuss our findings so far🧵
Here are our main takeaways from the leaderboard:
- With our prompting method, Sonnet-3.5 is able to compete with code-specific models like Deepseek V2.5 on code completion.
- Within a tier, we still observe slight fluctuations as we obtain more votes.
- We find that GPT-4o-mini is much worse than all other models.
2/n
Sep 27, 2024 • 4 tweets • 2 min read
Exciting update from Vision Chatbot Arena!
We’ve gathered over 6K new votes for the latest open vision models (Qwen2, Llama 3.2, Pixtral) and ChatGPT-4o.
- ChatGPT-4o has taken the #1 spot, surpassing Gemini.
- Open models (Qwen, Llama 3.2, Pixtral) are rapidly improving, matching proprietary offerings.
Competition in vision is heating up. Cast your vote and help decide the best vision model!
Exciting news—@xAI's Grok-2 and Grok-mini are now officially on the leaderboard!
With over 6000 community votes, Grok-2 has claimed the #2 spot, surpassing GPT-4o (May) and tying with the latest Gemini! Grok-2-mini also impresses at #5.
Grok-2 excels in Math (#1) and ranks #2 across the board (Hard Prompts, Coding, Instruction-following). More plot analysis in the 2nd post👇
Huge congratulations to @xAI on this remarkable achievement!
We’re rolling out a new "Overview" feature for the leaderboard. @xAI's Grok-2 stands out, ranking at the top across all categories—Math, Hard Prompts, Coding, and Instruction-following.
Aug 1, 2024 • 5 tweets • 3 min read
Exciting News from Chatbot Arena!
@GoogleDeepMind's new Gemini 1.5 Pro (Experimental 0801) has been tested in Arena for the past week, gathering over 12K community votes.
For the first time, Google Gemini has claimed the #1 spot, surpassing GPT-4o/Claude-3.5 with an impressive score of 1300 (!), and also achieving #1 on our Vision Leaderboard.
Gemini 1.5 Pro (0801) excels in multi-lingual tasks and delivers robust performance in technical areas like Math, Hard Prompts, and Coding.
Huge congrats to @GoogleDeepMind on this remarkable milestone!
Come try the model and let us know your feedback!
More analysis below👇
Gemini 1.5 Pro (Experimental 0801) #1 on Vision Leaderboard.
Jul 1, 2024 • 6 tweets • 4 min read
Not all questions need GPT-4!
We introduce RouteLLM – a routing framework based on human preference data that directs simple queries to a cheaper model.
With data augmentation techniques, RouteLLM achieves cost reductions of over 85% on MT Bench and 45% on MMLU while maintaining 95% GPT-4 performance.
Compared against commercial offerings (Martian and Unify AI) on MT Bench, RouteLLM achieves the same performance while being over 40% cheaper.
Our model, datasets, and code for serving and evaluating LLM routers are all open-sourced. We are excited to see what the community will build on top!
1/6 (blog post & links👇)
With public data from Chatbot Arena, we trained four different routers using data augmentation techniques to significantly improve router performance.
By routing between GPT-4 and Mixtral-8x7B, we demonstrate cost reductions of over 85% on MT Bench and 45% on MMLU while achieving 95% of GPT-4's performance.
2/6
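For illustration, here is a minimal sketch of the routing idea: a learned preference score gates each query between a strong and a cheap model. The `predict_win_rate` heuristic, the threshold, and the model names below are placeholder assumptions, not RouteLLM's actual API.

```python
# Toy preference-based router (illustrative only, not RouteLLM's API).
# A real router would replace predict_win_rate with a model trained on
# Chatbot Arena preference data.
from dataclasses import dataclass

@dataclass
class Router:
    threshold: float = 0.6  # cost/quality trade-off knob (assumed)

    def predict_win_rate(self, query: str) -> float:
        # Stand-in for a learned scorer: estimate how likely the strong
        # model's answer is to beat the weak model's. Here, a naive
        # length heuristic so the sketch runs end to end.
        return min(1.0, len(query.split()) / 100)

    def route(self, query: str) -> str:
        # Send the query to the expensive model only when the predicted
        # win rate justifies the cost; otherwise use the cheap model.
        if self.predict_win_rate(query) >= self.threshold:
            return "gpt-4"
        return "mixtral-8x7b"

router = Router()
print(router.route("What is 2 + 2?"))                     # cheap model
print(router.route("Prove the spectral theorem. " * 30))  # strong model
```

The cost savings reported above come from tuning that threshold: the router sends only the queries where the preference model predicts the strong model is actually needed.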
May 9, 2024 • 8 tweets • 3 min read
Exciting new blog -- What’s up with Llama-3?
Since Llama 3's release, it has quickly jumped to the top of the leaderboard. We dive into our data and answer the questions below:
- What are users asking? When do users prefer Llama 3?
- How challenging are the prompts?
- Are certain users or prompts over-represented?
- Does Llama 3 have qualitative differences that make users like it?
Key Insights:
1. Llama 3 beats top-tier models on open-ended writing and creative problems but loses a bit on close-ended math and coding problems.
2. As prompts get more challenging*, the gap between Llama 3 and top-tier models grows.
* We define challenging using several criteria like complexity, problem-solving, domain knowledge, and more.
Apr 21, 2024 • 10 tweets • 4 min read
Introducing Arena-Hard – a pipeline to build our next generation benchmarks with live Arena data.
Highlights:
- Significantly better separability than MT-bench (22.6% -> 87.4%)
- Highest agreement to Chatbot Arena ranking (89.1%)
- Fast & cheap to run ($25)
- Frequent updates with live data
We propose using confidence intervals via bootstrapping to compute the two metrics below:
- Agreement with humans: does the benchmark have high agreement with human preference?
- Separability: can the benchmark confidently separate models?
Arena-Hard achieves the highest scores on both, serving as a fast proxy for the Chatbot Arena ranking.
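As a rough illustration of the separability idea, here is a toy bootstrap over per-battle outcomes. The sample data and function are assumptions for this sketch, not code from the Arena-Hard pipeline.

```python
# Bootstrap confidence intervals for a model's win rate (illustrative).
import random

def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05):
    """outcomes: list of 1 (model wins) / 0 (model loses) per battle."""
    means = []
    for _ in range(n_resamples):
        # Resample battles with replacement and record the mean win rate.
        sample = random.choices(outcomes, k=len(outcomes))
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical battle records for two models on the same benchmark.
ci_a = bootstrap_ci([1] * 75 + [0] * 25)  # model A wins 75%
ci_b = bootstrap_ci([1] * 50 + [0] * 50)  # model B wins 50%

# The pair counts as "separable" when the intervals do not overlap.
separable = ci_a[0] > ci_b[1] or ci_b[0] > ci_a[1]
print(ci_a, ci_b, separable)
```

Separability then aggregates over model pairs: the more pairs a benchmark can distinguish with non-overlapping intervals, the higher its separability score.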
Mar 30, 2023 • 4 tweets • 4 min read
Introducing Vicuna, an open-source chatbot impressing GPT-4!
🚀 Vicuna reaches 90%* quality of ChatGPT/Bard while significantly outperforming other baselines, according to GPT-4's assessment.
Blog: vicuna.lmsys.org
Demo: chat.lmsys.org
Here is a summary of the relative performance of five notable models, including Alpaca and ChatGPT. We use GPT-4 to generate a set of challenging questions and then ask it to assess the chatbots' responses.
*DISCLAIMER: This is a fun and non-scientific experiment with GPT-4.
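For readers curious how this kind of GPT-4 assessment can be wired up, below is a minimal sketch of the judging call. The prompt wording and scoring format are assumptions for illustration, not Vicuna's exact evaluation code.

```python
# Toy GPT-4-as-judge call (illustrative; prompt and scoring are assumed).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """Compare the two assistant answers to the question.
Rate each on helpfulness, relevance, accuracy, and level of detail,
then output two 1-10 scores on the last line as: score_a score_b

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def judge(question: str, answer_a: str, answer_b: str) -> str:
    # One pairwise comparison; a full evaluation loops this over a set
    # of challenging questions and averages the scores.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return response.choices[0].message.content
```

As the disclaimer above notes, scores from a single LLM judge are a fun signal, not a rigorous evaluation.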