lmarena.ai (formerly lmsys.org)
Aug 1, 2024
Exciting News from Chatbot Arena!

@GoogleDeepMind's new Gemini 1.5 Pro (Experimental 0801) has been tested in Arena for the past week, gathering over 12K community votes.

For the first time, Google Gemini has claimed the #1 spot, surpassing GPT-4o/Claude-3.5 with an impressive score of 1300 (!), and also achieving #1 on our Vision Leaderboard.

Gemini 1.5 Pro (0801) excels in multi-lingual tasks and delivers robust performance in technical areas like Math, Hard Prompts, and Coding.

Huge congrats to @GoogleDeepMind on this remarkable milestone!

Gemini (0801) Category Rankings:
- Overall: #1
- Math: #1-3
- Instruction-Following: #1-2
- Coding: #3-5
- Hard Prompts (English): #2-5

Come try the model and let us know your feedback!
More analysis below👇
Gemini 1.5 Pro (Experimental 0801) #1 on Vision Leaderboard.
Gemini shows strong multilingual capability: #1 performance in Chinese, Japanese, German, Russian.
But in technical domains like the Coding and Hard Prompts arenas, Claude 3.5 Sonnet, GPT-4o, and Llama 405B are still leading the way.
Overall win-rate heatmap: Gemini 1.5 Pro (0801) wins 54% vs GPT-4o, 59% vs Claude-3.5-Sonnet.

Check out the full data at leaderboard.lmsys.org and come chat with the model!
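
The head-to-head win rates quoted above can be related to a score gap with a quick back-of-the-envelope calculation. The sketch below assumes an Elo-style logistic model with base 10 and a scale of 400; the function names are hypothetical and this is not the leaderboard's own code. The published scores are fit jointly over all model pairs, so pairwise gaps computed this way are only a rough guide.

```python
import math


def implied_rating_gap(win_rate: float, scale: float = 400.0) -> float:
    """Rating gap implied by a head-to-head win rate.

    Assumes an Elo-style logistic model (base 10, scale 400); this is an
    illustrative assumption, not the leaderboard's exact fitting procedure.
    """
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return scale * math.log10(win_rate / (1.0 - win_rate))


def expected_win_rate(gap: float, scale: float = 400.0) -> float:
    """Inverse mapping: expected win rate for a given rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / scale))


# Win rates quoted in the thread for Gemini 1.5 Pro (0801).
quoted = [("GPT-4o", 0.54), ("Claude-3.5-Sonnet", 0.59)]

for opponent, win_rate in quoted:
    gap = implied_rating_gap(win_rate)
    print(f"vs {opponent}: {win_rate:.0%} win rate ~ {gap:.0f}-point gap")
```

Under these assumptions, a 54% win rate corresponds to a gap of roughly 28 points and a 59% win rate to roughly 63 points. How ties are counted in the heatmap would also shift these figures.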


More from @lmarena_ai

Jan 6
🎨Text-to-Image Arena Leaderboard is now live with 40K+ community votes!

Top Models:
- #1. Recraft V3
- #2. Ideogram 2.0
- #3. FLUX1.1 [pro]
- #3. Luma Photon
- #5. DALL·E 3
- #5. FLUX.1 [dev]
- #7. Stable Diffusion 3.5 Large

Congrats to @recraftai @ideogram_ai @bfl_ml @LumaLabsAI for claiming the top spots! Which model would you like to see next? Comment to let us know.

More analysis and leaderboard link below👇
The Overall leaderboard combines both User Prompts and Pre-generated Prompts, and we provide a breakdown for each category.

In the User Prompts Only breakdown (74% of total votes), Ideogram 2.0 jumps to #1!
Interestingly, model rankings shift in the Pre-generated Prompts breakdown (26% of total votes). FLUX/Stable Diffusion improve significantly and DALLE/Ideogram drop.

Our preset prompts are taken from:
huggingface.co/datasets/data-…
Nov 20, 2024
Exciting News from Chatbot Arena❤️‍🔥

Over the past week, the latest @OpenAI ChatGPT-4o (20241120) competed anonymously as "anonymous-chatbot", gathering 8,000+ community votes.

The result? OpenAI reclaims the #1 spot, surpassing Gemini-Exp-1114 with an impressive 1361 score!

Latest GPT-4o shows remarkable improvements – we see a leap in creative writing (1365 → 1402) as well as technical domains (e.g., coding, math).

Category Rankings:

- Overall: #2 → #1
- Overall (StyleCtrl): #2 → #1
- Creative Writing: #2 → #1
- Coding: #2 → #1
- Math: #4 → #3
- Hard: #2 → #1

Huge congrats @OpenAI! More analysis below👇
The latest ChatGPT-4o remains #1 with Style Control and improves across the board.
A big leap in creative writing (1365 → 1402)
Nov 14, 2024
Massive News from Chatbot Arena🔥

@GoogleDeepMind's latest Gemini (Exp 1114), tested with 6K+ community votes over the past week, now ranks joint #1 overall with an impressive 40+ score leap, matching 4o-latest and surpassing o1-preview! It also claims #1 on the Vision leaderboard.

Gemini-Exp-1114 excels across technical and creative domains:

- Overall: #3 -> #1
- Math: #3 -> #1
- Hard Prompts: #4 -> #1
- Creative Writing: #2 -> #1
- Vision: #2 -> #1
- Coding: #5 -> #3
- Overall (StyleCtrl): #4 -> #4

Huge congrats to @GoogleDeepMind on this remarkable milestone!

Come try the new Gemini and share your feedback!
Gemini-Exp-1114 joint #1 in Math Arena, matching o1 performance!
Gemini-Exp-1114 ranking across categories
Nov 12, 2024
Which model is best for coding? @CopilotArena leaderboard is out!

Our code completions leaderboard contains data collected over the last month, with >100K completions served and >10K votes!

Let’s discuss our findings so far🧵
Here are our main takeaways from the leaderboard:

- With our prompting method, Sonnet-3.5 is able to compete with code-specific models like Deepseek V2.5 on code completion.
- Within a tier, we still observe slight fluctuations as we obtain more votes.
- We find that GPT-4o-mini is much worse than all other models.

2/n
Most current Copilot Arena users code in Python, followed by JavaScript/TypeScript, HTML/Markdown, and C++.

3/n
Sep 27, 2024
Exciting update from Vision Chatbot Arena!

We’ve gathered over 6K new votes for the latest open vision models (Qwen2, Llama 3.2, Pixtral) and ChatGPT-4o.

- ChatGPT-4o has taken the #1 spot, surpassing Gemini.
- Open models (Qwen, Llama 3.2, Pixtral) are rapidly improving, matching proprietary offerings.

Competition in vision is heating up. Cast your vote and help decide the best vision model!

Full leaderboard link below👇
Link: lmarena.ai/leaderboard

Vision leaderboard: confidence intervals plot
Winrate heatmap
Aug 23, 2024
Chatbot Arena update❤️‍🔥

Exciting news—@xAI's Grok-2 and Grok-mini are now officially on the leaderboard!

With over 6000 community votes, Grok-2 has claimed the #2 spot, surpassing GPT-4o (May) and tying with the latest Gemini! Grok-2-mini also impresses at #5.

Grok-2 excels in Math (#1) and ranks #2 across the board (Hard Prompts, Coding, Instruction-following). More plot analysis in 2nd post👇

Huge congratulations to @xAI on this remarkable achievement!
We’re rolling out a new "Overview" feature for the leaderboard. @xAI's Grok-2 stands out, ranking at the top across all categories: Math, Hard Prompts, Coding, and Instruction-following.
Confidence intervals on model strength
