@GoogleDeepMind's new Gemini 1.5 Pro (Experimental 0801) has been tested in Arena for the past week, gathering over 12K community votes.
For the first time, Google Gemini has claimed the #1 spot, surpassing GPT-4o/Claude-3.5 with an impressive score of 1300 (!), and also achieving #1 on our Vision Leaderboard.
Gemini 1.5 Pro (0801) excels in multi-lingual tasks and delivers robust performance in technical areas like Math, Hard Prompts, and Coding.
Huge congrats to @GoogleDeepMind on this remarkable milestone!
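For readers curious how community votes turn into a single score like 1300: Arena ratings are fit from pairwise votes (the public methodology describes a Bradley-Terry-style model). The sketch below is a deliberately simplified online Elo update on hypothetical vote data, not the actual Arena pipeline — model names and the K-factor are illustrative assumptions.

```python
# Simplified sketch: turning pairwise votes into Elo-style scores.
# NOT the real Arena pipeline (which fits a statistical model over all
# votes at once); this is the classic online Elo update for intuition.

def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, winner, loser, k=4):
    """Apply one community vote: `winner` was preferred over `loser`."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

# Hypothetical models and votes, for illustration only.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
for w, l in votes:
    update_elo(ratings, w, l)
# model_a wins 3 of 4 head-to-head votes, so it ends above model_b.
```

With two evenly rated models, each vote shifts ratings by at most K/2 points, which is why large leaderboards need thousands of votes before scores stabilize.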
Congrats to @recraftai @ideogram_ai @bfl_ml @LumaLabsAI for claiming the top spots! Which model would you like to see next? Comment to let us know.
More analysis and leaderboard link below👇
The Overall leaderboard combines both User Prompts and Pre-generated Prompts, with a breakdown provided for each category.
In the User Prompts-only breakdown (74% of total votes), Ideogram 2.0 jumps to #1!
Interestingly, model rankings shift in the Pre-generated Prompts breakdown (26% of total votes): FLUX/Stable Diffusion improve significantly while DALLE/Ideogram drop.
@GoogleDeepMind's latest Gemini (Exp 1114), tested with 6K+ community votes over the past week, now ranks joint #1 overall with an impressive 40+ score leap — matching 4o-latest and surpassing o1-preview! It also claims #1 on the Vision leaderboard.
Gemini-Exp-1114 excels across technical and creative domains:
Which model is best for coding? @CopilotArena leaderboard is out!
Our code completions leaderboard contains data collected over the last month, with >100K completions served and >10K votes!
Let’s discuss our findings so far🧵
Here are our main takeaways from the leaderboard:
- With our prompting method, Sonnet-3.5 competes with code-specific models like Deepseek V2.5 on code completion.
- Within a tier, we still observe slight fluctuations as we obtain more votes.
- We find that GPT-4o-mini is much worse than all other models.
2/n
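The "slight fluctuations within a tier" above are what you'd expect from sampling noise at ~10K votes. As a hedged illustration (synthetic vote counts, not Copilot Arena data), a bootstrap confidence interval on a head-to-head win rate shows how the interval narrows as votes accumulate:

```python
# Sketch: bootstrap CI on a head-to-head win rate, using synthetic
# vote counts. Shows why nearby models can swap ranks while vote
# totals are small: their confidence intervals still overlap.
import random

def bootstrap_ci(wins, n_votes, n_boot=1000, alpha=0.05, seed=0):
    """95% bootstrap interval for the underlying win probability."""
    rng = random.Random(seed)
    outcomes = [1] * wins + [0] * (n_votes - wins)
    stats = sorted(
        sum(rng.choice(outcomes) for _ in range(n_votes)) / n_votes
        for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Same 55% observed win rate, 10x the votes: the interval tightens.
small = bootstrap_ci(55, 100)
large = bootstrap_ci(550, 1000)
```

With 100 votes the interval spans roughly ±10 points of win rate, wide enough that a "true" 55/45 matchup can look like a tie or even flip; with 1,000 votes it shrinks to a few points.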
Most current Copilot Arena users code in Python, followed by JavaScript/TypeScript, HTML/Markdown, and C++.
Exciting news—@xAI's Grok-2 and Grok-mini are now officially on the leaderboard!
With over 6000 community votes, Grok-2 has claimed the #2 spot, surpassing GPT-4o (May) and tying with the latest Gemini! Grok-2-mini also impresses at #5.
Grok-2 excels in Math (#1) and ranks #2 across the board in Hard Prompts, Coding, and Instruction-following. More plots and analysis in the 2nd post👇
Huge congratulations to @xAI on this remarkable achievement!
We’re rolling out a new "Overview" feature for the leaderboard. @xAI's Grok-2 stands out, ranking at the top across all categories—Math, Hard Prompts, Coding, and Instruction-following.