@GoogleDeepMind's new Gemini 1.5 Pro (Experimental 0801) has been tested in Arena for the past week, gathering over 12K community votes.
For the first time, Google Gemini has claimed the #1 spot, surpassing GPT-4o/Claude-3.5 with an impressive score of 1300 (!), and also achieving #1 on our Vision Leaderboard.
Gemini 1.5 Pro (0801) excels in multi-lingual tasks and delivers robust performance in technical areas like Math, Hard Prompts, and Coding.
Huge congrats to @GoogleDeepMind on this remarkable milestone!
We introduce RouteLLM – a routing framework based on human preference data that directs simple queries to a cheaper model.
With data augmentation techniques, RouteLLM achieves cost reductions of over 85% on MT Bench and 45% on MMLU while maintaining 95% of GPT-4's performance.
Compared against commercial offerings (Martian and Unify AI) on MT Bench, RouteLLM achieves the same performance while being over 40% cheaper.
Our model, datasets, and code for serving and evaluating LLM routers are all open-sourced. We are excited to see what the community will build on top!
1/6 (blog post & links👇)
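To make the core idea concrete, here is a minimal sketch of preference-based routing. It is not the RouteLLM API: the `win_probability` heuristic, threshold, and model names are placeholders standing in for a router trained on human preference data.

```python
# Minimal sketch of preference-based routing, NOT the actual RouteLLM API.
# win_probability() stands in for a router trained on human preference data;
# it should estimate how likely the strong model is needed for a query.

STRONG_MODEL = "gpt-4"         # expensive, high-quality model (placeholder name)
WEAK_MODEL = "mixtral-8x7b"    # cheaper model (placeholder name)

def win_probability(query: str) -> float:
    """Placeholder for a learned router; a real one would score the query with a trained model."""
    return min(1.0, len(query.split()) / 100)  # toy heuristic: longer query -> harder

def route(query: str, threshold: float = 0.5) -> str:
    """Send the query to the strong model only when the router is confident it is needed."""
    return STRONG_MODEL if win_probability(query) >= threshold else WEAK_MODEL

print(route("What is 2 + 2?"))                    # -> mixtral-8x7b (simple query)
print(route("Prove the spectral theorem " * 30))  # -> gpt-4 (long, harder query)
```

Raising the threshold sends fewer queries to the strong model, trading a little quality for a larger cost reduction.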
With public data from Chatbot Arena, we trained four different routers using data augmentation techniques to significantly improve router performance.
By routing between GPT-4 and Mixtral-8x7B, we demonstrate cost reductions of over 85% on MT Bench and 45% on MMLU while achieving 95% of GPT-4's performance.
2/6
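As a rough illustration of how a cost/quality number like "95% of GPT-4 performance" can be read off a threshold sweep (the data, scores, and protocol below are synthetic, not the actual RouteLLM evaluation), one can pick the largest routing threshold that still reaches the quality target:

```python
import numpy as np

# Synthetic illustration of the cost/quality trade-off behind routing; the data,
# scores, and protocol here are made up and are not the RouteLLM evaluation.
rng = np.random.default_rng(0)
n = 500
router_score = rng.uniform(0, 1, n)           # P(strong model needed), per query
strong_quality = rng.normal(9.0, 0.5, n)      # per-query benchmark scores, strong model
weak_quality = strong_quality - 3 * rng.uniform(0, 1, n) * router_score  # weak model lags on hard queries

target = 0.95 * strong_quality.mean()         # "95% of the strong model's performance"
best = None
for threshold in np.linspace(0, 1, 101):
    to_strong = router_score >= threshold
    quality = np.where(to_strong, strong_quality, weak_quality).mean()
    if quality >= target:
        best = (threshold, to_strong.mean())  # largest threshold still meeting the target

threshold, strong_fraction = best
print(f"threshold={threshold:.2f}, strong-model calls={strong_fraction:.0%}, "
      f"i.e. ~{1 - strong_fraction:.0%} of strong-model calls avoided")
```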
We compare our router performance to commercial offerings (Martian and Unify AI) on MT Bench with strong results, achieving the same performance as these commercial routers while being over 40% cheaper.
Since Llama 3’s release, it has quickly jumped to the top of the leaderboard. We dive into our data and answer the questions below:
- What are users asking? When do users prefer Llama 3?
- How challenging are the prompts?
- Are certain users or prompts over-represented?
- Does Llama 3 have qualitative differences that make users like it?
Key Insights: 1. Llama 3 beats top-tier models on open-ended writing and creative problems but loses a bit on closed-ended math and coding problems.
2. As prompts get more challenging*, the gap between Llama 3 and top-tier models becomes larger.
* We define challenging using several criteria like complexity, problem-solving, domain knowledge, and more.
(Cont'd) We show Llama 3-70b-Instruct's win rate conditioned on hierarchical criteria subsets. Some criteria clearly separate the model's strengths from its weaknesses.
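For concreteness, here is a sketch of how a win rate conditioned on criteria subsets can be computed from pairwise battle records. The column names and rows are illustrative only, not the actual Arena battle schema.

```python
import pandas as pd

# Illustrative win-rate-by-criteria computation; columns and rows are hypothetical,
# not the actual Chatbot Arena battle schema.
battles = pd.DataFrame({
    "model_a": ["llama-3-70b-instruct"] * 6,
    "model_b": ["gpt-4"] * 6,
    "winner": ["model_a", "model_b", "model_a", "model_b", "model_b", "model_a"],
    "criteria": [
        ["creative-writing"], ["math"], ["creative-writing", "brainstorming"],
        ["coding"], ["math", "problem-solving"], ["brainstorming"],
    ],
})

# Count each battle once per criterion it satisfies, then take Llama 3's win rate
# within each criterion subset.
win_rate = (
    battles.explode("criteria")
    .assign(llama_win=lambda d: d["winner"].eq("model_a"))
    .groupby("criteria")["llama_win"]
    .mean()
    .sort_values(ascending=False)
)
print(win_rate)
```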
Introducing Arena-Hard – a pipeline to build our next generation benchmarks with live Arena data.
Highlights:
- Significantly better separability than MT-bench (22.6% -> 87.4%)
- Highest agreement with Chatbot Arena ranking (89.1%)
- Fast & cheap to run ($25)
- Frequent updates with live data
We propose using confidence intervals via bootstrapping (sketched below) to compute two metrics:
- Agreement with humans: does the benchmark agree closely with human preference?
- Separability: can the benchmark confidently separate models?
Arena-Hard scores highest on both, serving as a fast proxy for the Chatbot Arena ranking.
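A minimal sketch of the separability side of this, under simplifying assumptions (synthetic per-question scores; the exact Arena-Hard definitions are in the blog post): bootstrap a confidence interval on each model's mean score and count the model pairs whose intervals do not overlap.

```python
import numpy as np

# Sketch of "separability via bootstrapped confidence intervals": resample each
# model's per-question scores, take a CI on the mean, and count how many model
# pairs have non-overlapping CIs. Scores below are synthetic.
rng = np.random.default_rng(42)
scores = {
    "model_a": rng.normal(0.80, 0.10, 500),
    "model_b": rng.normal(0.72, 0.10, 500),
    "model_c": rng.normal(0.71, 0.10, 500),
}

def bootstrap_ci(x, n_boot=1000, alpha=0.05):
    """95% bootstrap confidence interval on the mean benchmark score."""
    means = [rng.choice(x, size=len(x), replace=True).mean() for _ in range(n_boot)]
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

cis = {model: bootstrap_ci(s) for model, s in scores.items()}

models = list(scores)
pairs = [(a, b) for i, a in enumerate(models) for b in models[i + 1:]]
separated = sum(cis[a][0] > cis[b][1] or cis[b][0] > cis[a][1] for a, b in pairs)
print(f"separability: {separated / len(pairs):.1%} of model pairs confidently ranked")
```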
How does the Arena-Hard pipeline work?
1) Input: 200K Arena user prompts
2) Topic modeling to ensure diversity
3) Key criteria (e.g., domain knowledge, problem-solving) to select high-quality topic clusters
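As a stand-in for the topic-modeling step (the real pipeline's embedding and topic-model choices may differ; TF-IDF plus k-means below is only for illustration), here is a small clustering sketch over a handful of made-up prompts:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the topic-modeling step: embed prompts and cluster them so that
# diverse topic clusters can later be filtered by quality criteria.
prompts = [
    "Write a Python function to merge two sorted lists",
    "Debug this segmentation fault in my C code",
    "Plan a 3-day itinerary for Kyoto",
    "Suggest vegetarian dinner ideas for a week",
    "Prove that the square root of 2 is irrational",
    "Explain the Central Limit Theorem with an example",
]

embeddings = TfidfVectorizer().fit_transform(prompts)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

for cluster in sorted(set(labels)):
    members = [p for p, l in zip(prompts, labels) if l == cluster]
    print(f"cluster {cluster}: {members}")
```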
Here is a summary of the relative performance of five notable models, including Alpaca and ChatGPT. We use GPT-4 to generate a set of challenging questions and then ask it to assess the chatbots’ responses.
*DISCLAIMER: This is a fun and non-scientific experiment with GPT-4.
Through careful prompt engineering, GPT-4 is able to accurately evaluate the response quality in most cases, as shown in the example below.
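Below is a minimal sketch of such a judging setup using the OpenAI Python client. The prompt wording, scoring scale, and model name are illustrative rather than the exact prompts used in this experiment.

```python
from openai import OpenAI

# Minimal sketch of GPT-4-as-judge; prompt wording and model name are illustrative,
# not the exact prompts used in the experiment.
client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are a helpful and impartial judge.
Question: {question}

Assistant A's answer: {answer_a}

Assistant B's answer: {answer_b}

Compare the two answers on helpfulness, relevance, accuracy, and level of detail.
Give each a score from 1 to 10, then explain your reasoning."""

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model to compare two chatbot responses to the same question."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge("What causes seasons on Earth?",
            "The tilt of Earth's rotational axis relative to its orbital plane.",
            "The distance between the Earth and the Sun changes a lot."))
```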