@GoogleDeepMind's new Gemini 1.5 Pro (Experimental 0801) has been tested in Arena for the past week, gathering over 12K community votes.
For the first time, Google Gemini has claimed the #1 spot, surpassing GPT-4o/Claude-3.5 with an impressive score of 1300 (!), and also achieving #1 on our Vision Leaderboard.
Gemini 1.5 Pro (0801) excels in multi-lingual tasks and delivers robust performance in technical areas like Math, Hard Prompts, and Coding.
Huge congrats to @GoogleDeepMind on this remarkable milestone!
We introduce RouteLLM – a routing framework based on human preference data that directs simple queries to a cheaper model.
With data augmentation techniques, RouteLLM achieves cost reductions of over 85% on MT Bench and 45% on MMLU while maintaining 95% of GPT-4's performance.
Compared against commercial offerings (Martian and Unify AI) on MT Bench, RouteLLM achieves the same performance while being over 40% cheaper.
Our model, datasets, and code for serving and evaluating LLM routers are all open-sourced. We are excited to see what the community will build on top!
1/6 (blog post & links👇)
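To make the core idea concrete, here is a minimal sketch of preference-based routing. It is not the RouteLLM API: the `win_probability` heuristic, threshold, and model names are placeholders standing in for a router trained on human preference data.

```python
# Minimal sketch of preference-based routing, NOT the actual RouteLLM API.
# win_probability() stands in for a router trained on human preference data;
# it should estimate how likely the strong model is needed for a query.

STRONG_MODEL = "gpt-4"         # expensive, high-quality model (placeholder name)
WEAK_MODEL = "mixtral-8x7b"    # cheaper model (placeholder name)

def win_probability(query: str) -> float:
    """Placeholder for a learned router; a real one would score the query with a trained model."""
    return min(1.0, len(query.split()) / 100)  # toy heuristic: longer query -> harder

def route(query: str, threshold: float = 0.5) -> str:
    """Send the query to the strong model only when the router is confident it is needed."""
    return STRONG_MODEL if win_probability(query) >= threshold else WEAK_MODEL

print(route("What is 2 + 2?"))                    # -> mixtral-8x7b (simple query)
print(route("Prove the spectral theorem " * 30))  # -> gpt-4 (long, harder query)
```

Raising the threshold sends fewer queries to the strong model, trading a little quality for a larger cost reduction.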
With public data from Chatbot Arena, we trained four different routers using data augmentation techniques to significantly improve router performance.
By routing between GPT-4 and Mixtral-8x7B, we demonstrate cost reductions of over 85% on MT Bench and 45% on MMLU while achieving 95% of GPT-4's performance.
2/6
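As a rough illustration of how a cost/quality number like "95% of GPT-4 performance" can be read off a threshold sweep (the data, scores, and protocol below are synthetic, not the actual RouteLLM evaluation), one can pick the largest routing threshold that still reaches the quality target:

```python
import numpy as np

# Synthetic illustration of the cost/quality trade-off behind routing; the data,
# scores, and protocol here are made up and are not the RouteLLM evaluation.
rng = np.random.default_rng(0)
n = 500
router_score = rng.uniform(0, 1, n)           # P(strong model needed), per query
strong_quality = rng.normal(9.0, 0.5, n)      # per-query benchmark scores, strong model
weak_quality = strong_quality - 3 * rng.uniform(0, 1, n) * router_score  # weak model lags on hard queries

target = 0.95 * strong_quality.mean()         # "95% of the strong model's performance"
best = None
for threshold in np.linspace(0, 1, 101):
    to_strong = router_score >= threshold
    quality = np.where(to_strong, strong_quality, weak_quality).mean()
    if quality >= target:
        best = (threshold, to_strong.mean())  # largest threshold still meeting the target

threshold, strong_fraction = best
print(f"threshold={threshold:.2f}, strong-model calls={strong_fraction:.0%}, "
      f"i.e. ~{1 - strong_fraction:.0%} of strong-model calls avoided")
```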
We compare our router performance to commercial offerings (Martian and Unify AI) on MT Bench with strong results, achieving the same performance as these commercial routers while being over 40% cheaper.
Since Llama 3’s release, it has quickly jumped to the top of the leaderboard. We dive into our data and answer the questions below:
- What are users asking? When do users prefer Llama 3?
- How challenging are the prompts?
- Are certain users or prompts over-represented?
- Does Llama 3 have qualitative differences that make users like it?
Key Insights: 1. Llama 3 beats top-tier models on open-ended writing and creative problems but loses a bit on closed-ended math and coding problems.
2. As prompts get more challenging*, the gap between Llama 3 and top-tier models becomes larger.
* We define challenging using several criteria like complexity, problem-solving, domain knowledge, and more.
(Cont'd) We show Llama 3-70b-Instruct's win rate conditioned on hierarchical criteria subsets. Some criteria clearly separate the model's strengths from its weaknesses.
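For concreteness, here is a sketch of how a win rate conditioned on criteria subsets can be computed from pairwise battle records. The column names and rows are illustrative only, not the actual Arena battle schema.

```python
import pandas as pd

# Illustrative win-rate-by-criteria computation; columns and rows are hypothetical,
# not the actual Chatbot Arena battle schema.
battles = pd.DataFrame({
    "model_a": ["llama-3-70b-instruct"] * 6,
    "model_b": ["gpt-4"] * 6,
    "winner": ["model_a", "model_b", "model_a", "model_b", "model_b", "model_a"],
    "criteria": [
        ["creative-writing"], ["math"], ["creative-writing", "brainstorming"],
        ["coding"], ["math", "problem-solving"], ["brainstorming"],
    ],
})

# Count each battle once per criterion it satisfies, then take Llama 3's win rate
# within each criterion subset.
win_rate = (
    battles.explode("criteria")
    .assign(llama_win=lambda d: d["winner"].eq("model_a"))
    .groupby("criteria")["llama_win"]
    .mean()
    .sort_values(ascending=False)
)
print(win_rate)
```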
Introducing Arena-Hard – a pipeline to build our next generation benchmarks with live Arena data.
Highlights:
- Significantly better separability than MT-bench (22.6% -> 87.4%)
- Highest agreement with Chatbot Arena ranking (89.1%)
- Fast & cheap to run ($25)
- Frequent updates with live data
We propose using confidence intervals via bootstrapping (sketched below) to compute two metrics:
- Agreement with humans: does the benchmark agree closely with human preference?
- Separability: can the benchmark confidently separate models?
Arena-Hard scores highest on both, serving as a fast proxy for the Chatbot Arena ranking.
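A minimal sketch of the separability side of this, under simplifying assumptions (synthetic per-question scores; the exact Arena-Hard definitions are in the blog post): bootstrap a confidence interval on each model's mean score and count the model pairs whose intervals do not overlap.

```python
import numpy as np

# Sketch of "separability via bootstrapped confidence intervals": resample each
# model's per-question scores, take a CI on the mean, and count how many model
# pairs have non-overlapping CIs. Scores below are synthetic.
rng = np.random.default_rng(42)
scores = {
    "model_a": rng.normal(0.80, 0.10, 500),
    "model_b": rng.normal(0.72, 0.10, 500),
    "model_c": rng.normal(0.71, 0.10, 500),
}

def bootstrap_ci(x, n_boot=1000, alpha=0.05):
    """95% bootstrap confidence interval on the mean benchmark score."""
    means = [rng.choice(x, size=len(x), replace=True).mean() for _ in range(n_boot)]
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

cis = {model: bootstrap_ci(s) for model, s in scores.items()}

models = list(scores)
pairs = [(a, b) for i, a in enumerate(models) for b in models[i + 1:]]
separated = sum(cis[a][0] > cis[b][1] or cis[b][0] > cis[a][1] for a, b in pairs)
print(f"separability: {separated / len(pairs):.1%} of model pairs confidently ranked")
```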
How does the Arena-Hard pipeline work?
1) Input: 200K Arena user prompts
2) Topic modeling to ensure diversity
3) Key criteria (e.g., domain knowledge, problem-solving) to select high-quality topic clusters
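As a stand-in for the topic-modeling step (the real pipeline's embedding and topic-model choices may differ; TF-IDF plus k-means below is only for illustration), here is a small clustering sketch over a handful of made-up prompts:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the topic-modeling step: embed prompts and cluster them so that
# diverse topic clusters can later be filtered by quality criteria.
prompts = [
    "Write a Python function to merge two sorted lists",
    "Debug this segmentation fault in my C code",
    "Plan a 3-day itinerary for Kyoto",
    "Suggest vegetarian dinner ideas for a week",
    "Prove that the square root of 2 is irrational",
    "Explain the Central Limit Theorem with an example",
]

embeddings = TfidfVectorizer().fit_transform(prompts)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

for cluster in sorted(set(labels)):
    members = [p for p, l in zip(prompts, labels) if l == cluster]
    print(f"cluster {cluster}: {members}")
```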
Here is a summary of the relative performance of five notable models, including Alpaca and ChatGPT. We use GPT-4 to generate a set of challenging questions and then ask it to assess the chatbots’ responses.
*DISCLAIMER: This is a fun and non-scientific experiment with GPT-4.
Through careful prompt engineering, GPT-4 is able to accurately evaluate the response quality in most cases, as shown in the example below.
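Below is a minimal sketch of such a judging setup using the OpenAI Python client. The prompt wording, scoring scale, and model name are illustrative rather than the exact prompts used in this experiment.

```python
from openai import OpenAI

# Minimal sketch of GPT-4-as-judge; prompt wording and model name are illustrative,
# not the exact prompts used in the experiment.
client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are a helpful and impartial judge.
Question: {question}

Assistant A's answer: {answer_a}

Assistant B's answer: {answer_b}

Compare the two answers on helpfulness, relevance, accuracy, and level of detail.
Give each a score from 1 to 10, then explain your reasoning."""

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model to compare two chatbot responses to the same question."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge("What causes seasons on Earth?",
            "The tilt of Earth's rotational axis relative to its orbital plane.",
            "The distance between the Earth and the Sun changes a lot."))
```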