Massive provider expansion: 18 new model providers (now 30+ total!).
Also added: alpha support for the SciCode & GraphWalks benchmarks, and CLI improvements.
The most provider-agnostic eval framework just got even better.
2/ 📡 Our theme for 0.3.0 is making it super easy to run benchmarks across all models, no matter who's running it.
3/ OpenBench already had first-class support for @OpenAI, @AnthropicAI, @GroqInc, @GoogleDeepMind, and others -- and now we've expanded it to include:
– @AI21Labs, @basetenco, @CerebrasSystems, @cohere,
– @CrusoeAI, @DeepInfra, @friendliai, @huggingface,
– @hyperbolic_labs, @LambdaAPI, @Hailuo_AI, @Kimi_Moonshot,
– @nebiusai, @NousResearch, @novita_labs, @parasailnetwork, @RekaAILabs, @SambaNovaAI.
That’s 30+ providers total - all accessible in one place.
4/ To further our commitment to building in public, we've decided to add alpha benchmarks to OpenBench. These are implementations we're not yet fully confident in, and we'd love for developers to give us feedback on them before they're considered solid.
5/ To start, we've added SciCode & GraphWalks as alpha evals. But what are they?
6/ 🧬 SciCode - A new frontier for code + science
We've added SciCode, which tests models on real scientific computing problems from physics, chemistry, and biology.
Unlike HumanEval, these require domain knowledge AND coding ability - a true test of reasoning capability.
7/ Can your model navigate complex graph structures? GraphWalks tests:
- Path finding
- Connectivity analysis
- Graph traversal logic
Split into 2 variants for comprehensive evaluation - BFS and Parents.
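To make the BFS variant concrete, here's a rough Python sketch of how a ground-truth answer could be computed - this is our illustration of the task shape (edge direction and prompt format are assumptions), not the benchmark's actual harness: given an edge list buried in a long context, return every node exactly k hops from a start node.

from collections import deque

def nodes_at_bfs_depth(edges, start, k):
    # Build an adjacency list from (parent, child) edge pairs.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    # Standard BFS, recording each node's depth from `start`.
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    # The model's answer is graded against this set of nodes.
    return {n for n, d in depth.items() if d == k}

The Parents variant is roughly the inverse lookup: given a target node, list the nodes with edges pointing to it.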
8/ ⚡ More housekeeping: CLI improvements that make running evals even faster:
• openbench alias for bench so you can run evals with uvx openbench instead of having to install openbench in a venv
• -M and -T flags for quick model/task args, taken from Inspect AI's CLI
• --debug mode for troubleshooting retries and evals
• --alpha flag unlocks experimental benchmarks
Small changes, big workflow improvements.
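Put together, a run looks something like this (benchmark and model names here are just examples, and run bench --help for the exact command syntax):

uvx openbench eval mmlu --model groq/llama-3.3-70b-versatile --debug
uvx openbench eval graphwalks --model openai/gpt-4o --alpha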
9/ 📤 Direct HuggingFace integration (h/t @ben_burtenshaw)
Thanks to the great folks at @huggingface, your eval results can now be pushed directly to HuggingFace datasets - making it easier to share benchmark results with the community and track model performance over time. Just add --hf-repo to your eval run and it'll push the log files up!
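For example (the repo name here is just a placeholder):

uvx openbench eval mmlu --model openai/gpt-4o --hf-repo your-username/openbench-logs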
10/ OpenBench continues to be the most flexible, provider-agnostic eval framework.
One codebase, 30+ providers, 35+ benchmarks.
Star if you find it useful, and follow along for 0.4.0 next week! ⭐
Big coverage jump: 17 new benchmarks across math, reasoning, reading comp, health, long-context recall, plus first-class support for local evals.
We also have full OpenAI Simple-Evals parity!
2/ We've added a lot of new evaluations in 0.2.0:
- MATH + MATH-500
- MGSM (multilingual math)
- DROP (reading comprehension)
- HealthBench (medical QA)
- Humanity’s Last Exam (HLE)
- OpenAI MRCR (long-context recall)
3/ Let's take a quick look at each of the evals we've added and what they're useful for.
Evaluating large language models today is messy - every eval framework has its own way of prompting, parsing responses, and measuring accuracy. That makes apples-to-apples comparisons practically impossible. How do you know Anthropic and OpenAI evaluate MMLU the same way?
Even perfect documentation won't save you from small implementation quirks. Benchmark results end up practically irreproducible.
Introducing Eris: A Novel Evaluation Framework Using Debate Simulations
Eris pits leading AI models against each other in structured debates, assessing reasoning, knowledge, and communication skills simultaneously. 1/ 🧵
How Eris works:
- Two LLMs are assigned opposing positions on a randomly selected topic
- They engage in a full academic debate structure: constructive speeches, cross-examinations, rebuttals, and closing arguments
- A separate judge LLM (currently Claude 3.5 Sonnet) evaluates the debate on multiple criteria
- Results are aggregated across many debates to produce win rates and comparative metrics
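For the curious, here's a rough Python sketch of that loop - illustrative only (the function names and prompt wording are placeholders, not Eris's actual API):

from typing import Callable, List, Tuple

Speaker = Callable[[str], str]  # takes a prompt string, returns a completion

STAGES = ["constructive speech", "cross-examination", "rebuttal", "closing argument"]

def run_debate(pro: Speaker, con: Speaker, judge: Speaker, topic: str) -> str:
    transcript: List[Tuple[str, str, str]] = []
    for stage in STAGES:
        for side, speaker in (("PRO", pro), ("CON", con)):
            history = "\n".join(f"[{st}] {sd}: {text}" for st, sd, text in transcript)
            prompt = (f"Debate topic: {topic}\nYou are arguing the {side} side.\n"
                      f"Current stage: {stage}\nTranscript so far:\n{history}\n"
                      f"Give your {stage}.")
            transcript.append((stage, side, speaker(prompt)))
    # A separate judge model scores the finished debate on multiple criteria;
    # win rates come from aggregating many debates over random topics and pairings.
    full = "\n".join(f"[{st}] {sd}: {text}" for st, sd, text in transcript)
    return judge("Judge this debate on reasoning, knowledge, and communication:\n" + full)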
Some key insights from our initial results:
- Qwen-72B-Chat showed surprisingly competitive performance against stronger models
- Claude 3.5 Sonnet ranked higher than GPT-4o in aggregate rankings, but GPT-4o performed better in head-to-head matchups
- Some models (e.g., Mixtral-8x22B, Mistral-Large) significantly underperformed, possibly due to suboptimal prompting strategies
- Results reveal nuanced strengths and weaknesses of different models in debate contexts
Introducing Set-Eval: a novel multimodal benchmark for testing visual reasoning capabilities of large language models.
Claude 3.5 Sonnet has a score double that of GPT-4o, and both are below 15%!
More details, precise scores, and analysis below: 🧵
First, what are the rules of Set?
- 12 cards are laid out
- Each card has 4 features: color, shape, number, and shading
- A valid set is 3 cards where, for each feature, the values are either all the same or all different across the 3 cards
- No two cards can be identical
The task of the model is to identify a single valid set.
For example, a valid set for this arrangement could be:
One Green Empty Oval,
Two Purple Empty Squiggles, and
Three Red Empty Diamonds.
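The "all same or all different" rule is compact in code - here's a quick illustrative checker (ours, not part of the benchmark harness):

from itertools import combinations

FEATURES = ("color", "shape", "number", "shading")

def is_valid_set(a, b, c):
    # Each card is a dict, e.g. {"color": "green", "shape": "oval", "number": 1, "shading": "empty"}.
    # For every feature, the three values must be all identical (1 distinct value)
    # or all different (3 distinct values); exactly 2 distinct values breaks the rule.
    return all(len({card[f] for card in (a, b, c)}) != 2 for f in FEATURES)

def find_a_set(cards):
    # Brute force over all 3-card combinations of the 12 cards on the table.
    return next((trio for trio in combinations(cards, 3) if is_valid_set(*trio)), None)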
@alexalbert__ 1/ Prompt optimizer is a variation of Alex's workflow that automates the creation of test cases and prompt refinement, while still keeping humans in the loop.
@alexalbert__ 2/ It does this by:
1. Asking the user for a rough prompt
2. Automatically generating test cases for the prompt and running them
3. Getting feedback from the user on the model's responses
4. Writing an improved prompt (rough code sketch below)
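In code-shaped terms the loop looks roughly like this - the function names below are placeholders for LLM calls and a human-review step, not an actual API:

def optimize_prompt(rough_prompt, generate_tests, run_prompt, collect_feedback,
                    improve_prompt, rounds=3):
    # generate_tests / run_prompt / improve_prompt stand in for LLM calls;
    # collect_feedback is the human-in-the-loop step.
    prompt = rough_prompt
    for _ in range(rounds):
        tests = generate_tests(prompt)                      # 2. auto-generate test cases
        responses = [run_prompt(prompt, t) for t in tests]  # ...and run them
        feedback = collect_feedback(tests, responses)       # 3. human feedback on responses
        prompt = improve_prompt(prompt, feedback)           # 4. write an improved prompt
    return prompt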