Aarush Sah
Head of Evals @GroqInc, Building OpenBench
Aug 29 10 tweets 3 min read
OpenBench 0.4.0 is here!

We collaborated with @PrimeIntellect, @rootlyhq, @vercel and more on some new features for y'all. Details below 🧵 Here's your quick TL;DR:
Aug 14 10 tweets 2 min read
OpenBench v0.3.0 is live! 🚀

Massive provider expansion: 18 new model providers (now 30+ total!),

Also added: alpha support for the SciCode & GraphWalks benchmarks, and CLI improvements.

The most provider-agnostic eval framework just got even better. 2/ 📡 Our theme for 0.3.0 is making it super easy to run benchmarks across all models, no matter who's serving them.
Aug 11 11 tweets 4 min read
OpenBench v0.2.0 is here 🚀

Big coverage jump: 17 new benchmarks across math, reasoning, reading comp, health, long-context recall, plus first-class support for local evals.

We also have full OpenAI Simple-Evals parity! 2/ We've added a lot of new evaluations in 0.2.0:
- MATH + MATH-500
- MGSM (multilingual math)
- DROP (reading comprehension)
- HealthBench (medical QA)
- Humanity’s Last Exam (HLE)
- OpenAI MRCR (long-context recall)
Jul 31 9 tweets 2 min read
Introducing OpenBench 0.1: Open, Reproducible Evals 🧵 Evaluating large language models today is messy—every eval framework has its own way of prompting, parsing responses, and measuring accuracy. This makes comparisons impossible. How do you know Anthropic and OpenAI evaluate MMLU the same way?
Jul 16, 2024 8 tweets 3 min read
Introducing Eris: A Novel Evaluation Framework Using Debate Simulations

Eris pits leading AI models against each other in structured debates, assessing reasoning, knowledge, and communication skills simultaneously.
1/ 🧵 How Eris works:

- Two LLMs are assigned opposing positions on a randomly selected topic
- They engage in a full academic debate structure: constructive speeches, cross-examinations, rebuttals, and closing arguments
- A separate judge LLM (currently Claude 3.5 Sonnet) evaluates the debate on multiple criteria
- Results are aggregated across many debates to produce win rates and comparative metrics
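The aggregation step can be sketched as follows (a minimal illustration; the `(model_a, model_b, winner)` record shape is my assumption, not Eris's actual data format):

```python
from collections import defaultdict

def win_rates(debates):
    """Aggregate judged debates into per-model win rates.

    Each debate is a (model_a, model_b, winner) tuple, where
    winner is one of the two participating model names."""
    wins = defaultdict(int)
    played = defaultdict(int)
    for model_a, model_b, winner in debates:
        played[model_a] += 1
        played[model_b] += 1
        wins[winner] += 1
    # Win rate = debates won / debates participated in
    return {m: wins[m] / played[m] for m in played}
```

For example, a model that wins 3 of its 4 debates gets a 0.75 win rate, regardless of which side it argued.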
Jul 11, 2024 6 tweets 3 min read
🚨New Benchmark Alert!🚨

Introducing Set-Eval: a novel multimodal benchmark for testing visual reasoning capabilities of large language models.

Claude 3.5 Sonnet has a score double that of GPT-4o, and both are below 15%!

More details, precise scores, and analysis below: 🧵 First, what are the rules of Set?

- 12 cards are laid out
- Each card has 4 features: color, shape, number, and shading
- A valid set is 3 cards where, for each feature, the values are either all the same or all different across the 3 cards
- No two cards can be identical

The task of the model is to identify a single valid set.
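The validity rule above can be sketched in a few lines (a minimal illustration, assuming cards are represented as 4-tuples of feature values; the representation is mine, not the benchmark's):

```python
from itertools import combinations

def is_valid_set(cards):
    """Check whether 3 cards form a valid set: for each of the
    4 features, the values must be all the same or all different."""
    a, b, c = cards
    for feature in range(4):
        values = {a[feature], b[feature], c[feature]}
        # Exactly 2 distinct values means two cards match and one
        # doesn't, which violates the all-same-or-all-different rule.
        if len(values) == 2:
            return False
    return True

def find_a_set(board):
    """Return the first valid set among the laid-out cards, if any."""
    for trio in combinations(board, 3):
        if is_valid_set(trio):
            return trio
    return None
```

A card might look like `("red", "oval", 1, "solid")`; brute-forcing all C(12, 3) = 220 trios is what makes this easy for code and, apparently, hard for vision models.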
Mar 12, 2024 7 tweets 2 min read
I hacked together a quick implementation of @alexalbert__'s prompt engineering workflow! An explanation 🧵:

github.com/AarushSah/prom… 1/ Prompt optimizer is a variation of Alex's workflow that automates the creation of test cases and prompt refinement, while still keeping humans in the loop.
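The loop described above can be sketched roughly like this (purely illustrative; `generate_test_cases`, `score`, `revise_prompt`, and `human_approves` are hypothetical stand-ins for LLM and reviewer calls, not functions from the repo):

```python
def optimize_prompt(task, prompt, generate_test_cases, score,
                    revise_prompt, human_approves, max_rounds=3):
    """Human-in-the-loop prompt refinement: generate test cases for
    the task, propose a revised prompt, and adopt the revision only
    if it scores higher AND a human signs off on it."""
    tests = generate_test_cases(task)
    for _ in range(max_rounds):
        current = score(prompt, tests)
        candidate = revise_prompt(prompt, tests)
        if score(candidate, tests) > current and human_approves(candidate):
            prompt = candidate  # accept the improvement
    return prompt
```

The human-approval gate is the key design choice: the model automates the tedious parts (test generation, rewriting), but no revision ships without a person in the loop.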