Aarush Sah
Aug 29 · 10 tweets · 3 min read
OpenBench 0.4.0 is here!

We collaborated with @PrimeIntellect, @rootlyhq, @vercel, and more on a set of new features for y'all. Details below 🧵
Here's your quick TL;DR:
3/ 🤝 Prime Intellect Integration

We've partnered with @PrimeIntellect & @willccbb, making OpenBench evals directly runnable as RL environments. Define your evaluation once, seamlessly push it to Prime Intellect’s Environments Hub, and immediately begin training models tailored specifically to your tasks.

This turns evaluations into real, actionable training loops - closing the gap between testing and training.

Check it out here:
app.primeintellect.ai/dashboard/envi…
4/ 🧩 GPT-OSS Scoring (via OpenAI)

Through our partnership with @OpenAI on the GPT-OSS release, we've integrated their GPT-OSS scoring patterns into benchmarks like MMLU, GPQA Diamond, and MathArena (AIME, BRUMO, HMMT). Now you can precisely reproduce OpenAI’s evaluation methodology across all models and providers.

Accurate scoring leads to more meaningful, reliable comparisons.
5/ 🔐 CTI-Bench: AI Evaluations Ecosystem (w/ @toParkerJohnson)

We're also adding CTI-Bench, a benchmark for evaluating LLMs on cyber threat intelligence tasks, to OpenBench. This is the first step we're taking toward adding OpenBench to the AI Evaluations ecosystem that @DavidSacks, @howardlutnick, and the White House have been building for safe, transparent AI assessment.
6/ 🚒 Rootly GMCQ: Real-world SRE Benchmark

Thanks to @rootlyhq, we’ve added Rootly GMCQ to OpenBench. This benchmark is designed to test real-world SRE tasks such as incident triage, log analysis, and outage mitigation. It's also the first eval open-sourced through OpenBench in direct collaboration with its creators, @LaurenceLiang1 and @SylvainKalache!

Easily accessible via: bench eval rootly_gmcq

x.com/rootlyhq/statu…
7/ 🚀 Vercel AI Gateway Integration

We collaborated with @vercel to integrate their AI Gateway, enabling evaluations directly against any model available in their router.

Simplified access, seamless performance.
8/ 📌 Additional Enhancements

We've also included several other key improvements:

MMLU-Pro for advanced reasoning assessments (h/t @TelepathicPug)

BoolQ for binary reasoning tasks (h/t @LaurenceLiang1)

JSONSchemaBench to evaluate structured outputs (h/t @alexbowe)

BrowseComp for web navigation reasoning

Improved dependency management with >= constraints for smoother updates
9/ 🚀 We're hiring!

We're iterating quickly on OpenBench and looking to grow our team! The best way to join us is to make contributions based on our issues list: submit PRs, send me your resume on X, and let's chat :)
10/ ⭐ Get started!

If you'd like to try OpenBench or contribute, check us out and give us a star on GitHub! We're always building out new features, and if you run into any issues feel free to ping me here or file one in the repo.

github.com/groq/openbench

More from @AarushSah_

Aug 14
OpenBench v0.3.0 is live! 🚀

Massive provider expansion: 18 new model providers (now 30+ total!).

Also added: alpha support for the SciCode & GraphWalks benchmarks, and CLI improvements.

The most provider-agnostic eval framework just got even better.
2/ 📡 Our theme for 0.3.0 is making it super easy to run benchmarks across all models, no matter who's running it.
3/ OpenBench already had first-class support for @OpenAI, @AnthropicAI, @GroqInc, @GoogleDeepMind, and others -- and now we’ve expanded it to include:
– @AI21Labs, @basetenco, @CerebrasSystems, @cohere,
– @CrusoeAI, @DeepInfra, @friendliai, @huggingface,
– @hyperbolic_labs, @LambdaAPI, @Hailuo_AI, @Kimi_Moonshot,
– @nebiusai, @NousResearch, @novita_labs, @parasailnetwork, @RekaAILabs, @SambaNovaAI.
That’s 30+ providers total - all accessible in one place.
Aug 11
OpenBench v0.2.0 is here 🚀

Big coverage jump: 17 new benchmarks across math, reasoning, reading comprehension, health, and long-context recall, plus first-class support for local evals.

We also have full OpenAI Simple-Evals parity!
2/ We've added a lot of new evaluations in 0.2.0:
- MATH + MATH-500
- MGSM (multilingual math)
- DROP (reading comprehension)
- HealthBench (medical QA)
- Humanity’s Last Exam (HLE)
- OpenAI MRCR (long-context recall)
3/ Let's take a quick look at each of the evals we've added and what they're useful for.
Jul 31
Introducing OpenBench 0.1: Open, Reproducible Evals 🧵
Evaluating large language models today is messy: every eval framework has its own way of prompting, parsing responses, and measuring accuracy. This makes comparisons impossible. How do you know Anthropic and OpenAI evaluate MMLU the same way?
Even perfect documentation won't save you from small implementation quirks. Benchmark results end up practically irreproducible.
Jul 16, 2024
Introducing Eris: A Novel Evaluation Framework Using Debate Simulations

Eris pits leading AI models against each other in structured debates, assessing reasoning, knowledge, and communication skills simultaneously.
1/ 🧵
How Eris works:

- Two LLMs are assigned opposing positions on a randomly selected topic
- They engage in a full academic debate structure: constructive speeches, cross-examinations, rebuttals, and closing arguments
- A separate judge LLM (currently Claude 3.5 Sonnet) evaluates the debate on multiple criteria
- Results are aggregated across many debates to produce win rates and comparative metrics
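The aggregation step above can be sketched as a simple win-rate computation. This is an illustrative reconstruction, not Eris's actual code; the `(model_a, model_b, winner)` record format and the `win_rates` function are assumptions made for the sketch:

```python
from collections import defaultdict

def win_rates(debates):
    """debates: list of (model_a, model_b, winner) records, one per judged debate."""
    wins = defaultdict(int)
    played = defaultdict(int)
    for a, b, winner in debates:
        played[a] += 1
        played[b] += 1
        wins[winner] += 1
    # Win rate = debates won / debates participated in
    return {m: wins[m] / played[m] for m in played}
```

In practice the judge LLM's multi-criteria scores would feed into `winner` (ties and per-criterion breakdowns are omitted here for brevity).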
Some key insights from our initial results:

- Qwen-72B-Chat showed surprisingly competitive performance against stronger models
- Claude 3.5 Sonnet ranked higher than GPT-4o in aggregate rankings, but GPT-4o was more performant in head-to-head matchups
- Some models (e.g., Mixtral-8x22B, Mistral-Large) significantly underperformed, possibly due to suboptimal prompting strategies
- Results reveal nuanced strengths and weaknesses of different models in debate contexts
Jul 11, 2024
🚨New Benchmark Alert!🚨

Introducing Set-Eval: a novel multimodal benchmark for testing visual reasoning capabilities of large language models.

Claude 3.5 Sonnet has a score double that of GPT-4o, and both are below 15%!

More details, precise scores, and analysis below: 🧵
First, what are the rules of Set?

- 12 cards are laid out
- Each card has 4 features: color, shape, number, and shading
- A valid set is 3 cards where, for each feature, the values are either all the same or all different across the 3 cards
- No two cards can be identical

The task of the model is to identify a single valid set.
For example, a valid set for this arrangement could be:
One Green Empty Oval,
Two Purple Empty Squiggles, and
Three Red Empty Diamonds.
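The validity rule above is compact enough to state in code. A minimal sketch, assuming each card is encoded as a (color, shape, number, shading) tuple (this encoding is hypothetical, not Set-Eval's actual format):

```python
from itertools import combinations

def is_valid_set(c1, c2, c3):
    # For each of the 4 features, the values must be all the same (1 distinct)
    # or all different (3 distinct) across the three cards.
    return all(len({a, b, c}) in (1, 3) for a, b, c in zip(c1, c2, c3))

def find_set(cards):
    """Return the first valid set among the laid-out cards, or None."""
    for triple in combinations(cards, 3):
        if is_valid_set(*triple):
            return triple
    return None
```

The example triple above passes this check: color, shape, and number are all different, and the shading is all the same.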
Mar 12, 2024
I hacked together a quick implementation of @alexalbert__'s prompt engineering workflow! An explanation 🧵:

1/ github.com/AarushSah/prom…
1/ Prompt optimizer is a variation of Alex's workflow that automates the creation of test cases and prompt refinement, while still keeping humans in the loop.
2/ It does this by:

1. Asking the user for a rough prompt,
2. Automatically generating test cases for the prompt and running them,
3. Getting feedback regarding the model's responses from the user,
4. Writing an improved prompt based on that feedback.
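The four steps can be sketched as a loop. This is a hedged reconstruction of the described workflow, not the repo's actual code: `call_model` is a stub standing in for a real LLM API call, and `generate_cases`, `get_feedback`, and `revise` are hypothetical callables supplied by the user.

```python
def call_model(prompt, case):
    # Stub for an actual LLM completion call; replace with a real API client.
    return f"response to {case!r}"

def optimize(rough_prompt, generate_cases, get_feedback, revise, rounds=2):
    prompt = rough_prompt
    for _ in range(rounds):
        cases = generate_cases(prompt)                     # step 2: make test cases
        outputs = [call_model(prompt, c) for c in cases]   # ...and run them
        feedback = get_feedback(prompt, cases, outputs)    # step 3: human feedback
        if not feedback:                                   # user is satisfied: stop
            break
        prompt = revise(prompt, feedback)                  # step 4: improved prompt
    return prompt
```

Keeping the human in the loop is just the `get_feedback` call: an empty response ends refinement early, so the model only rewrites the prompt when the user asks for changes.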