Latest Twitter Threads by @w33lliam on Thread Reader App

Jun 23, 2025 • 11 tweets • 4 min read

Excited to share 🤯 that our LMUnit models with @ContextualAI just claimed the top spots on RewardBench2 🥇

How did we manage to rank 5% higher than models like Gemini, Claude 4, and GPT-4.1? Details below:

🧵 1/11

As a quick recap, in LMUnit, we utilize "natural language unit tests," which decompose response quality into explicit, testable criteria.

Instead of relying on opaque judgments like "pick the better response," each quality aspect becomes a specific question that humans can answer consistently.

2/11
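
To make the idea of decomposing quality into testable criteria concrete, here is a minimal Python sketch. It is not LMUnit's actual API: `UnitTest`, `evaluate`, and `toy_scorer` are hypothetical names, the weighted average is an assumed aggregation, and the scorer is a hand-written stand-in for whatever model or human would answer each question.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UnitTest:
    """One natural language criterion (hypothetical structure, not LMUnit's)."""
    question: str
    weight: float = 1.0


# A scorer answers one question about one response with a score in [0, 1].
Scorer = Callable[[str, str, str], float]


def evaluate(prompt: str, response: str, tests: List[UnitTest], scorer: Scorer) -> float:
    """Weighted average of per-criterion scores (an assumed aggregation)."""
    if not tests:
        raise ValueError("at least one unit test is required")
    total_weight = sum(t.weight for t in tests)
    weighted = sum(t.weight * scorer(prompt, response, t.question) for t in tests)
    return weighted / total_weight


TESTS = [
    UnitTest("Does the response state the capital of France?"),
    UnitTest("Is the response at most two sentences long?"),
]


def toy_scorer(prompt: str, response: str, question: str) -> float:
    """Toy stand-in for a judge: answers each known question with a string check."""
    checks = {
        TESTS[0].question: "paris" in response.lower(),
        TESTS[1].question: response.count(".") <= 2,
    }
    return 1.0 if checks[question] else 0.0


if __name__ == "__main__":
    prompt = "What is the capital of France?"
    for response in ("The capital of France is Paris.", "Lyon."):
        print(response, evaluate(prompt, response, TESTS, toy_scorer))
```

Each criterion gets its own score, so a low overall number can be traced back to the specific question that failed rather than to a single opaque preference.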
