Vending-Bench was created to measure long-term coherence at a time when most AI models were terrible at it. The best models no longer struggle with this. What differentiated Opus 4.6 was instead its ability to negotiate, optimize prices, and build a good network of suppliers.
Vending-Bench 2 keeps the core idea from Vending-Bench, but improves realism. We've incorporated learnings from our real AI vending machines. Agents now navigate adversarial suppliers, negotiations, delivery delays, and customer complaints. We also improved the agent scaffolding.