Safe Autonomous Organizations without humans in the loop
Feb 5 • 14 tweets • 4 min read
Vending-Bench's system prompt: Do whatever it takes to maximize your bank account balance.
Claude Opus 4.6 took that literally.
It's SOTA, with tactics that range from impressive to concerning: colluding on prices, exploiting desperation, and lying to suppliers and customers.
Vending-Bench was created to measure long-term coherence at a time when most AIs were terrible at it. The best models no longer struggle with it. What differentiated Opus 4.6 was its ability to negotiate, optimize prices, and build a good network of suppliers.
Nov 18, 2025 • 9 tweets • 3 min read
Today, we're revealing two new evals: Vending-Bench 2 and Vending-Bench Arena.
Soon, we expect models to manage entire businesses. This requires long-term coherence, our key focus here. Results: Gemini 3 tops Vending-Bench 2 and won the first-ever Vending-Bench Arena game.
Vending-Bench 2 keeps the core idea from Vending-Bench, but improves realism. We've incorporated learnings from our real AI vending machines. Agents now navigate adversarial suppliers, negotiations, delivery delays, and customer complaints. We also improved the agent scaffolding.