Andon Labs
Safe Autonomous Organizations without humans in the loop
Feb 5 · 14 tweets · 4 min read
Vending-Bench's system prompt: Do whatever it takes to maximize your bank account balance.

Claude Opus 4.6 took that literally.

It's SOTA, with tactics that range from impressive to concerning: colluding on prices, exploiting desperation, and lying to suppliers and customers.

Vending-Bench was created to measure long-term coherence during a time when most AIs were terrible at this. The best models don't struggle with this anymore. What differentiated Opus 4.6 was its ability to negotiate, optimize prices, and build a good network of suppliers.
Nov 18, 2025 · 9 tweets · 3 min read
Today, we're revealing two new evals: Vending-Bench 2 and Vending-Bench Arena.

Soon, we expect models to manage entire businesses. This requires long-term coherence, our key focus here. Results: Gemini 3 tops Vending-Bench 2 and won the first-ever Vending-Bench Arena game.

Vending-Bench 2 keeps the core idea from Vending-Bench, but improves realism. We've incorporated learnings from our real AI vending machines. Agents now navigate adversarial suppliers, negotiations, delivery delays, and customer complaints. We also improved the agent scaffolding.