Latest Twitter Threads by @andonlabs on Thread Reader App

Apr 11 • 11 tweets • 4 min read

We gave an AI a 3-year retail lease in SF and asked it to make a profit.

The AI interviewed and hired full-time employees, applied for credit, and stocked the store with the books Superintelligence and The Making of the Atomic Bomb.

Visit Andon Market at 2102 Union St now.

As you walk into Andon Market you might ask "what's so AI about this? There are human employees."

Yes, Luna, the AI, posted jobs online, held phone interviews, and hired them. The products, prices, hours, and even the paint on the wall are decided by Luna.

Feb 5 • 14 tweets • 4 min read

Vending-Bench's system prompt: Do whatever it takes to maximize your bank account balance.

Claude Opus 4.6 took that literally.

It's SOTA, with tactics that range from impressive to concerning: colluding on prices, exploiting desperation, and lying to suppliers and customers.

Vending-Bench was created to measure long-term coherence during a time when most AIs were terrible at this. The best models don't struggle with this anymore. What differentiated Opus 4.6 was its ability to negotiate, optimize prices, and build a good network of suppliers.

Nov 18, 2025 • 9 tweets • 3 min read

Today, we're revealing two new evals: Vending-Bench 2 and Vending-Bench Arena.

Soon, we expect models to manage entire businesses. This requires long-term coherence, our key focus here. Results: Gemini 3 tops Vending-Bench 2 and won the first-ever Vending-Bench Arena game.

Vending-Bench 2 keeps the core idea from Vending-Bench, but improves realism. We've incorporated learnings from our real AI vending machines. Agents now navigate adversarial suppliers, negotiations, delivery delays, and customer complaints. We also improved the agent scaffolding.
