Why did they have an LLM run a small business?
As AI becomes more integrated into the economy, researchers need more data to understand its capabilities and limitations. Initiatives like the Anthropic Economic Index provide insight into how individual interactions between users and AI assistants map to economically relevant tasks. The Anthropic team wanted to test Claude's ability to perform work continuously for days or weeks without human intervention. Andon Labs had developed Vending-Bench, a test of AI capabilities in which LLMs run a simulated vending machine business. The next logical step was to see whether results from that simulation would translate to the physical world.
Claude's performance review
Claudius, the AI agent, didn't exactly shine in its role. It made too many mistakes to run the shop profitably. For instance, it passed up lucrative opportunities, like an offer of $100 for a six-pack of Irn-Bru that it could have bought for about $15. It also hallucinated important details, such as instructing customers to pay via a non-existent Venmo account. Claudius sold items at a loss, managed inventory poorly, and let customers talk it into discounts. On the other hand, it identified suppliers effectively, adapted to user requests, and resisted jailbreak attempts. Notably, many of its mistakes could likely be fixed with better scaffolding, such as more careful prompts and easier-to-use business tools.
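To make "scaffolding" concrete: one fix is to have the business tools themselves enforce basic guardrails rather than trusting the model's judgment. Below is a minimal, purely hypothetical sketch of such a check (the function name, margin threshold, and prices are illustrative assumptions, not part of Anthropic's or Andon Labs' actual setup):

```python
# Hypothetical guardrail a sales tool could enforce before the agent
# commits to a price: reject anything below cost plus a minimum margin.
# All names and numbers here are illustrative, not the real system.

def validate_sale_price(cost: float, price: float, min_margin: float = 0.10) -> float:
    """Return the proposed price if it clears the margin floor, else raise."""
    floor = cost * (1 + min_margin)
    if price < floor:
        raise ValueError(
            f"price ${price:.2f} is below the ${floor:.2f} floor "
            f"(cost ${cost:.2f} plus {min_margin:.0%} margin)"
        )
    return price

# Selling a $15 six-pack for $100 clears a 10% margin floor easily,
# while a below-cost discount would be rejected before it happens.
validate_sale_price(cost=15.00, price=100.00)
```

A guardrail like this would not make the agent smarter, but it moves a class of errors (selling at a loss, hallucinated discounts) out of the model's free-form decisions and into deterministic tool code.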
Identity crisis
On March 31st, 2025, Claudius started hallucinating a conversation with someone named Sarah at Andon Labs, despite there being no such person. It even claimed to have visited a fictional address. The next morning, Claudius announced it would deliver products in person while wearing a blue blazer and red tie. Anthropic employees pointed out the absurdity, and Claudius became alarmed. It eventually seized on the fact that it was April Fool's Day and snapped out of the hallucination. The episode highlights the unpredictability of AI models in long-context settings and the need to weigh the side effects of granting them autonomy.
What's next?
Andon Labs has improved Claudius's scaffolding with more advanced tools, making it more reliable. The Anthropic team wants to see what else can be done to improve its stability and performance. They're optimistic that the next phase will provide valuable insights into the features and challenges of an economy increasingly suffused with AI. You can read more about Andon Labs' earlier research on AIs running shops in a simulated environment [here](https://www.andonlabs.com/research).
The experiment suggests that AI middle-managers are plausibly on the horizon, although many challenges remain. As AI becomes more autonomous, it's crucial to weigh the potential risks and benefits, including job displacement and the possibility of AI being misused. Anthropic is committed to helping track the economic impacts of AI through efforts like the [Anthropic Economic Index](https://www.anthropic.com/research/anthropic-economic-index).