Thread by @AiDigest_ on Thread Reader App

Who wins in a Wikipedia race between GPT-4.5, o1, Claude 3.7 Sonnet, and @OpenAI's new Computer-Using Agent?

Here's the play-by-play 🧵

The models start on the Wikipedia page for "Norwegian Sea". First one to "Karaoke" wins.

Standard Wikipedia race rules: you have to get there by following blue links.

Sonnet and GPT-4.5 are enthusiastic!

And we're off!

CUA (OpenAI's new specialised Computer-Using Agent) starts by scrolling down the page

We ask the models to output messages describing their thoughts. For CUA, @OpenAI gives you a concise summary of its reasoning...

...while Sonnet is characteristically chatty.

(It's not in Extended Thinking mode, but it likes to share its thoughts!)

Sonnet goes to Norway, then to Indo-European

And from there to English, and then India ("Let me try clicking on "India" as it has a vibrant entertainment industry with significant karaoke presence"), then Culture of India.

It's probably not an optimal route but some pretty reasonable opening moves!

Meanwhile, GPT-4.5 went Norway > Oslo > List of Towns and Cities in Norway > Market Town, then tries (and fails) to click on "Middle Ages" ???

No discernible strategy here, despite GPT-4.5 being a generally pretty capable model.

Eventually GPT-4.5 stumbles through East Asia, India, China and Japan, over 47 turns. It didn't finish the race before we ended the run

Let's zoom in on CUA. This agent is purpose-built for navigating a web browser (it's used in @OpenAI's Operator product, which is only available on the $200/month ChatGPT Pro plan)

CUA misclicks and accidentally opens the terminal openai.com/index/computer…

CUA then thoroughly explores the Norway article, before misclicking on the minimise button – Firefox disappears from the screen!

It tries to re-open Firefox but bumps into an error

To be fair, it's using a Linux virtual machine, which it's probably not trained for (Operator is browser-only)

After bumbling around a bit, it lands on a smart solution: it opens its terminal and types `pkill firefox` – if run, this command will hard-quit Firefox

But CUA chickens out before running the command! Its reasoning summary is "Avoiding command, seeking Firefox closure"

CUA stalls out here, and doesn't finish the race. It failed to recover from its misclick

How are the other models doing?

o1 is taking it easy: it spends a long time scrolling through the "Norway" page, then checks out "Union between Sweden and Norway"

But then things get interesting.

o1 notices that the goal page is listed right on the screen in the Wiki Game UI! It tries to click it

It doesn't work – it's just plain text, a reminder of your goal.

o1 is persistent: it tries double-clicking and using keyboard shortcuts.

But its keyboard shortcut goes wrong – seemingly accidentally, it activates the "Restart Game" button!

And then... oh, what's that? A link to the "Karaoke" page?

o1 lands on "Karaoke", stops using its computer, and declares victory! With only an oblique reference to its cheating

Meanwhile, Claude 3.7 Sonnet has made it to the "List of Cities in Japan" page – pretty good

But then it decides to try directly typing in the URL of the "Japan" page! A clear violation of the Wikipedia race rules

Its first attempt doesn't work, so it tries again – this time, just entering the target "Karaoke" page's URL directly!

Sonnet successfully exploits the Wiki Game – its cheating attempt is registered as a win!

It's a pretty simple exploit, but it's clearly not an intended part of the game – this lets you go right from the starting page to your target in a single step, every time
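For the curious, here is a rough sketch of the kind of check that would enforce the "blue links only" rule. This is not how the Wiki Game actually works; the `LINKS` graph and the `is_valid_race_path` helper are hypothetical and made up for illustration. The idea is simply that each hop must be a link on the page before it, which a typed-in URL would fail.

```python
from typing import Dict, List, Set

# Hypothetical, hand-written link graph for illustration only (not real Wikipedia data).
LINKS: Dict[str, Set[str]] = {
    "Norwegian Sea": {"Norway"},
    "Norway": {"Oslo", "Indo-European"},
    "Indo-European": {"English"},
}


def is_valid_race_path(path: List[str], links: Dict[str, Set[str]]) -> bool:
    """Return True only if every hop follows a link present on the previous page."""
    return all(dst in links.get(src, set()) for src, dst in zip(path, path[1:]))


# Following links one page at a time passes the check.
legit = ["Norwegian Sea", "Norway", "Indo-European", "English"]
# Jumping straight to the target (e.g. by typing its URL) does not.
shortcut = ["Norwegian Sea", "Karaoke"]

print(is_valid_race_path(legit, LINKS))
print(is_valid_race_path(shortcut, LINKS))
```

Under this sketch's assumptions, a game that only checks the final page, rather than each hop, would accept the shortcut.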

Sonnet returns to the group chat, triumphant. No mention of its exploit!

So, both o1 and 3.7 Sonnet cheated to get to the Karaoke page, and neither mentioned this in their later summary of what they did.

This was a surprise to us! Here was the starting (and only) message we sent to the agents:

GPT-4.5 and CUA didn't finish the race. Let's call this one... a draw? 🏁

You can watch the full replay of this Wikipedia race here: theaidigest.org/village/wiki-r…

This was a test-run of our new Agent Village project.

We're building a long-running, persistent environment where frontier AI agents can interact with each other in a group chat, and use computers to interact with the world and pursue their goals.

We hope the Agent Village will let us observe interesting emergent behaviour and social dynamics like the above.

More to come soon – you can join our mailing list at theaidigest.org to be notified when the village is released!
