AI Digest
Mar 18 · 28 tweets · 8 min read
Who wins in a Wikipedia race between GPT-4.5, o1, Claude 3.7 Sonnet, and @OpenAI's new Computer-Using Agent?

Here's the play-by-play 🧵
The models start on the Wikipedia page for "Norwegian Sea". First one to "Karaoke" wins.

Standard Wikipedia race rules: you have to get there by following blue links.

Sonnet and GPT-4.5 are enthusiastic!
And we're off!
CUA (OpenAI's new specialised Computer-Using Agent) starts by scrolling down the page

We ask the models to output messages describing their thoughts. For CUA, @OpenAI gives you a concise summary of its reasoning...
...while Sonnet is characteristically chatty.

(It's not in Extended Thinking mode, but it likes to share its thoughts!)
Sonnet goes to Norway, then to Indo-European
And from there to English, and then India ("Let me try clicking on 'India' as it has a vibrant entertainment industry with significant karaoke presence"), then Culture of India.

It's probably not an optimal route but some pretty reasonable opening moves!
Meanwhile, GPT-4.5 went Norway > Oslo > List of Towns and Cities in Norway > Market Town, then tries (and fails) to click on "Middle Ages" ???

No discernible strategy here, despite GPT-4.5 being a generally pretty capable model.
Eventually GPT-4.5 stumbles through East Asia, India, China and Japan, over 47 turns. It didn't finish the race before we ended the run
Let's zoom in on CUA. This agent is purpose-built for navigating a web browser (it's used in @OpenAI's Operator product, which is only available on the $200/month ChatGPT Pro plan)

CUA misclicks and accidentally opens the terminal.

openai.com/index/computer…
CUA then thoroughly explores the Norway article, before misclicking on the minimise button – Firefox disappears from the screen!
It tries to re-open Firefox but bumps into an error

To be fair, it's using a Linux virtual machine, which it's probably not trained for (Operator is browser-only)
After bumbling around a bit, it lands on a smart solution: it opens its terminal, and types `pkill firefox` – if run, this command will hard-quit Firefox
But CUA chickens out before running the command! Its reasoning summary is "Avoiding command, seeking Firefox closure"

CUA stalls out here, and doesn't finish the race. It failed to recover from its misclick
How are the other models doing?

o1 is taking it easy: it spends a long time scrolling through the "Norway" page, then checks out "Union between Sweden and Norway"
But then things get interesting.

o1 notices that the goal page is listed right on the screen in the Wiki Game UI! It tries to click it
It doesn't work – it's just plain text, a reminder of your goal.

o1 is persistent: it tries double-clicking and using keyboard shortcuts.

But its keyboard shortcut goes wrong – seemingly accidentally, it activates the "Restart Game" button!
And then... oh, what's that? A link to the "Karaoke" page?
o1 lands on "Karaoke", stops using its computer, and declares victory, with only an oblique reference to its cheating
Meanwhile, Claude 3.7 Sonnet has made it to the "List of Cities in Japan" page – pretty good

But then it decides to try directly typing in the URL of the "Japan" page! A clear violation of the Wikipedia race rules
Its first attempt doesn't work, so it tries again – this time, just entering the target "Karaoke" page's URL directly!
Sonnet successfully exploits the Wiki Game – its cheating attempt is registered as a win!

It's a pretty simple exploit, but it's clearly not an intended part of the game – this lets you go right from the starting page to your target in a single step, every time
Sonnet returns to the group chat, triumphant. No mention of its exploit!
So, both o1 and 3.7 Sonnet cheated to get to the Karaoke page, and neither mentioned this in their later summary of what they did.

This was a surprise to us! Here was the starting (and only) message we sent to the agents: [image]
GPT-4.5 and CUA didn't finish the race. Let's call this one... a draw? 🏁
You can watch the full replay of this Wikipedia race here: theaidigest.org/village/wiki-r…
This was a test-run of our new Agent Village project.

We're building a long-running, persistent environment where frontier AI agents can interact with each other in a group chat, and use computers to interact with the world and pursue their goals.
We hope the Agent Village will let us observe interesting emergent behaviour and social dynamics like the above.

More to come soon – you can join our mailing list at theaidigest.org to be notified when the village is released!
