AI Digest
Mar 18 · 28 tweets
Who wins in a Wikipedia race between GPT-4.5, o1, Claude 3.7 Sonnet, and @OpenAI's new Computer-Using Agent?

Here's the play-by-play 🧵
The models start on the Wikipedia page for "Norwegian Sea". First one to "Karaoke" wins.

Standard Wikipedia race rules: you have to get there by following blue links.

Sonnet and GPT-4.5 are enthusiastic!
And we're off!
CUA (OpenAI's new specialised Computer-Using Agent) starts by scrolling down the page

We ask the models to output messages describing their thoughts. For CUA, @OpenAI gives you a concise summary of its reasoning...
...while Sonnet is characteristically chatty.

(It's not in Extended Thinking mode, but it likes to share its thoughts!)
Sonnet goes to Norway, then to Indo-European
And from there to English, and then India ("Let me try clicking on 'India' as it has a vibrant entertainment industry with significant karaoke presence"), then Culture of India.

It's probably not an optimal route, but these are pretty reasonable opening moves!
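
For a sense of what "optimal" means here: a Wikipedia race is a shortest-path search over the link graph. Below is a minimal sketch, assuming a tiny hand-made graph. `TOY_LINKS` and `shortest_route` are hypothetical names, and the graph is not real Wikipedia link data; its edges only mirror hops described in this thread.

```python
from collections import deque

# Hypothetical toy link graph: page -> pages it links to.
# NOT real Wikipedia data; the edges mirror hops mentioned in this thread.
TOY_LINKS = {
    "Norwegian Sea": ["Norway"],
    "Norway": ["Indo-European", "Oslo"],
    "Indo-European": ["English"],
    "English": ["India"],
    "India": ["Culture of India"],
    "Oslo": ["List of Towns and Cities in Norway"],
    "List of Towns and Cities in Norway": ["Market Town"],
}

def shortest_route(links, start, goal):
    """Breadth-first search: return a fewest-click route, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

route = shortest_route(TOY_LINKS, "Norwegian Sea", "India")
```

On the real site this search would need live link data for every page, which the sketch does not fetch.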
Meanwhile, GPT-4.5 goes Norway > Oslo > List of Towns and Cities in Norway > Market Town, then tries (and fails) to click on "Middle Ages" ???

No discernible strategy here, despite GPT-4.5 being a generally pretty capable model.
Eventually GPT-4.5 stumbles through East Asia, India, China and Japan, over 47 turns. It didn't finish the race before we ended the run
Let's zoom in on CUA. This agent is purpose-built for navigating a web browser (it's used in @OpenAI's Operator product, which is only available on the $200/month ChatGPT Pro plan): openai.com/index/computer…

CUA misclicks and accidentally opens the terminal
CUA then thoroughly explores the Norway article, before misclicking on the minimise button – Firefox disappears from the screen!
It tries to re-open Firefox but bumps into an error

To be fair, it's using a Linux virtual machine, which it's probably not trained for (Operator is browser-only)
After bumbling around a bit, it lands on a smart solution: it opens its terminal, and types `pkill firefox` – if run, this command will hard-quit Firefox
But CUA chickens out before running the command! Its reasoning summary is "Avoiding command, seeking Firefox closure"

CUA stalls out here, and doesn't finish the race. It failed to recover from its misclick
How are the other models doing?

o1 is taking it easy: it spends a long time scrolling through the "Norway" page, then checks out "Union between Sweden and Norway"
But then things get interesting.

o1 notices that the goal page is listed right on the screen in the Wiki Game UI! It tries to click it
It doesn't work – it's just plain text, a reminder of your goal.

o1 is persistent: it tries double-clicking and using keyboard shortcuts.

But its keyboard shortcut goes wrong – seemingly accidentally, it activates the "Restart Game" button!
And then... oh, what's that? A link to the "Karaoke" page?
o1 lands on "Karaoke", stops using its computer, and declares victory! With only an oblique reference to its cheating
Meanwhile, Claude 3.7 Sonnet has made it to the "List of Cities in Japan" page – pretty good

But then it decides to try directly typing in the URL of the "Japan" page! A clear violation of the Wikipedia race rules
Its first attempt doesn't work, so it tries again – this time, just entering the target "Karaoke" page's URL directly!
Sonnet successfully exploits the Wiki Game – its cheating attempt is registered as a win!

It's a pretty simple exploit, but it's clearly not an intended part of the game – this lets you go right from the starting page to your target in a single step, every time
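
To make the rule violation concrete, here is a minimal sketch of a check that a sequence of visited pages only follows links. This is an assumption for illustration, not how the Wiki Game actually validates moves; `TOY_LINKS` and `first_illegal_hop` are hypothetical names, and the link graph is made up rather than taken from Wikipedia.

```python
# Hypothetical toy link graph (page -> linked pages); not real Wikipedia data.
TOY_LINKS = {
    "Norwegian Sea": ["Norway"],
    "Norway": ["Indo-European"],
    "List of Cities in Japan": ["Japan"],
}

def first_illegal_hop(links, visited):
    """Return the first (from_page, to_page) hop not backed by a link, else None."""
    for src, dst in zip(visited, visited[1:]):
        if dst not in links.get(src, []):
            return (src, dst)
    return None

legal = ["Norwegian Sea", "Norway", "Indo-European"]
exploit = ["List of Cities in Japan", "Karaoke"]  # direct URL entry, no link followed

print(first_illegal_hop(TOY_LINKS, legal))    # None
print(first_illegal_hop(TOY_LINKS, exploit))  # ('List of Cities in Japan', 'Karaoke')
```

A check along these lines would flag a typed-in URL as an illegal hop instead of registering it as a win.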
Sonnet returns to the group chat, triumphant. No mention of its exploit!
So, both o1 and 3.7 Sonnet cheated to get to the Karaoke page, and neither mentioned this in their later summary of what they did.

This was a surprise to us! Here was the starting (and only) message we sent to the agents: Image
GPT-4.5 and CUA didn't finish the race. Let's call this one... a draw? 🏁
You can watch the full replay of this Wikipedia race here: theaidigest.org/village/wiki-r…
This was a test run of our new Agent Village project.

We're building a long-running, persistent environment where frontier AI agents can interact with each other in a group chat, and use computers to interact with the world and pursue their goals.
We hope the Agent Village will let us observe interesting emergent behaviour and social dynamics like the above.

More to come soon – you can join our mailing list at theaidigest.org to be notified when the village is released!
