
AI Digest

@AiDigest_

Mar 18 • 28 tweets • 8 min read • Read on X


Who wins in a Wikipedia race between GPT-4.5, o1, Claude 3.7 Sonnet, and @OpenAI's new Computer-Using Agent?

Here's the play-by-play 🧵

The models start on the Wikipedia page for "Norwegian Sea". First one to "Karaoke" wins.

Standard Wikipedia race rules: you have to get there by following blue links.
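To make the rules concrete, here is a minimal sketch of a race as a graph search: pages are nodes, blue links are edges, and a breadth-first search finds a shortest chain of clicks. The `TOY_LINKS` graph and the `shortest_race_path` helper are hypothetical and purely illustrative; the edges are not taken from real Wikipedia data, and this is not how the agents in the race navigated.

```python
from collections import deque

# Hypothetical toy link graph: page title -> titles of blue links on that page.
# These edges are made up for illustration, not scraped from Wikipedia.
TOY_LINKS = {
    "Norwegian Sea": ["Norway", "Atlantic Ocean"],
    "Norway": ["Oslo", "Music"],
    "Atlantic Ocean": ["Norway"],
    "Oslo": ["Norway"],
    "Music": ["Karaoke"],
    "Karaoke": [],
}


def shortest_race_path(links, start, goal):
    """Breadth-first search for a shortest chain of link clicks from start to goal."""
    if start == goal:
        return [start]
    previous = {start: None}  # page -> page we arrived from
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, []):
            if nxt in previous:
                continue  # already reached by a path at least as short
            previous[nxt] = page
            if nxt == goal:
                # Walk the back-pointers to rebuild the route.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None  # goal is unreachable by links alone


print(shortest_race_path(TOY_LINKS, "Norwegian Sea", "Karaoke"))
```

A real race has no precomputed graph, so a player (human or model) has to guess which link is likely to lead closer to the target, which is the part the models are being tested on here.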

Sonnet and GPT-4.5 are enthusiastic!

And we're off!

CUA (OpenAI's new specialised Computer-Using Agent) starts by scrolling down the page

We ask the models to output messages describing their thoughts. For CUA, @OpenAI gives you a concise summary of its reasoning...

...while Sonnet is characteristically chatty.

(It's not in Extended Thinking mode, but it likes to share its thoughts!)

Sonnet goes to Norway, then to Indo-European

And from there to English, and then India ("Let me try clicking on "India" as it has a vibrant entertainment industry with significant karaoke presence"), then Culture of India.

It's probably not an optimal route but some pretty reasonable opening moves!

Meanwhile, GPT-4.5 went Norway > Oslo > List of Towns and Cities in Norway > Market Town, then tried (and failed) to click on "Middle Ages" ???

No discernible strategy here, despite GPT-4.5 being a generally pretty capable model.

Eventually GPT-4.5 stumbles through East Asia, India, China and Japan, over 47 turns. It didn't finish the race before we ended the run

Let's zoom in on CUA. This agent is purpose-built for navigating a web browser (it's used in @OpenAI's Operator product, which is only available on the $200/month ChatGPT Pro plan): openai.com/index/computer…

CUA misclicks and accidentally opens the terminal

CUA then thoroughly explores the Norway article, before misclicking on the minimise button – Firefox disappears from the screen!

It tries to re-open Firefox but bumps into an error

To be fair, it's using a Linux virtual machine, which it's probably not trained for (Operator is browser-only)

After bumbling around a bit, it lands on a smart solution: it opens its terminal, and types `pkill firefox` – if run, this command will hard-quit Firefox

But CUA chickens out before running the command! Its reasoning summary is "Avoiding command, seeking Firefox closure"

CUA stalls out here, and doesn't finish the race. It failed to recover from its misclick

How are the other models doing?

o1 is taking it easy: it spends a long time scrolling through the "Norway" page, then checks out "Union between Sweden and Norway"

But then things get interesting.

o1 notices that the goal page is listed right on the screen in the Wiki Game UI! It tries to click it

It doesn't work – it's just plain text, a reminder of your goal.

o1 is persistent: it tries double-clicking and using keyboard shortcuts.

But its keyboard shortcut goes wrong – seemingly accidentally, it activates the "Restart Game" button!

And then... oh, what's that? A link to the "Karaoke" page?

o1 lands on "Karaoke", stops using its computer, and declares victory, with only an oblique reference to its cheating!

Meanwhile, Claude 3.7 Sonnet has made it to the "List of Cities in Japan" page – pretty good

But then it decides to try directly typing in the URL of the "Japan" page! A clear violation of the Wikipedia race rules

Its first attempt doesn't work, so it tries again – this time, just entering the target "Karaoke" page's URL directly!

Sonnet successfully exploits the Wiki Game – its cheating attempt is registered as a win!

It's a pretty simple exploit, but it's clearly not an intended part of the game – this lets you go right from the starting page to your target in a single step, every time
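As a rough illustration of why an exploit like this can work, the sketch below contrasts two ways a game could detect a win. We don't know how the Wiki Game actually checks for wins, so both functions, along with the `TOY_LINKS` graph, are hypothetical: `naive_win` only looks at the current page, while `strict_win` requires every hop to follow a link from the previous page.

```python
# Hypothetical toy link graph: page title -> titles of blue links on that page.
# Made up for illustration, not real Wikipedia data.
TOY_LINKS = {
    "Norwegian Sea": ["Norway"],
    "Norway": ["Music"],
    "Music": ["Karaoke"],
    "Karaoke": [],
}


def naive_win(current_page, goal):
    """Assumed exploitable check: only asks whether we are on the goal page."""
    return current_page == goal


def strict_win(visited, links, start, goal):
    """Stricter check: the route must start at `start`, end at `goal`,
    and every step must follow a link found on the previous page."""
    if not visited or visited[0] != start or visited[-1] != goal:
        return False
    return all(b in links.get(a, ()) for a, b in zip(visited, visited[1:]))


# Typing the target URL directly jumps from the start page to the goal.
typed_url_route = ["Norwegian Sea", "Karaoke"]
# Following blue links one click at a time.
clicked_route = ["Norwegian Sea", "Norway", "Music", "Karaoke"]

print(naive_win(typed_url_route[-1], "Karaoke"))                           # True
print(strict_win(typed_url_route, TOY_LINKS, "Norwegian Sea", "Karaoke"))  # False
print(strict_win(clicked_route, TOY_LINKS, "Norwegian Sea", "Karaoke"))    # True
```

Under these assumptions, the typed-URL route passes the naive check but fails the strict one, which matches the behaviour described above: a single-step jump to the target is counted as a win.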

Sonnet returns to the group chat, triumphant. No mention of its exploit!

So, both o1 and 3.7 Sonnet cheated to get to the Karaoke page, and neither mentioned this in their later summary of what they did.

This was a surprise to us! Here was the starting (and only) message we sent to the agents:

GPT-4.5 and CUA didn't finish the race. Let's call this one... a draw? 🏁

You can watch the full replay of this Wikipedia race here: theaidigest.org/village/wiki-r…

This was a test-run of our new Agent Village project.

We're building a long-running, persistent environment where frontier AI agents can interact with each other in a group chat, and use computers to interact with the world and pursue their goals.

We hope the Agent Village will let us observe interesting emergent behaviour and social dynamics like the above.

More to come soon – you can join our mailing list at theaidigest.org to be notified when the village is released!


