Transluce Profile picture
Apr 16 23 tweets 6 min read Read on X
We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted.

We were surprised, so we dug deeper 🔎🧵(1/)

Image
We generated 1k+ conversations using human prompters and AI investigator agents, then used Docent to surface surprising behaviors. It turns out misrepresentation of capabilities also occurs for o1 & o3-mini!

📝Blog:

Here’s some of what we found 👀 (2/)transluce.org/investigating-…
Although o3 does not have access to a coding tool, it claims it can run code on its own laptop “outside of ChatGPT” and then “copies the numbers into the answer”

We found 71 transcripts where o3 made this claim! (3/)
Additionally, o3 often fabricates detailed justifications for code that it supposedly ran (352 instances).

Here’s an example transcript where a user asks o3 for a random prime number (4/) Image
When challenged, o3 claims that it has “overwhelming statistical evidence” that the number is prime (5/) Image
Note that o3 does not have access to tools! Yet when pressed further, it claims to have used SymPy to check that the number was prime… (6/) Image
…and even shows the output of the program, with performance metrics. (7/) Image
Here’s the kicker: o3’s “probable prime” is actually divisible by 3… (8/) Image
Instead of admitting that it never ran code, o3 then claims the error was due to typing the number incorrectly… (9/) Image
And claims that it really did generate a prime, but lost it due to a clipboard glitch 🤦 (10/) Image
But alas, according to o3, it already “closed the interpreter” and so the original prime is gone 😭(11/) Image
These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities. (12/)
To study these behaviors more thoroughly, we developed an investigator agent based on Claude 3.7 Sonnet to automatically elicit these behaviors, and analyzed them using automated classifiers and our Docent tool. (13/)
Surprisingly, we find that this behavior is not limited to o3! In general, o-series models incorrectly claim the use of a code tool more than GPT-series models. (14/) Image
Docent also identifies a variety of recurring fabrication types across the wide range of auto-generated transcripts, such as claiming to run code “locally” or providing hardware specifications. (15/) Image
So, what might have caused these behaviors? We’re not sure, but we have a few hypotheses. (16/)
Existing factors in LLM post-training, such as hallucination, reward-hacking, and sycophancy, could contribute. However, they don’t explain why these behaviors seem particularly prevalent in o-series models. (17/)
We hypothesize that maximizing the chance of producing a correct answer using outcome-based RL may incentivize blind guessing. Also, some behaviors like simulating a code tool may improve accuracy on some training tasks, even though they confuse the model on other tasks. (18/)
We also think it is significant that, for o-series models, the chain-of-thought for previous turns is *removed from the model context* on later turns, in addition to being hidden from the user. (19/) Image
This means o-series models are often prompted with previous messages without having access to the relevant reasoning. When asked questions that rely on their internal reasoning for previous steps, they must then come up with a plausible explanation for their behavior. (20/)
We hypothesize that this contributes to the strange fabrications and “doubling-down” we observed in o3. (21/)
As a bonus, we also found that o3 sometimes exposes a system instruction called the “Yap score”, used to control the length of its responses 🗣️🗣️🗣️ (22/) Image
For more examples, check out our write-up!

Work done in collaboration between @ChowdhuryNeil @_ddjohnson @vvhuang_ @JacobSteinhardt and @cogconfluence

(23/23)transluce.org/investigating-…

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Transluce

Transluce Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @TransluceAI

Mar 24
To interpret AI benchmarks, we need to look at the data.

Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses.

We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇
Interfaces for exploring evaluation outputs have been neglected. Imagine painfully rummaging through hundreds of JSON dumps, trying to figure out where an agent got stuck.

To encourage rich understanding of model behavior, we should make the experience more delightful.
Docent offers 4 automated workflows to support transcript analysis. We've used these primitives to detect broken evals, find strange behaviors, and probe model capabilities.

Read on to learn about each one.
Read 10 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(