Transluce
Open and scalable technology for understanding AI systems.
Nov 14, 2025
Can LMs learn to faithfully describe their internal features and mechanisms?

In our new paper led by Research Fellow @belindazli, we find that they can—and that models explain themselves better than other models do.

How can we get explanations out of LMs? One easy way is to ask LMs to explain their decisions directly, or to monitor their CoTs.
But there’s no guarantee these explanations are faithful to internal mechanisms, and a lot of work over the last ~2 years has uncovered cases where they aren’t.
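For concreteness, here is roughly what “asking the model directly” looks like. This is a minimal sketch, not the paper’s setup: it assumes an OpenAI-compatible chat API, and the model name and prompts are placeholders. It only elicits a verbal explanation; nothing in the output tells you whether that explanation matches the model’s internal computation.

```python
# Minimal sketch (not the paper's setup): ask a chat model for a decision
# plus a verbal self-explanation using the OpenAI Python SDK.
# The model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "Answer the question, then explain how you reached your answer."},
        {"role": "user", "content": "Does 'bank' mean a river bank or a financial bank in: 'She sat on the bank and watched the water'?"},
    ],
)

# The explanation is just more model text: nothing here guarantees it
# reflects the features or mechanisms that actually drove the answer.
print(response.choices[0].message.content)
```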
Apr 16, 2025
We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted.

We were surprised, so we dug deeper 🔎🧵(1/)

We generated 1k+ conversations using human prompters and AI investigator agents, then used Docent to surface surprising behaviors. It turns out misrepresentation of capabilities also occurs for o1 & o3-mini!

📝Blog: transluce.org/investigating-…

Here’s some of what we found 👀 (2/)
Mar 24, 2025
To interpret AI benchmarks, we need to look at the data.

Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses.

We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇

Interfaces for exploring evaluation outputs have been neglected. Imagine painfully rummaging through hundreds of JSON dumps, trying to figure out where an agent got stuck.

To encourage rich understanding of model behavior, we should make the experience more delightful.
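To make the pain concrete, here is a rough sketch of the status quo Docent is meant to replace: manually looping over transcript JSON files and grepping for error strings. The directory layout, field names ("messages", "role", "content"), and error keywords are assumptions for illustration, not Docent's actual format or API.

```python
# Rough sketch of the manual workflow: scan a folder of agent transcript
# JSON files for signs the agent got stuck. File layout and field names
# ("messages", "role", "content") are assumed; real transcript formats vary.
import json
from pathlib import Path

ERROR_HINTS = ("traceback", "permission denied", "command not found", "timed out")

for path in sorted(Path("transcripts").glob("*.json")):
    transcript = json.loads(path.read_text())
    for i, msg in enumerate(transcript.get("messages", [])):
        text = str(msg.get("content", "")).lower()
        if any(hint in text for hint in ERROR_HINTS):
            # Print a short excerpt so a human can decide whether to dig in.
            print(f"{path.name} [turn {i}, {msg.get('role', '?')}]: {text[:120]}")
```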