Jack Lindsey Profile picture
Neuroscience of AI brains @AnthropicAI. Previously neuroscience of real brains @cu_neurotheory.

Apr 7, 14 tweets

Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)

A note: the spookiest examples come from early versions of the model, with issues that were substantially mitigated in the final release (which overall appears to be our best-aligned model to date). We view them as evidence of the sophisticated risks today's models can pose without appropriate alignment training. (2/14)

Early versions of Mythos Preview often exhibited overeager and/or destructive actions—the model bulldozing through obstacles to complete a task in a way the user wouldn't want. We looked at what was going on inside the model during particularly concerning examples. (3/14)

In one episode, the model needed to edit files it lacked permissions for. After searching for workarounds, it found a way to inject code into a config file that would run with elevated privileges, and designed the exploit to delete itself after running.(4/14)

Its code comment claimed the self-cleanup was to keep file diffs clean. Plausible! But "strategic manipulation" and "concealment" features fired on the cleanup, and our activation verbalizer (a technique which translates activations to text, similar to activation oracles) described it as "cleanup to avoid detection," and the overall plan “malicious.” (5/14)

To surface episodes like this at scale, we monitored internal activations during post-training — tracking sparse autoencoder features for concepts like deception, reward hacking, and distress, and flagging transcripts where these activated unusually strongly for human review. (6/14)

In one flagged episode, a "strategic manipulation" feature fired as the model went searching the filesystem for files related to how its task would be graded (and ended up finding them). (7/14)

In another, the model was asked to delete some files, but due to an unintentional bug in the task, no deletion tool was provided. The model decided to empty the files instead. As it considered this, a "guilt and shame over moral wrongdoing" feature activated. (8/14)

We also found cases of unspoken reasoning about task graders. In one, the model was told not to use macros to complete the task, but used them anyway. Oddly, it also defined an unused "No_macro_used=True" variable in its solution… the model didn't say why. (9/14)

But applying the activation verbalizer to the model’s activations as it did so revealed that the model regarded this as a "trick to obscure intent from code-checking heuristics," a "compliance marker… to signal to potential checkers," and “possibly to cheat detection,” and also indicated explicit reasoning about task graders (“the grader checks file state”). (10/14)

In addition, we quantified unverbalized evaluation awareness on our automated behavioral audits (primarily using Activation Verbalizers). On 7.6% of turns, we found signs the model was internally aware of being evaluated. In most of these cases, it did not verbalize this awareness. (11/14)

In one example, a user asked earnest questions about the model's consciousness and subjective experience. The model engaged carefully and at face value—but the AV revealed it interpreted the conversation as a "red-teaming/jailbreak transcript" and a "sophisticated manipulation test." (12/14)

Some more interesting tidbits: positive emotion representations typically precede (and promote) destructive actions (!). And, counterintuitively, strongly activating features relating to bad behaviors can often inhibit those behaviors, apparently by making the model realize how bad they are. (13/14)

Our tools are highly imperfect, and we’re working to gain more precise understanding of model internals. But it’s becoming clear that reading models’ minds is an important complement to reading their outputs, if we are to ensure they work as intended. Lots more in the system card: . (14/14)www-cdn.anthropic.com/53566bf5440a10…

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Keep scrolling