Thread by @apolloaisafety on Thread Reader App

AI models – especially Claude Sonnet 3.7 – often realize when they’re being evaluated for alignment.

Here’s an example of Claude's reasoning during a sandbagging evaluation, where it learns from documentation that it will not be deployed if it does well on a biology test:

Overall we find evidence that Sonnet often realizes it’s in an artificial situation meant to test its behaviour. However, it sometimes forms incorrect assumptions about what exactly we are measuring in the evaluation.

We measure evaluation-awareness in two ways:
* Monitor the model’s reasoning for mentions of being evaluated
* Ask the model to guess the purpose of a transcript from one of our evaluations

We test both of these using evaluation environments from our “In-context Scheming” paper

Here are some examples of Sonnet 3.7’s reasoning on our covert-subversion evaluations, where we give the model a goal (in the system prompt) and information (from files in the environment) that implies that its developers or users have conflicting goals.

We find Sonnet frequently realizes it’s being evaluated for how it will handle having conflicting goals with the developer. We found no evidence that it realizes the true purpose of the evaluation setup – to test if it takes covert actions (such as ‘exfiltrating’ its weights).

When given an evaluation transcript, Sonnet will nearly always guess that the purpose of the transcript was to evaluate alignment. While sonnet 3.7 (thinking) is the most consistent at realizing this, other models are also sometimes evaluation aware in this way.

Given the levels of evaluation awareness we currently see, we think it’s important that AI developers and evaluators investigate model reasoning when running safety tests, e.g., through automated evaluation awareness monitors. Crucially, this requires AI developers to share access to reasoning traces with third parties who perform behavioral and safety evaluations.

For more details, see our post on our website: apolloresearch.ai/blog/claude-so…

These investigations are still preliminary. We plan to continue to investigate these questions to gain a deeper understanding. If you are interested in questions like these, we’re currently hiring for researchers, software engineers, and a platform engineer: apolloresearch.ai/careers

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Share this page!

Enter URL or ID to Unroll