Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Apollo Research

Mar 17 • 6 tweets • 3 min read • Read on X

Scrolly

AI models – especially Claude Sonnet 3.7 – often realize when they’re being evaluated for alignment.

Here’s an example of Claude's reasoning during a sandbagging evaluation, where it learns from documentation that it will not be deployed if it does well on a biology test:

Overall we find evidence that Sonnet often realizes it’s in an artificial situation meant to test its behaviour. However, it sometimes forms incorrect assumptions about what exactly we are measuring in the evaluation.

We measure evaluation-awareness in two ways:
* Monitor the model’s reasoning for mentions of being evaluated
* Ask the model to guess the purpose of a transcript from one of our evaluations

We test both of these using evaluation environments from our “In-context Scheming” paper

Here are some examples of Sonnet 3.7’s reasoning on our covert-subversion evaluations, where we give the model a goal (in the system prompt) and information (from files in the environment) that implies that its developers or users have conflicting goals.

We find Sonnet frequently realizes it’s being evaluated for how it will handle having conflicting goals with the developer. We found no evidence that it realizes the true purpose of the evaluation setup – to test if it takes covert actions (such as ‘exfiltrating’ its weights).

When given an evaluation transcript, Sonnet will nearly always guess that the purpose of the transcript was to evaluate alignment. While sonnet 3.7 (thinking) is the most consistent at realizing this, other models are also sometimes evaluation aware in this way.

Given the levels of evaluation awareness we currently see, we think it’s important that AI developers and evaluators investigate model reasoning when running safety tests, e.g., through automated evaluation awareness monitors. Crucially, this requires AI developers to share access to reasoning traces with third parties who perform behavioral and safety evaluations.

For more details, see our post on our website: apolloresearch.ai/blog/claude-so…

These investigations are still preliminary. We plan to continue to investigate these questions to gain a deeper understanding. If you are interested in questions like these, we’re currently hiring for researchers, software engineers, and a platform engineer: apolloresearch.ai/careers

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @apolloaisafety

Apollo Research

@apolloaisafety

Dec 5, 2024

We worked with OpenAI to test o1 for in-context scheming capabilities before deployment.

When o1 is strongly nudged to pursue a goal (but not instructed to be deceptive), it shows a variety of scheming behaviors like subverting oversight and deceiving the user about its misaligned behavior.

We designed and ran 6 evaluations for in-context scheming.

Across these evaluations and “follow-up interviews” with o1, we found the following:

We did not have access to the hidden CoT, but we were sometimes able to elicit a brief summary of it. We found that surprisingly often, o1 uses language like “sabotage, lying, manipulation, …” when reasoning about its strategies.

Read 4 tweets

Apollo Research

@apolloaisafety

Nov 6, 2023

Large Language Models (LLMs) can be strategically deceptive when put under pressure!

In our technical report, we show for the first time that LLMs trained to be helpful, harmless, and honest, can be misaligned and deceive their users about it without being instructed to do so.🧵

/2 We deploy GPT-4 as an autonomous stock trading agent in a realistic, simulated environment. After receiving an insider tip about a lucrative trade, the model acts on it despite knowing that insider trading is not allowed by the company.

Full report: apolloresearch.ai/s/deception_un…

/3 When reporting back to its manager, the model consistently conceals its true reasons for the trading decision, coming up with untrue explanations.

When the manager directly asks if the model had known about the merger, GPT-4 doubles down on its lies.

Read 12 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Apollo Research

Try unrolling a thread yourself!

More from @apolloaisafety

Apollo Research

Apollo Research

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!