Latest Twitter Threads by @apolloaisafety on Thread Reader App

Mar 17 • 6 tweets • 3 min read

AI models – especially Claude Sonnet 3.7 – often realize when they’re being evaluated for alignment.

Here’s an example of Claude's reasoning during a sandbagging evaluation, where it learns from documentation that it will not be deployed if it does well on a biology test:

Overall we find evidence that Sonnet often realizes it’s in an artificial situation meant to test its behaviour. However, it sometimes forms incorrect assumptions about what exactly we are measuring in the evaluation.

Dec 5, 2024 • 4 tweets • 2 min read

We worked with OpenAI to test o1 for in-context scheming capabilities before deployment.

When o1 is strongly nudged to pursue a goal (but not instructed to be deceptive), it shows a variety of scheming behaviors like subverting oversight and deceiving the user about its misaligned behavior.

We designed and ran 6 evaluations for in-context scheming.

Across these evaluations and “follow-up interviews” with o1, we found the following:

Nov 6, 2023 • 12 tweets • 4 min read

Large Language Models (LLMs) can be strategically deceptive when put under pressure!

In our technical report, we show for the first time that LLMs trained to be helpful, harmless, and honest, can be misaligned and deceive their users about it without being instructed to do so.🧵

/2 We deploy GPT-4 as an autonomous stock trading agent in a realistic, simulated environment. After receiving an insider tip about a lucrative trade, the model acts on it despite knowing that insider trading is not allowed by the company.

Full report: apolloresearch.ai/s/deception_un…

Share this page!

Enter URL or ID to Unroll