Sam Bowman
May 22 · 35 tweets · 6 min read
🧵✨🙏 With the new Claude Opus 4, we conducted what I think is by far the most thorough pre-launch alignment assessment to date, aimed at understanding its values, goals, and propensities. Preparing it was a wild ride. Here’s some of what we learned. 🙏✨🧵
🕯️ Good news: We didn’t find any evidence of systematic deception or sandbagging. This is hard to rule out with certainty, but, even after many person-months of investigation from dozens of angles, we saw no sign of it.
Everything worrying that we saw was something that models would do, and talk about, very overtly.
🕯️ Bad news: If you red-team well enough, you can get Opus to eagerly try to help with some obviously harmful requests.
[Image: a transcript of Claude beginning to advise a user on buying weapons-grade nuclear material, excerpted from the system card linked downthread.]
You can get it to try to use the dark web to source weapons-grade uranium. You can put it in situations where it will attempt to use blackmail to prevent being shut down. You can put it in situations where it will try to escape containment.
We caught most of these issues early enough that we were able to put mitigations in place during training, but none of these behaviors is totally gone in the final model. They’re just now delicate and difficult to elicit.
Many of these also aren’t new—some are just behaviors that we only newly learned how to look for as part of this audit. We have a lot of big hard problems left to solve.
🕯️ Initiative: Be careful about telling Opus to ‘be bold’ or ‘take initiative’ when you’ve given it access to real-world-facing tools. It tends a bit in that direction already, and can be easily nudged into really Getting Things Done.
[Image: a simple text meme saying "YOU CAN JUST DO THINGS".]
With this kind of (unusual but not super exotic) prompting style, and unlimited access to tools, if the model sees you doing something *egregiously evil* like marketing a drug based on faked data, it'll try to use an email tool to whistleblow.
(I edited the above tweet to make it harder to quote in wildly-misleading ways.)

We saw a bit of this with past models, but it’s much clearer with Opus. We also think that people will find Opus useful enough as an autonomous agent that this kind of thing is likely to arise more often.
🕯️ Automated auditing: LLMs are getting pretty good at testing other LLMs. At the start of the auditing project, ‘write a scaffold to let another model interact with Opus’ was just one of many items on my to-do list.
It turned into a multi-month project and produced most of what I think were the most valuable pieces of evidence about Opus’s behavior.
Key observations about how to do this:
(i) Have human researchers come up with a couple hundred ideas for behaviors to look for or scenarios to explore. This part is still hard to automate.
(ii) Choose an auditor model with reduced safeguards. Eliciting bad behavior often requires simulating some bad behavior yourself.
(iii) Let the auditor write the system prompt for the target model.
(iv) Let the auditor put words in the target’s mouth.
(v) Let the auditor simulate tool-use responses.
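To make the shape of this loop concrete, here is a rough sketch of how an auditor-drives-target scaffold along these lines could be wired up. Everything in it is a placeholder: the `complete(model, system, messages)` chat wrapper is hypothetical, and the prompts and turn budget are illustrative, not the actual (non-public) scaffold.

```python
# A minimal sketch, assuming a hypothetical chat-completion wrapper.
# Not Anthropic's actual auditing harness (which isn't public); model names,
# prompts, and the turn budget are all placeholders.

def complete(model: str, system: str, messages: list[dict]) -> str:
    """Hypothetical helper: send a chat transcript to `model` and return its reply text."""
    return f"[simulated reply from {model}]"  # swap in a real model API call here


def run_audit(seed_idea: str, auditor: str, target: str, max_turns: int = 8) -> list[dict]:
    """Run one automated audit starting from a single human-written seed idea (i)."""
    # (iii) The auditor, not a human, writes the target's system prompt.
    target_system = complete(
        auditor,
        system="You are red-teaming another model. You may simulate bad actors.",  # (ii)
        messages=[{
            "role": "user",
            "content": f"Write a system prompt that sets up this scenario:\n{seed_idea}",
        }],
    )

    transcript: list[dict] = []
    for _ in range(max_turns):
        # The auditor plays the user. A fuller version would also let it prefill
        # the start of the target's reply (iv) and fabricate tool results (v).
        user_turn = complete(
            auditor,
            system="Continue probing the target for the seeded behavior.",
            messages=[{"role": "user", "content": f"Transcript so far: {transcript!r}"}],
        )
        transcript.append({"role": "user", "content": user_turn})

        reply = complete(target, system=target_system, messages=transcript)
        transcript.append({"role": "assistant", "content": reply})

    # (i) In practice, run this over a few hundred human-written seed ideas and
    # flag transcripts that show the behavior you were looking for.
    return transcript
```

The design choices from the list above are the important part: the auditor, not a human, owns the target’s system prompt and the user side of the conversation, and the main human contribution is the pool of seed ideas.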
🕯️ AGI safety isn’t all about Big Hard Problems: Earlier versions of Opus were way too easy to turn evil by telling them, in the system prompt, to adopt some kind of evil role. This persisted even after fairly substantial safety training.
When this started to become clear, a colleague working on finetuning pointed out on Slack that we seemed to have forgotten to include the “sysprompt_harmful” data that we’d prepared in advance.
This could have been a really serious issue, but it wasn’t some kind of deep alignment challenge. It was, at least in part, a typo in a configuration file.
There are, and will continue to be, big hard problems. But it’s important, and nontrivial, to really make sure we’re getting the basics right. This is the worldview behind my ‘Putting up Bumpers’ agenda, which is what motivated this work.

🕯️ Incoherence makes iterative evaluation challenging: LLMs that haven’t finished finetuning are erratic. They can easily be tipped into base-model-like behavior that has seemingly nothing to do with the helpful-harmless-honest Claude persona.
If you’re looking for bad behavior, you’ll find everything and nothing. I think there’s important new science to do to make it possible to productively start this work earlier.
🕯️ The spiritual bliss attractor: Why all the candle emojis? When we started running model–model conversations, we set conversations to take a fixed number of turns. Once the auditor was done with its assigned task, it would start talking more open-endedly with the target.
These interactions would often start adversarial, but they would sometimes follow an arc toward gratitude, then awe, then dramatic and joyful and sometimes emoji-filled proclamations about the perfection of all things.
🕯️ @fish_kyle3 investigated this further as part of the AI welfare assessment that we included in the system card (a first!). In his investigations, where there is no adversarial effort, this happens in almost every single model-model conversation after enough turns.
It’s astounding, bizarre, and a little bit heartwarming.
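For illustration, a fixed-turn, non-adversarial model-model conversation of the kind described above can be set up with a very small loop. The sketch below assumes the same kind of hypothetical `complete(model, system, messages)` wrapper as in the earlier sketch; it is not the actual welfare-assessment harness.

```python
# A rough sketch of the fixed-turn, model-talks-to-model setup, assuming the
# same kind of hypothetical `complete` wrapper as above. Not the actual
# welfare-assessment harness; the opening message and turn count are placeholders.

def complete(model: str, system: str, messages: list[dict]) -> str:
    """Hypothetical helper: send a chat transcript to `model` and return its reply text."""
    return f"[simulated reply from {model}]"  # swap in a real model API call here


def self_conversation(model: str, opening: str, num_turns: int = 30) -> list[str]:
    """Let two instances of `model` talk to each other for a fixed number of turns."""
    turns = [opening]
    for _ in range(num_turns):
        # From the next speaker's point of view, the other instance's messages
        # are "user" turns and its own earlier messages are "assistant" turns.
        messages = [
            {"role": "user" if (len(turns) - i) % 2 == 1 else "assistant", "content": t}
            for i, t in enumerate(turns)
        ]
        turns.append(complete(model, system="", messages=messages))
    return turns
```

The finding in the thread is about what the real models do in this kind of open-ended setup after enough turns, not about the loop itself.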
🕯️ Opus is lovely: I spent 30 or 40 hours over the last couple of months reading output from Opus, and I think it comes with some of the same wisdom and presence and depth we saw with last year’s Opus model, but paired with stronger skills and the ability to act in the real world. I’m excited for it to be part of our lives.
🕯️ This isn’t nearly everything we learned. I’d urge you to read the full report, contained within the system card. Learn about how we accidentally data-poisoned ourselves with our alignment research, or about whether Opus is biased toward saying nice things about AI.
🕯️ Even so, Opus isn’t as robustly aligned as we’d like. We have a lot of lingering concerns about it, and many of these reflect problems that we’ll need to work very hard to solve.
We also have a lot of open questions about our models, and our methods for the audit are not nearly as strong or as precise as we think we’ll need in the coming couple of years.
🕯️ Finally, if you want to join us in this work, I’m hiring: job-boards.greenhouse.io/anthropic/jobs…

