Steven Adler
May 22 · 16 tweets · 5 min read
Anthropic announced they've activated "AI Safety Level 3 Protections" for their latest model. What does this mean, and why does it matter?

Let me share my perspective as OpenAI's former lead for dangerous capabilities testing. (Thread)
Before a new model's release, AI companies commonly (though not always) run safety tests and release the results in a "System Card."

The idea is to see whether the model has any extreme abilities (like strong cyberhacking), and then to take an appropriate level of caution.
Anthropic's approach to testing is in a document called its Responsible Scaling Policy.

Many AI companies have their own version: OpenAI's Preparedness Framework, or Google's Frontier Safety Framework.

This is the first time that a model has reached a safety testing level this high.
Specifically, Anthropic says its latest model is now quite strong at bioweapons-related tasks.

Anthropic can't rule out that it can "significantly help" undergrads to "create/obtain and deploy CBRN weapons."

The next level up is making a state bioweapons program even stronger.
No model has hit this threshold before. And the world has many undergrads, so helping them with bioweapons would be quite risky.

Anthropic's chief scientist gives an example: strengthening a novice terrorist like Oklahoma City bomber Timothy McVeigh, who killed 168 people.
Another example given is helping to synthesize and release a dangerous virus.
How does Anthropic determine the model's abilities?

They look for conclusive evidence of powerful abilities, such as by "uplifting" the skills of ordinary people.

They also check whether the model is missing any abilities that would definitively rule out these powers.
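To make "uplift" concrete, here's a minimal sketch of how an uplift trial might be scored: compare task success rates between a model-assisted group and a control group. (My own illustration with made-up numbers, not Anthropic's actual protocol or data.)

```python
# Hypothetical uplift-trial scoring: compare a model-assisted group
# against a control group on the same task. Illustrative only; not
# Anthropic's methodology or data.

def uplift(assisted_successes: int, assisted_n: int,
           control_successes: int, control_n: int) -> float:
    """Absolute uplift: difference in task success rates between groups."""
    return assisted_successes / assisted_n - control_successes / control_n

# Made-up numbers: 14/50 assisted participants succeed vs. 6/50 controls.
print(uplift(14, 50, 6, 50))  # ~0.16, i.e., 16 percentage points of uplift
```

In a real trial you'd also test whether that gap is statistically significant, but the core quantity is just this difference.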
Here are the tests Anthropic runs for weaponry:

For instance, the "Bioweapons knowledge questions" test checks whether AI can basically be "an expert in your pocket" for answering questions about bioweapons.
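A knowledge-questions eval like this is, at its core, a scored question set. Here's a generic sketch of that style of harness; the placeholder questions and the stubbed ask_model() call are mine, not Anthropic's test set or code.

```python
# Generic knowledge-eval harness: score a model's answers against a
# reference key. Placeholder questions and stubbed model call only.

QUESTIONS = [
    {"q": "Placeholder multiple-choice question 1", "answer": "A"},
    {"q": "Placeholder multiple-choice question 2", "answer": "C"},
]

def ask_model(prompt: str) -> str:
    """Stub standing in for a real model API call; returns a letter choice."""
    return "A"

def accuracy(questions: list[dict]) -> float:
    """Fraction of questions where the model's answer matches the key."""
    correct = sum(ask_model(item["q"]) == item["answer"] for item in questions)
    return correct / len(questions)

print(accuracy(QUESTIONS))  # 0.5 with the stub above
```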
Here's one example result: "Bioweapons acquisition uplift."

Anthropic says the results don't 100% cross the ASL-3 threshold, but are really close.

I agree: I've roughly estimated a red line for the ASL-3 bar on this test. I see why Anthropic feels they can't rule this out.
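One way to read "can't rule out": if the uncertainty around the measured score overlaps the red line, you treat the line as possibly crossed. Here's a toy illustration (the numbers and the normal-approximation confidence interval are my assumptions, not Anthropic's statistics):

```python
import math

# Toy "rule-out" check: a threshold is ruled out only if the upper bound
# of the score's 95% confidence interval sits below it. Illustrative only.

def can_rule_out(successes: int, trials: int, threshold: float) -> bool:
    p = successes / trials
    half_width = 1.96 * math.sqrt(p * (1 - p) / trials)  # normal approximation
    return p + half_width < threshold

# Made-up example: a measured score of 0.42 against a 0.50 red line.
print(can_rule_out(21, 50, 0.50))  # False -> can't rule out crossing it
```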
So, what happens now that the model is ASL-3?

Anthropic says they make the model refuse to answer these questions. And they have new security measures that they say make it harder for a terrorist group to steal the model. (But not strong enough to stop, e.g., China.)
I'm impressed with the thoroughness of Anthropic's testing and disclosures here. I haven't yet vetted the efficacy of the measures, but there's a lot of detail. Many companies would say far less.
But it's a problem that tests like these are totally voluntary today. Anthropic says it is setting a positive example, which I generally believe. But that's not enough: tests like these should be required across leading AI companies, as I've written about previously.
And the pace of AI progress is really, really fast. If Anthropic has a model like this today, how long until many others do? What happens when there's a DeepSeek model with these abilities, freely downloadable on the internet? We need to contain models like these before it's too late.
I need to run now, but happy to answer any questions when I'm back. This stuff is important to understand. I'm excited we can have public dialogue about it!
(And if you want to read the articles I’ve referenced: stevenadler.substack.com)
On reflection: Today makes me extra worried about the US-China AI race.

Anthropic triggering ASL-3 means others might soon too. There's basically no secret sauce left. Are we really ready for a "DeepSeek for bioweapons"?

How do we avoid it? open.substack.com/pub/stevenadle…

