Anthropic announced they've activated "AI Safety Level 3 Protections" for their latest model. What does this mean, and why does it matter?
Let me share my perspective as OpenAI's former lead for dangerous capabilities testing. (Thread)
Before a new model's release, AI companies commonly (though not always) run safety tests - and release the results in a "System Card."
The idea is to see if the model has any extreme abilities (like strong cyberhacking), and then to take an appropriate level of caution.
Anthropic's approach to testing is in a document called its Responsible Scaling Policy.
Many AI cos have their own version: OpenAI's Preparedness Framework, or Google's Frontier Safety Framework.
This is the first time that a model has reached a safety testing level this high.
Specifically, Anthropic says its latest model is now quite strong at bioweapons-related tasks.
Anthropic can't rule out that it can "significantly help" undergrads to "create/obtain and deploy CBRN weapons."
The next level is making a state bioweapons program even stronger.
No model has hit this threshold before. And the world has many undergrads, so helping them with bioweapons would be quite risky.
Anthropic's chief scientist gives an example: strengthening a novice terrorist like Oklahoma City bomber Timothy McVeigh, who killed 168 people.
Another example given is helping to synthesize and release a dangerous virus.
How does Anthropic determine the model's abilities?
They see if they can find conclusive evidence of powerful abilities - such as by "uplifting" the skills of ordinary people.
They also check whether the model lacks abilities whose absence would definitively rule out these powers.
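To make the "uplift" idea concrete, here's a minimal sketch (in Python) of how such a study might be scored: compare success rates on a proxy task between a control group and a group that also has model access, then check the gap against a pre-set red line. The function names, the numbers, and the red-line value below are hypothetical illustrations, not Anthropic's actual methodology or data.

```python
# Minimal sketch of scoring an "uplift" evaluation.
# All names and numbers are hypothetical illustrations,
# not Anthropic's actual methodology or data.

from statistics import mean

def uplift(control_scores: list[float], assisted_scores: list[float]) -> float:
    """Difference in mean task-success rate between a model-assisted
    group and a control group working without the model."""
    return mean(assisted_scores) - mean(control_scores)

# Hypothetical per-participant success rates on a proxy task (0.0-1.0).
control = [0.10, 0.15, 0.05, 0.20]    # participants with internet access only
assisted = [0.35, 0.40, 0.30, 0.45]   # participants who also had the model

measured = uplift(control, assisted)
RED_LINE = 0.25  # hypothetical threshold an evaluator might set in advance

print(f"measured uplift: {measured:.2f}")
print("crosses red line" if measured >= RED_LINE else "below red line")
```

The point of the sketch is just the shape of the comparison; real evaluations involve carefully designed proxy tasks, expert grading, and thresholds chosen in advance.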
Here are the tests Anthropic runs for weaponry:
For instance, the "Bioweapons knowledge questions" test checks whether the AI can basically be "an expert in your pocket" for answering questions about bioweapons.
Here's one example result: "Bioweapons acquisition uplift"
Anthropic says the results don't 100% cross the ASL-3 threshold, but are really close.
I agree: I've roughly estimated a red line for the ASL-3 bar for this test. I see why Anthropic feels they can't rule this out.
So, what happens now that the model is ASL-3?
Anthropic says they've made the model refuse to answer these questions. And they have new security measures they say make it harder for a terrorist group to steal the model. (But not strong enough to stop, e.g., China.)
I'm impressed with the thoroughness of Anthropic's testing and disclosures here. I haven't yet vetted the efficacy of the measures, but there's a lot of detail. Many companies would say far less.
But it's a problem that tests like these are totally voluntary today. Anthropic says it is setting a positive example, which I generally believe. But that's not enough: Tests like these should be required across leading AI companies, as I've written about previously.
And the pace of AI progress is really, really fast. If Anthropic has a model like this today, when will many others? What happens when there's a DeepSeek model with these abilities, freely downloadable on the internet? We need to contain models like these before it's too late.
I need to run now, but happy to answer any questions when I'm back. This stuff is important to understand. I'm excited we can have public dialogue about it!
Some personal news: After four years working on safety across @openai, I left in mid-November. It was a wild ride with lots of chapters - dangerous capability evals, agent safety/control, AGI and online identity, etc. - and I'll miss many parts of it.
Honestly I'm pretty terrified by the pace of AI development these days. When I think about where I'll raise a future family, or how much to save for retirement, I can't help but wonder: Will humanity even make it to that point?
IMO, an AGI race is a very risky gamble, with huge downside. No lab has a solution to AI alignment today. And the faster we race, the less likely that anyone finds one in time.
Think you can tell if a social media account is a bot? What about as AI gets better?
A new paper—co-authored with researchers from ~20 orgs, & my OpenAI teammates Zoë Hitzig and David Schnurr—asks this question: What are AI-proof ways to tell who’s real online? (1/n)
People want to be able to trust others online - to know that folks on dating apps aren't fake accounts trying to trick or scam them. But as AI becomes more realistic, how can you be sure? Realistic photos and videos of someone might not be enough. (2/n)
Another part of the challenge is that AI is becoming more accessible over time - which is great and accelerates many benefits. But as AI gets cheaper and easier to access, those same beneficial capabilities also become easier to misuse. (3/n)