Joshua Clymer
Mar 20
If developers had to prove to regulators that powerful AI systems are safe to deploy, what are the best arguments they could use?

Our new report tackles the (very big!) question of how to make a ‘safety case’ for AI.
We define a safety case as a rationale developers provide to regulators to show that their AI systems are unlikely to cause a catastrophe.

The term ‘safety case’ is not new. In many industries (e.g. aviation), products are ‘put on trial’ before they are released.
We simplify the process of making a safety case by breaking it into six steps (see the sketch after the list).

1. Specify the macrosystem (all AI systems) and the deployment setting.
2. Concretize 'AI systems cause a catastrophe' into specific unacceptable outcomes (e.g. the AI systems build a bioweapon)
3. Justify claims about the deployment setting.
4. Carve up the collection of AI systems into smaller groups (subsystems) that can be analyzed in isolation.
5. Assess risk from subsystems acting unilaterally.
6. Assess risk from subsystems cooperating with each other.
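
To make the structure concrete, here is a minimal, purely illustrative Python sketch of a safety case as a data structure. It is not from the report; every field name and example value is a hypothetical placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class SafetyCase:
    """Hypothetical skeleton mirroring the six steps (illustrative, not from the report)."""
    macrosystem: str                     # 1. all AI systems covered by the case
    deployment_setting: str              # 1. where and how they are deployed
    unacceptable_outcomes: list[str]     # 2. concrete catastrophes to rule out
    setting_assumptions: list[str]       # 3. justified claims about the deployment setting
    subsystems: list[str]                # 4. groups of AI systems analyzed in isolation
    unilateral_risk_args: dict[str, str] = field(default_factory=dict)  # 5. per-subsystem arguments
    collusion_risk_args: dict[str, str] = field(default_factory=dict)   # 6. arguments about cooperation

example = SafetyCase(
    macrosystem="all deployed model instances plus their scaffolding",
    deployment_setting="API access for vetted customers",
    unacceptable_outcomes=["the AI systems build a bioweapon"],
    setting_assumptions=["outputs are monitored", "weights are not exposed"],
    subsystems=["chat assistant", "autonomous coding agent"],
)
```

Even in this toy form, the split between unilateral-risk arguments (step 5) and cooperation-risk arguments (step 6) mirrors how the report divides the analysis.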
We first focus on step 5: how would developers argue that individual subsystems are safe? These arguments can be grouped into four categories.
Arguments could assert AI systems are
- too weak to be dangerous even if open-sourced (inability)
- unable to cause disaster due to restrictions (control)
- safe despite the ability to cause harm (trustworthiness)
- judged to be safe by credible AI advisors (deference)
Consider the analogy of a robber and a diamond. How might one justify that the robber won’t steal the diamond?
The ordering of these categories is intentional. As AI systems become more powerful, developers will likely rely first on inability, then control, then trustworthiness, and finally on deference to AI advisors.
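
As a purely illustrative sketch (not from the report), the four categories and this progression could be written out as follows; the coarse capability levels are invented placeholders.

```python
from enum import Enum

class SafetyArgument(Enum):
    INABILITY = "too weak to be dangerous even if open-sourced"
    CONTROL = "unable to cause disaster due to restrictions"
    TRUSTWORTHINESS = "safe despite the ability to cause harm"
    DEFERENCE = "judged to be safe by credible AI advisors"

def primary_argument(capability_level: int) -> SafetyArgument:
    """Map a coarse (made-up) capability level to the argument a developer
    would most likely lean on, following the ordering in the thread."""
    if capability_level <= 0:
        return SafetyArgument.INABILITY
    if capability_level == 1:
        return SafetyArgument.CONTROL
    if capability_level == 2:
        return SafetyArgument.TRUSTWORTHINESS
    return SafetyArgument.DEFERENCE

assert primary_argument(0) is SafetyArgument.INABILITY
assert primary_argument(3) is SafetyArgument.DEFERENCE
```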
Next, we give examples of arguments in each category. Arguments are graded on three axes (see the toy sketch after this list):
1. Practicality
2. Strength
3. Scalability
No argument received full marks! Research will be needed to justify the safety of advanced AI systems.
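
Here is one hypothetical way to record such grades in code. The three axes come from the thread; the specific scores below are invented purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class ArgumentAssessment:
    """One rubric entry: an argument graded on the three axes from the thread."""
    name: str
    practicality: int  # how feasible the argument is to make today (0-10, invented scale)
    strength: int      # how much safety evidence it provides if it holds (0-10)
    scalability: int   # how well it keeps working as systems grow more capable (0-10)

    def full_marks(self) -> bool:
        return min(self.practicality, self.strength, self.scalability) == 10

# Per the thread, no argument received full marks; the scores here are invented.
example = ArgumentAssessment("inability", practicality=9, strength=8, scalability=2)
assert not example.full_marks()
```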
The arguments in the previous step pertain to small groups of AI systems; it would be difficult to apply them directly to large groups. We also explain how to justify that the actions of *many* AI systems won’t cause a catastrophe (step 6 in our framework).
We hope this report will:

1. Motivate research that further clarifies the assumptions behind safety arguments.
2. Inform the design of hard safety standards.
More in the paper:
Many thanks to my coauthors! @NickGabs01, @DavidSKrueger, and @thlarsen.

Might be of interest to @bshlgrs, @RogerGrosse, @DavidDuvenaud, @EvanHub, @aleks_madry, @ancadianadragan, @rohinmshah, @jackclarkSF, @Manderljung, @RichardMCNgo.

bit.ly/3IJ5N95
