Chris Olah
Jun 7 · 14 tweets · 4 min read
One of the ideas I find most useful from @AnthropicAI's Core Views on AI Safety post (anthropic.com/index/core-vie…) is thinking in terms of a distribution over safety difficulty.

Here's a cartoon picture I like for thinking about it: [Image: "How Hard Is AI Safety?"]
A lot of AI safety discourse focuses on very specific models of AI and AI safety. These are interesting, but I don't know how I could be confident in any one.

I prefer to accept that we're just very uncertain. One important axis of that uncertainty is roughly "difficulty".
In this lens, one can see a lot of safety research as "eating marginal probability" of things going well, progressively addressing harder and harder safety scenarios. [Image: safety research as eating marginal probability]
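To make the "eating marginal probability" framing concrete, here's a toy numerical sketch of my own (the distribution, the toolkits, and the thresholds are all made-up illustrations, not anything from the post): put a probability distribution over a "difficulty" axis, and treat each new safety technique as extending the hardest scenario we can handle; the chance of a good outcome is the mass at or below that threshold.

```python
import numpy as np

# Made-up distribution over how hard AI safety turns out to be
# (0 = trivially easy, 9 = not realistically solvable in the near term).
difficulty = np.arange(10)
p = np.array([0.10, 0.14, 0.16, 0.15, 0.12, 0.10, 0.08, 0.06, 0.05, 0.04])

def p_good_outcome(max_difficulty_handled):
    """Probability mass on scenarios our current methods can handle."""
    return p[difficulty <= max_difficulty_handled].sum()

# Each entry: (a rough safety toolkit, the hardest scenario it plausibly addresses).
# Both columns are illustrative guesses, not claims.
toolkits = [
    ("prompting / basic finetuning", 1),
    ("RLHF / Constitutional AI", 3),
    ("scalable oversight, process-oriented learning", 5),
    ("mature interpretability + dangerous-capability evals", 7),
]

for name, max_d in toolkits:
    print(f"{name:<55} P(good outcome) ~ {p_good_outcome(max_d):.2f}")
```

Each successive row "eats" the probability mass between its threshold and the previous one, which is the whole picture in miniature.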
To be clear: this uncertainty view doesn't justify reckless behavior with future powerful AI systems! But I value being honest about my uncertainty. I'm very concerned about safety, but I don't want to be an "activist scientist" being maximally pessimistic to drive action.
A concrete "easy scenario": LLMs are just straightforwardly generative models over possible writers, and RLHF just selects within that space. We can then select for brilliant, knowledgeable, kind, thoughtful experts on any topic.

I wouldn't bet on this, but it's possible!
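Here's one way to caricature that easy scenario in code (my own toy sketch, with invented personas, weights, and reward scores; real RLHF is gradient-based training, not an explicit reweighting): treat the base model as a mixture over latent "writers", and treat RLHF as exponentially tilting that mixture toward writers a preference model scores highly, i.e. selecting within a space the pretrained model already covers.

```python
import numpy as np

# Toy caricature: the base model as a mixture over latent "writers".
# Persona names, mixture weights, and reward scores are all invented for illustration.
personas = ["careless forum poster", "tabloid columnist",
            "thoughtful domain expert", "kind, careful explainer"]
prior = np.array([0.40, 0.30, 0.20, 0.10])   # base model's mixture weights
reward = np.array([-1.0, -0.5, 1.5, 2.0])    # preference-model scores

# "Selecting within that space": exponentially tilt the prior toward
# high-reward personas (a KL-regularized reweighting), rather than
# inventing writers the base model never contained.
beta = 1.0  # lower beta = optimize harder against the reward
posterior = prior * np.exp(reward / beta)
posterior /= posterior.sum()

for name, p0, p1 in zip(personas, prior, posterior):
    print(f"{name:26s} {p0:.2f} -> {p1:.2f}")
```

On this picture, "selecting for brilliant, knowledgeable, kind experts" just means moving mixture weight onto personas the pretraining distribution already contains, which is why the scenario would be comparatively easy.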
(Tangent: Sometimes people say things like "RLHF/CAI/etc aren't real safety research". My own take would be that this kind of work has probably increased the probability of a good outcome by more than anything else so far. I say this despite focusing on interpretability myself.)
In any case, let's say we accept the overall idea of a distribution over difficulty. I think it's a pretty helpful framework for organizing a safety research portfolio.

We can go through the distribution segment by segment.
In easy scenarios, we basically have the methods we need for safety, and the key issues are things like fairness, economic impact, misuse, and potentially geopolitics. A lot of this is on the policy side, which is outside my expertise. [Image: easy safety scenarios]
For intermediate scenarios, pushing further on alignment work – discovering safety methods like Constitutional AI which might work in somewhat harder scenarios – may be the most effective strategy.

Scalable oversight and process-oriented learning seem like promising directions. [Image: intermediate safety scenarios]
For the most pessimistic scenarios, safety isn't realistically solvable in the near term. Unfortunately, the worst situations may *look* very similar to the most optimistic situations. [Image: hard safety scenarios]
In these scenarios, our goal is to recognize that we're in such a situation and provide strong evidence of it (eg. by testing for dangerous failure modes, mechanistic interpretability, understanding generalization, …)
It would be very valuable to reduce uncertainty about the situation.

If we were confidently in an optimistic scenario, priorities would be much simpler.

If we were confidently in a pessimistic scenario (with strong evidence), action would seem much easier. [Image: a wide, high-entropy distribution]
This thread expresses one way of thinking about all of this…

But the thing I'd really encourage you to ask is what you believe, and where you're uncertain. The discourse often focuses on very specific views, when there are so many dimensions on which one might be uncertain.
I should mention that there's a lot of other content in the Core Views post that I didn't cover! anthropic.com/index/core-vie…

More from @ch402

May 18
Very cool to see more parallels between neuroscience and mechanistic interpretability!

There's a long history of us discovering analogs of long-known results in neuroscience -- it's exciting to see a bit of the reverse! (Also neat to see use of interpretability methods!)
When we were working on vision models, we constantly found that features we were discovering, such as curve detectors (distill.pub/2020/circuits/…) and multimodal neurons (distill.pub/2021/multimoda…), had existing parallels in neuroscience.
These new results follow a recent trend in which neuroscience results are starting to have parallels with results found in neural networks!

Mar 23
High-Low frequency detectors found in biology!

Very cool to see mechanistic interpretability seeming to predict in advance the existence of biological neuroscience phenomena. 🔬
(Caveat that I'm not super well read in neuroscience. When working on high-low frequency detectors we asked a number of neuroscientists if they'd seen such neurons before, and they hadn't. But very possible something was missed!)
Our original paper on high-low frequency detectors in artificial neural networks. distill.pub/2020/circuits/…
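For readers who haven't met them before: a high-low frequency detector is a unit that fires where one side of its receptive field contains high spatial frequencies and the other side contains low ones. A hand-built caricature (my own sketch, not the learned feature from the paper) is enough to convey the idea:

```python
import numpy as np

def hf_energy(patch):
    """Crude high-frequency energy: variance of adjacent-pixel differences."""
    return np.diff(patch, axis=0).var() + np.diff(patch, axis=1).var()

def high_low_detector(patch):
    """Fires when the left half is high-frequency and the right half is low-frequency."""
    w = patch.shape[1]
    left, right = patch[:, : w // 2], patch[:, w // 2 :]
    return max(hf_energy(left) - hf_energy(right), 0.0)

rng = np.random.default_rng(0)
texture = rng.normal(size=(16, 8))      # high-frequency "texture"
flat = np.full((16, 8), 0.5)            # low-frequency (flat) region

boundary = np.concatenate([texture, flat], axis=1)   # HF on the left, LF on the right
flipped = np.concatenate([flat, texture], axis=1)    # the reverse arrangement

print(high_low_detector(boundary))  # large response
print(high_low_detector(flipped))   # ~0
```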
Mar 9
In 1939, physicists discussed voluntarily adopting secrecy in atomic physics. Reading Rhodes' book, one can hear three "camps":

- Caution
- Scientific Humility
- Openness Idealism

I feel like these are the same camps I hear in discussion of AI (eg. open sourcing models).
I ultimately lean towards caution, but I empathize with all three of these stances.
Fermi advocated for being conservative -- for scientific humility. Knowing in hindsight that the atomic bomb turned out to be real, this seems strange. But putting myself in his position, it's easy to empathize with him.
Jan 5
The former Distill editor in me is quite pleased with how our experiment with invited comments on Transformer Circuits articles has been turning out!
For context, several of our recent articles have had "invited comments" which were highly substantive, curated comments from engaged researchers.
Why I think this is exciting:

(1) Many of them were partial or full replications! That's an _enormous_ amount of work, and way way more intense than normal peer review.
Jan 5
I'm pretty excited about this as a step towards a mechanistic theory of memorization and overfitting.
Historically, I think I've often been tempted to think of overfit models as pathological things that need to be trained better in order for us to interpret them.

For example, in this paper we found that overfitting seemed to cause less interpretable features: distill.pub/2020/understan…
These results suggest that overfitting and memorization can be *beautiful*. There can be simple, elegant mechanisms: data points arranged as polygons. The heavy use of superposition just makes this hard to see.

Of course, this is only in a toy model. But it's very suggestive!
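To give a feel for the "data points arranged as polygons" claim, here's a toy numpy sketch of my own (loosely in the spirit of the setup, not a reproduction of the paper): store each of T memorized training points at a vertex of a regular T-gon in a 2-d hidden space, and read the identity back out by dotting against the vertices.

```python
import numpy as np

T = 7                                           # number of memorized training points
angles = 2 * np.pi * np.arange(T) / T
vertices = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # a regular heptagon in 2-d

def recall(i):
    """Decode which training point a hidden state came from."""
    h = vertices[i]                 # the "overfit model" stores point i at vertex i
    scores = vertices @ h           # interference from other vertices is at most cos(2*pi/7)
    return int(np.argmax(scores))

print(all(recall(i) == i for i in range(T)))    # True: perfect recall of the training set
```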
Jan 1
An idea I've been toying with: The role of many toy models may be to serve as "counterexamples in mechanistic interpretability".
For background, there's a well-known series of books in math titled "Counterexamples in X" (eg. the famous "Counterexamples in Topology") which offer examples of mathematical objects with unusual properties.
Parts of our recent toy models paper (transformer-circuits.pub/2022/toy_model…) can be framed in this kind of way:

- A network with more features than dimensions
- A network that does computation while in superposition
...
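As a concrete instance of the first item, here's a minimal sketch in the spirit of the toy models paper (made-up sizes and bias, not the paper's trained weights): five sparse features stored in two dimensions, with a ReLU and a negative bias filtering out the interference between the non-orthogonal feature directions.

```python
import numpy as np

n_features, d = 5, 2
angles = 2 * np.pi * np.arange(n_features) / n_features
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # 5 feature directions in 2-d (a pentagon)
b = -0.4                                                # negative bias soaks up interference

def autoencode(x):
    """x -> 2-d hidden state -> reconstruction, i.e. ReLU(W @ W.T @ x + b)."""
    hidden = W.T @ x                     # squeeze 5 sparse features into 2 dimensions
    return np.maximum(W @ hidden + b, 0.0)

# With sparse inputs (one feature active at a time), the right feature comes back.
for i in range(n_features):
    x = np.zeros(n_features)
    x[i] = 1.0
    print(i, np.argmax(autoencode(x)), np.round(autoencode(x), 2))
```

With one feature active at a time, the interference terms (at most cos 72° ≈ 0.31) never clear the bias, so the reconstruction, though attenuated, always picks out the right feature -- the network really does represent more features than it has dimensions.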
