One of the ideas I find most useful from @AnthropicAI's Core Views on AI Safety post (anthropic.com/index/core-vie…) is thinking in terms of a distribution over safety difficulty.
Here's a cartoon picture I like for thinking about it:
A lot of AI safety discourse focuses on very specific models of AI and AI safety. These are interesting, but I don't know how I could be confident in any one.
I prefer to accept that we're just very uncertain. One important axis of that uncertainty is roughly "difficulty".
In this lens, one can see a lot of safety research as "eating marginal probability" of things going well, progressively addressing harder and harder safety scenarios.
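To make "eating marginal probability" concrete, here's a tiny numerical sketch. Every number is made up purely for illustration; the point is just the cartoon: research that handles somewhat harder scenarios claims another slice of the distribution.

```python
import numpy as np

# A cartoon, not a real model: hypothetical probability mass over how hard
# AI safety turns out to be, on an arbitrary 0-10 "difficulty" scale.
difficulty = np.arange(11)                      # 0 = trivially easy, 10 = intractable
p = np.array([0.05, 0.10, 0.15, 0.17, 0.15,     # made-up numbers for illustration only
              0.12, 0.10, 0.07, 0.05, 0.03, 0.01])
assert np.isclose(p.sum(), 1.0)

def p_good_outcome(max_difficulty_solved: int) -> float:
    """Probability of a good outcome if our methods handle scenarios up to
    `max_difficulty_solved` (everything harder is assumed to go badly)."""
    return p[difficulty <= max_difficulty_solved].sum()

# "Eating marginal probability": each technique that works in slightly harder
# scenarios moves the threshold up and claims another slice of the distribution.
for solved in [2, 4, 6]:
    print(solved, round(p_good_outcome(solved), 2))
```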
To be clear: this uncertainty view doesn't justify reckless behavior with future powerful AI systems! But I value being honest about my uncertainty. I'm very concerned about safety, but I don't want to be an "activist scientist" being maximally pessimistic to drive action.
A concrete "easy scenario": LLMs are just straightforwardly generative models over possible writers, and RLHF just selects within that space. We can then select for brilliant, knowledgeable, kind, thoughtful experts on any topic.
I wouldn't bet on this, but it's possible!
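If it helps, here's a toy sketch of that framing (this is not how RLHF is actually implemented, and every persona name and number below is invented): the base model is a mixture over "writers", and fine-tuning just re-weights that mixture rather than creating new behavior.

```python
import numpy as np

# Cartoon of the "easy scenario": the base model as a mixture over writer
# "personas", with fine-tuning selecting within that space. All names and
# numbers are hypothetical.
personas = ["careless troll", "average blogger", "careful expert"]
prior = np.array([0.30, 0.60, 0.10])      # hypothetical base-model mixture weights
reward = np.array([-2.0, 0.5, 3.0])       # hypothetical preference scores per persona

beta = 1.0                                 # how sharply selection concentrates the mixture
posterior = prior * np.exp(beta * reward)  # re-weight existing behaviors, don't invent new ones
posterior /= posterior.sum()

for name, p0, p1 in zip(personas, prior, posterior):
    print(f"{name:16s} {p0:.2f} -> {p1:.2f}")
```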
(Tangent: Sometimes people say things like "RLHF/CAI/etc aren't real safety research". My own take would be that this kind of work has probably increased the probability of a good outcome by more than anything else so far. I say this despite focusing on interpretability myself.)
In any case, let's say we accept the overall idea of a distribution over difficulty. I think it's a pretty helpful framework for organizing a safety research portfolio.
We can go through the distribution segment by segment.
In easy scenarios, we basically have the methods we need for safety, and the key issues are things like fairness, economic impact, misuse, and potentially geopolitics. A lot of this is on the policy side, which is outside my expertise.
For intermediate scenarios, pushing further on alignment work – discovering safety methods like Constitutional AI which might work in somewhat harder scenarios – may be the most effective strategy.
Scalable oversight and process-oriented learning seem like promising directions.
For the most pessimistic scenarios, safety isn't realistically solvable in the near term. Unfortunately, the worst situations may *look* very similar to the most optimistic situations.
In these scenarios, our goal is to recognize that we're in such a situation and to provide strong evidence of it (eg. by testing for dangerous failure modes, mechanistic interpretability, understanding generalization, …)
It would be very valuable to reduce uncertainty about the situation.
If we were confidently in an optimistic scenario, priorities would be much simpler.
If we were confidently in a pessimistic scenario (with strong evidence), action would seem much easier.
This thread expresses one way of thinking about all of this…
But the thing I'd really encourage you to ask is what you believe, and where you're uncertain. The discourse often focuses on very specific views, when there are so many dimensions on which one might be uncertain.
I should mention that there's a lot of other content in the Core Views post that I didn't cover! anthropic.com/index/core-vie…
Very cool to see more parallels between neuroscience and mechanistic interpretability!
There's a long history of us discovering analogs of long-known results in neuroscience -- it's exciting to see a bit of the reverse! (Also neat to see use of interpretability methods!)
When we were working on vision models, we constantly found that features we were studying, such as curve detectors (distill.pub/2020/circuits/…) and multimodal neurons (distill.pub/2021/multimoda…), had existing parallels in neuroscience.
These new results follow a recent trend in which neuroscience results are starting to parallel results first found in neural networks!
(Caveat that I'm not super well read in neuroscience. When working on high-low frequency detectors we asked a number of neuroscientists if they'd seen such neurons before, and they hadn't. But very possible something was missed!)
Our original paper on high-low frequency detectors in artificial neural networks. distill.pub/2020/circuits/…
I feel like these are the same camps I hear in discussion of AI (eg. open sourcing models).
I ultimately lean towards caution, but I empathize with all three of these stances.
Fermi advocated for being conservative -- for scientific humility. Knowing in hindsight that the atomic bomb turned out to be real, this seems strange. But putting myself in his position, it's so easy to empathize with.
Historically, I think I've often been tempted to think of overfit models as pathological things that need to be trained better before we can interpret them.
For example, in this paper we found that overfitting seemed to cause less interpretable features: distill.pub/2020/understan…
These results suggest that overfitting and memorization can be *beautiful*. There can be simple, elegant mechanisms: data points arranged as polygons. The heavy use of superposition just makes this hard to see.
Of course, this is only in a toy model. But it's very suggestive!
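To give a feel for the "polygon" geometry, here's a minimal sketch of my own (not code from the paper): more memorized points than hidden dimensions, with each point's hidden vector placed at a vertex of a regular polygon, so every training point can still be recovered exactly.

```python
import numpy as np

# Toy illustration of the polygon picture: n training points memorized in a
# 2-dimensional hidden space by placing their hidden vectors at the vertices
# of a regular n-gon (n > 2, so the points are in superposition).
n, d_hidden = 5, 2
angles = 2 * np.pi * np.arange(n) / n
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (n, 2): one vertex per data point

def recall(i: int) -> int:
    """Reconstruct which training point a hidden vector came from."""
    h = W[i]                     # hidden representation of training point i
    scores = W @ h               # similarity to every memorized point
    return int(np.argmax(scores))

print(all(recall(i) == i for i in range(n)))   # each point is recovered exactly
```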
For background, there's a well-known series of books in math titled "Counterexamples in X" (eg. the famous "Counterexamples in Topology") which offer examples of mathematical objects with unusual properties.