If you'd asked me a year ago, superposition would have been by far the reason I was most worried that mechanistic interpretability would hit a dead end.
I'm now very optimistic. I'd go as far as saying it's now primarily an engineering problem -- hard, but posing less fundamental risk.
Well trained sparse autoencoders (scale matters!) can decompose a one-layer model into very nice, interpretable features.
They might not be 100% monosemantic, but they're damn close. We do detailed case studies and I feel comfortable saying they're at least as monosemantic as InceptionV1 curve detectors (distill.pub/2020/circuits/…).
But it's not just cherry picked features. The vast majority of the features are this nice. And you can check for yourself - we published all the features in an interface you can use to explore them!
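To make this concrete, here's a minimal sketch of the kind of sparse autoencoder setup I mean (hypothetical code and hyperparameters, not our actual training setup): an overcomplete ReLU autoencoder trained to reconstruct MLP activations, with an L1 penalty pushing the feature activations to be sparse.

```python
# Minimal sketch (hypothetical hyperparameters) of a sparse autoencoder that
# decomposes d_model-dim MLP activations into an overcomplete set of sparse features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse features.
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

The features are then the decoder's columns: directions in activation space whose sparse coefficients are what you inspect and label.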
There's a lot that could be said, but one of the coolest things to me was that we found "finite state automata"-like assemblies of features. The simplest case is features which cause themselves to fire more.
(Keep in mind that this is a small one-layer model -- it's dumb!)
A slightly more complex system models "all caps snake case" variables.
This four node system models HTML.
And this one models IRC messages.
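To make the "finite state automata" framing concrete, here's a rough sketch (hypothetical code, not from the paper) of one way to look for this kind of structure: estimate how often each feature firing on one token predicts each feature firing on the next token.

```python
# Hypothetical sketch: estimate feature-to-feature "transition probabilities"
# across adjacent tokens -- an FSA-flavored summary of feature dynamics.
import numpy as np

def feature_transitions(acts, threshold=0.0):
    """acts: [n_tokens, n_features] feature activations over a long corpus.
    Returns P[i, j] ~= P(feature j fires at t+1 | feature i fires at t)."""
    fired = acts > threshold
    prev, nxt = fired[:-1].astype(float), fired[1:].astype(float)
    counts = prev.T @ nxt                   # co-firing counts across adjacent steps
    totals = prev.sum(axis=0)[:, None]      # how often each feature fired
    return counts / np.maximum(totals, 1.0)
```

A feature that "causes itself to fire more" would show up as a large diagonal entry, and the snake-case / HTML / IRC loops as small groups of features with high transition probabilities among themselves. (This is only a correlational proxy, of course, not a causal claim.)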
It's also worth noting that this is part of a broader trend of a lot of recent progress on attacking superposition. I'd especially highlight contemporaneous work by Cunningham et al which has some really nice confirmatory results:
OK, one other thing I can't resist adding is that we found region neurons (eg. Australia, Canada, Africa, etc neurons) similar to these old results in CLIP () and also to the recent results by @wesg52 et al ().
There's much, much more in the paper (would it really be an Anthropic interpretability paper if it wasn't 100+ pages??), but I'll let you actually look at it:
@zefu_superhzf (2) Although we can understand attention heads, I suspect that something like "attention head superposition" is going on and that we could give a simpler explanation if we could resolve it.
@zefu_superhzf See more discussion in the paper.
@SussilloDavid I once saw a paper (from Surya Ganguli?) that suggested maybe the brain does compressed sensing when communicating between different regions through bottlenecks.
This is much closer to the thing we're imagining with superposition!
@SussilloDavid But the natural extension of this is that actually, maybe all neural activations are in a compressed form (and then the natural thing to do is dictionary learning).
@SussilloDavid I'd be very curious to know what happens if someone takes biological neural recordings over lots of data and does a dictionary learning / sparse autoencoder expansion.
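To spell out what I mean (hypothetical sketch, glossing over all the real preprocessing a neuroscientist would do), something as simple as off-the-shelf dictionary learning on a timepoints-by-neurons matrix:

```python
# Hypothetical sketch: a sparse dictionary-learning expansion of neural
# recordings, analogous to the sparse autoencoder we ran on MLP activations.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# recordings: [n_timepoints, n_neurons] firing rates (placeholder data here)
recordings = np.random.rand(5_000, 100)

# Learn an overcomplete dictionary: more "features" than neurons, with each
# timepoint explained by a sparse combination of them.
dl = MiniBatchDictionaryLearning(n_components=1_000, alpha=1.0)
codes = dl.fit_transform(recordings)    # [n_timepoints, n_components], sparse
features = dl.components_               # [n_components, n_neurons] dictionary
```

The interesting question is whether the learned dictionary elements turn out to be more interpretable than the raw neurons, the way they are for the transformer's MLP activations.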
@SussilloDavid I think superposition is much more specific. You can think of it as a sub-type of "distributed representation". I tried to pin this down here: transformer-circuits.pub/2023/superposi…
One of the ideas I find most useful from @AnthropicAI's Core Views on AI Safety post (anthropic.com/index/core-vie…) is thinking in terms of a distribution over safety difficulty.
Here's a cartoon picture I like for thinking about it:
A lot of AI safety discourse focuses on very specific models of AI and AI safety. These are interesting, but I don't know how I could be confident in any one.
I prefer to accept that we're just very uncertain. One important axis of that uncertainty is roughly "difficulty".
In this lens, one can see a lot of safety research as "eating marginal probability" of things going well, progressively addressing harder and harder safety scenarios.
Very cool to see more parallels between neuroscience and mechanistic interpretability!
There's a long history of us discovering analogs of long-known results in neuroscience -- it's exciting to see a bit of the reverse! (Also neat to see use of interpretability methods!)
When we were working on vision models, we constantly found that features we were discovering, such as curve detectors (distill.pub/2020/circuits/…) and multimodal neurons (distill.pub/2021/multimoda…), had existing parallels in neuroscience.
These new results follow a recent trend of places where neuroscience results are starting to have parallels with results found in neural networks!
(Caveat that I'm not super well read in neuroscience. When working on high-low frequency detectors we asked a number of neuroscientists if they'd seen such neurons before, and they hadn't. But very possible something was missed!)
Our original paper on high-low frequency detectors in artificial neural networks. distill.pub/2020/circuits/…
I feel like these are the same camps I hear in discussion of AI (eg. open sourcing models).
I ultimately lean towards caution, but I empathize with all three of these stances.
Fermi advocated for being conservative -- for scientific humility. Knowing in hindsight that the atomic bomb did become real, this seems strange. But putting myself in his position, it's so easy to empathize with him.
Historically, I think I've often been tempted to think of overfit models as pathological things that need to be trained better in order for us to interpret them.
For example, in this paper we found that overfitting seemed to cause less interpretable features: distill.pub/2020/understan…
These results suggest that overfitting and memorization can be *beautiful*. There can be simple, elegant mechanisms: data points arranged as polygons. The heavy use of superposition just makes this hard to see.
Of course, this is only in a toy model. But it's very suggestive!
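For intuition, here's a rough sketch of the kind of toy experiment involved (hypothetical dimensions and hyperparameters, loosely following the toy-model setup): reconstruct a handful of fixed data points through a 2D bottleneck with tied weights, then look at the hidden vector each data point gets.

```python
# Hypothetical sketch of a memorizing toy model: a few fixed data points,
# reconstructed through a tiny bottleneck with tied weights.
import torch

n_points, n_features, d_hidden = 6, 1000, 2
data = torch.rand(n_points, n_features)        # the entire "training set"

W = torch.nn.Parameter(0.1 * torch.randn(d_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5_000):
    h = data @ W.T                             # hidden vector per data point
    x_hat = torch.relu(h @ W + b)              # tied-weight reconstruction
    loss = ((x_hat - data) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# With a 2D bottleneck, the per-datapoint hidden vectors tend to end up roughly
# equal norm and equally spaced in angle -- i.e. arranged as a polygon.
print((data @ W.T).detach())
```

The "polygon" claim is the thing to check by eye: plot the rows of the final hidden matrix and see how the points arrange themselves.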