Chris Olah · Oct 5 · 18 tweets
If you'd asked me a year ago, superposition would have been by far the reason I was most worried that mechanistic interpretability would hit a dead end.

I'm now very optimistic. I'd go as far as saying it's now primarily an engineering problem -- hard, but less of a fundamental risk.
Well-trained sparse autoencoders (scale matters!) can decompose a one-layer model into very nice, interpretable features.
They might not be 100% monosemantic, but they're damn close. We do detailed case studies and I feel comfortable saying they're at least as monosemantic as InceptionV1 curve detectors (distill.pub/2020/circuits/…).
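If you want the idea in code, here's a minimal sketch of this kind of sparse autoencoder over MLP activations -- my illustration, not the paper's actual implementation; the dimensions, L1 coefficient, and random "activations" are placeholders, and the real training setup differs in details.

```python
# Minimal sparse autoencoder sketch: reconstruct MLP activations through an
# overcomplete, sparsely-activating dictionary of features.
# Illustrative only: d_mlp, d_dict, l1_coeff, and the random activations
# stand in for real data and hyperparameters.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_mlp: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_mlp, d_dict)   # overcomplete: d_dict >> d_mlp
        self.decoder = nn.Linear(d_dict, d_mlp)

    def forward(self, x):
        f = torch.relu(self.encoder(x))           # non-negative feature activations
        x_hat = self.decoder(f)                   # reconstruction of the MLP activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    recon = ((x - x_hat) ** 2).mean()             # reconstruct the activations...
    sparsity = f.abs().sum(dim=-1).mean()         # ...while keeping features sparse
    return recon + l1_coeff * sparsity

d_mlp, d_dict = 512, 4096
sae = SparseAutoencoder(d_mlp, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
activations = torch.randn(8192, d_mlp)            # placeholder; real activations come from the model

for step in range(100):
    x_hat, f = sae(activations)
    loss = sae_loss(activations, x_hat, f)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The columns of `sae.decoder.weight` (one direction in activation space per dictionary entry) are the candidate interpretable features.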
But it's not just cherry-picked features. The vast majority of the features are this nice. And you can check for yourself -- we published all the features in an interface you can use to explore them!

transformer-circuits.pub/2023/monoseman…
There's a lot that could be said, but one of the coolest things to me was that we found "finite state automata"-like assemblies of features. The simplest case is features which cause themselves to fire more.

(Keep in mind that this is a small one-layer model -- it's dumb!)
A slightly more complex system models "all caps snake case" variables.
This four-node system models HTML.
And this one models IRC messages.
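One rough way to go looking for this kind of structure yourself (my sketch, not the paper's method): binarize the feature activations on model-generated text and ask which features tend to fire on the token after a given feature fires. Self-loops in that graph are the self-exciting features mentioned above. The `feature_acts` array and firing threshold below are placeholders.

```python
# Rough sketch: estimate a feature-to-feature "transition" matrix from
# SAE feature activations on sampled text.
import numpy as np

def feature_transition_graph(feature_acts: np.ndarray, threshold: float = 0.5):
    active = feature_acts > threshold                    # [n_tokens, n_features] boolean firing
    prev, nxt = active[:-1], active[1:]                   # consecutive token positions
    cooccur = prev.T.astype(float) @ nxt.astype(float)    # counts of (i fires at t, j fires at t+1)
    base = prev.sum(axis=0, keepdims=True).T.clip(min=1)  # how often each feature i fired at all
    return cooccur / base                                 # est. P(j fires at t+1 | i fired at t)

feature_acts = np.random.rand(10_000, 512)                # placeholder activations
P = feature_transition_graph(feature_acts)
self_exciting = np.argsort(np.diag(P))[::-1][:10]          # features that keep themselves firing
print(self_exciting)
```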
It's also worth noting that this is part of a broader trend of recent progress on attacking superposition. I'd especially highlight contemporaneous work by Cunningham et al., which has some really nice confirmatory results:
OK, one other thing I can't resist adding is that we found region neurons (e.g. Australia, Canada, Africa, etc. neurons), similar to these old results in CLIP and also to the recent results by @wesg52 et al.

Universality!
Sorry, the link above was incorrect. It should have been transformer-circuits.pub/2023/monoseman…
There's much, much more in the paper (would it really be an Anthropic interpretability paper if it wasn't 100+ pages??), but I'll let you actually look at it:

transformer-circuits.pub/2023/monoseman…
@zefu_superhzf (2) Although we can understand attention heads, I suspect that something like "attention head superposition" is going on and that we could give a simpler explanation if we could resolve it.
@zefu_superhzf See more discussion in the paper.
@SussilloDavid I once saw a paper (from Surya Ganguli?) that suggested maybe the brain does compressed sensing when communicating between different regions through bottlenecks.

This is much closer to the thing we're imagining with superposition!
@SussilloDavid But the natural extension of this is that actually, maybe all neural activations are in a compressed form (and then the natural thing to do is dictionary learning).
@SussilloDavid I'd be very curious to know what happens if someone takes biological neural recordings over lots of data and does a dictionary learning / sparse autoencoder expansion...
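A minimal sketch of that experiment, with off-the-shelf dictionary learning standing in for a carefully tuned sparse autoencoder; `recordings` and the hyperparameters below are placeholders, not real data.

```python
# Sketch: dictionary-learning expansion of neural recordings.
# `recordings` is assumed to be a [n_timepoints, n_neurons] activity matrix.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

recordings = np.random.randn(5_000, 200)         # placeholder for real recordings

dict_learner = MiniBatchDictionaryLearning(
    n_components=800,                             # overcomplete: many more components than neurons
    alpha=1.0,                                    # sparsity penalty on the codes
    batch_size=256,
    random_state=0,
)
codes = dict_learner.fit_transform(recordings)    # [n_timepoints, n_components] sparse codes
dictionary = dict_learner.components_             # [n_components, n_neurons] candidate "features"
```

The interesting question is whether the rows of `dictionary` -- directions in neuron space -- turn out to be more interpretable than individual neurons.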
@SussilloDavid I think superposition is much more specific. You can think of it as a sub-type of "distributed representation". I tried to pin this down here: transformer-circuits.pub/2023/superposi…
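To make "superposition" concrete, here's a toy numerical illustration (mine, not from the linked paper): more sparse features than dimensions, each stored along a nearly-orthogonal direction, so a linear readout recovers the active features plus a little interference.

```python
# Toy superposition: 8 sparse features stored in a 4-dimensional space.
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 8, 4
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)     # one unit direction per feature

x = np.zeros(n_features)
x[[1, 6]] = 1.0                                    # sparse input: only two features active
h = x @ W                                          # compressed 4-dim "activation"

recovered = h @ W.T                                # naive linear readout of every feature
print(np.round(recovered, 2))                      # active features stand out; the rest get
                                                   # small nonzero "interference" terms
```

That interference is exactly what a well-trained sparse autoencoder is trying to undo.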

More from @ch402

Jun 7
One of the ideas I find most useful from @AnthropicAI's Core Views on AI Safety post (anthropic.com/index/core-vie…) is thinking in terms of a distribution over safety difficulty.

Here's a cartoon picture I like for thinking about it: a graph titled "How Hard Is AI Safety?"
A lot of AI safety discourse focuses on very specific models of AI and AI safety. These are interesting, but I don't know how I could be confident in any one.

I prefer to accept that we're just very uncertain. One important axis of that uncertainty is roughly "difficulty".
In this lens, one can see a lot of safety research as "eating marginal probability" of things going well, progressively addressing harder and harder safety scenarios.
May 18
Very cool to see more parallels between neuroscience and mechanistic interpretability!

There's a long history of us discovering analogs of long-known results in neuroscience -- it's exciting to see a bit of the reverse! (Also neat to see use of interpretability methods!)
When we were working on vision models, we constantly found that features we were discovering, such as curve detectors (distill.pub/2020/circuits/…) and multimodal neurons (distill.pub/2021/multimoda…), had existing parallels in neuroscience.
These new results follow a recent trend of neuroscience results turning out to have parallels with results found in neural networks!

Mar 23
High-Low frequency detectors found in biology!

Very cool to see mechanistic interpretability seeming to predict in advance the existence of biological neuroscience phenomena. 🔬
(Caveat that I'm not super well read in neuroscience. When working on high-low frequency detectors we asked a number of neuroscientists if they'd seen such neurons before, and they hadn't. But very possible something was missed!)
Our original paper on high-low frequency detectors in artificial neural networks. distill.pub/2020/circuits/…
Mar 9
In 1939, physicists discussed voluntarily adopting secrecy in atomic physics. Reading Rhodes' book, one can hear three "camps":

- Caution
- Scientific Humility
- Openness Idealism

I feel like these are the same camps I hear in discussion of AI (e.g. open-sourcing models).
I ultimately lean towards caution, but I empathize with all three of these stances.
Fermi advocated for being conservative -- for scientific humility. Knowing that the atomic bomb turned out to be real, this seems strange. But putting myself in his position, it's so easy to empathize with him.
Jan 5
The former Distill editor in me is quite pleased with how our experiment with invited comments on Transformer Circuits articles has been turning out!
For context, several of our recent articles have had "invited comments" which were highly substantive, curated comments from engaged researchers.
Why I think this is exciting:

(1) Many of them were partial or full replications! That's an _enormous_ amount of work, and way way more intense than normal peer review.
Jan 5
I'm pretty excited about this as a step towards a mechanistic theory of memorization and overfitting.
Historically, I think I've often been tempted to think of overfit models as pathological things that need to be trained better in order for us to interpret.

For example, in this paper we found that overfitting seemed to cause less interpretable features: distill.pub/2020/understan…
These results suggest that overfitting and memorization can be *beautiful*. There can be simple, elegant mechanisms: data points arranged as polygons. The heavy use of superposition just makes this hard to see.

Of course, this is only in a toy model. But it's very suggestive!
