If you'd asked me a year ago, superposition would have been by far the reason I was most worried that mechanistic interpretability would hit a dead end.
I'm now very optimistic. I'd go as far as saying it's now primarily an engineering problem -- hard, but posing less fundamental risk.
Well trained sparse autoencoders (scale matters!) can decompose a one-layer model into very nice, interpretable features.
They might not be 100% monosemantic, but they're damn close. We do detailed case studies and I feel comfortable saying they're at least as monosemantic as InceptionV1 curve detectors (distill.pub/2020/circuits/…).
But it's not just cherry picked features. The vast majority of the features are this nice. And you can check for yourself - we published all the features in an interface you can use to explore them!
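To make this concrete, here's a minimal sketch of the kind of sparse autoencoder setup I mean (hypothetical code and hyperparameters, not our actual training setup): an overcomplete ReLU autoencoder trained to reconstruct MLP activations, with an L1 penalty pushing the feature activations to be sparse.

```python
# Minimal sketch (hypothetical hyperparameters) of a sparse autoencoder that
# decomposes d_model-dim MLP activations into an overcomplete set of sparse features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse features.
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

The features are then the decoder's columns: directions in activation space whose sparse coefficients are what you inspect and label.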
There's a lot that could be said, but one of the coolest things to me was that we found "finite state automata"-like assemblies of features. The simplest case is features which cause themselves to fire more.
(Keep in mind that this is a small one-layer model -- it's dumb!)
A slightly more complex system models "all caps snake case" variables.
This four node system models HTML.
And this one models IRC messages.
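To make the "finite state automata" framing concrete, here's a rough sketch (hypothetical code, not from the paper) of one way to look for this kind of structure: estimate how often each feature firing on one token predicts each feature firing on the next token.

```python
# Hypothetical sketch: estimate feature-to-feature "transition probabilities"
# across adjacent tokens -- an FSA-flavored summary of feature dynamics.
import numpy as np

def feature_transitions(acts, threshold=0.0):
    """acts: [n_tokens, n_features] feature activations over a long corpus.
    Returns P[i, j] ~= P(feature j fires at t+1 | feature i fires at t)."""
    fired = acts > threshold
    prev, nxt = fired[:-1].astype(float), fired[1:].astype(float)
    counts = prev.T @ nxt                   # co-firing counts across adjacent steps
    totals = prev.sum(axis=0)[:, None]      # how often each feature fired
    return counts / np.maximum(totals, 1.0)
```

A feature that "causes itself to fire more" would show up as a large diagonal entry, and the snake-case / HTML / IRC loops as small groups of features with high transition probabilities among themselves. (This is only a correlational proxy, of course, not a causal claim.)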
It's also worth noting that this is part of a broader trend of a lot of recent progress on attacking superposition. I'd especially highlight contemporaneous work by Cunningham et al which has some really nice confirmatory results:
OK, one other thing I can't resist adding is that we found region neurons (eg. Australia, Canada, Africa, etc neurons) similar to these old results in CLIP () and also to the recent results by @wesg52 et al ().
There's much, much more in the paper (would it really be an Anthropic interpretability paper if it wasn't 100+ pages??), but I'll let you actually look at it:
@zefu_superhzf (2) Although we can understand attention heads, I suspect that something like "attention head superposition" is going on and that we could give a simpler explanation if we could resolve it.
@zefu_superhzf See more discussion in the paper.
@SussilloDavid I once saw a paper (from Surya Ganguli?) that suggested maybe the brain does compressed sensing when communicating between different regions through bottlenecks.
This is much closer to the thing we're imagining with superposition!
@SussilloDavid But the natural extension of this is that actually, maybe all neural activations are in a compressed form (and then the natural thing to do is dictionary learning).
@SussilloDavid I'd be very curious to know what happens if someone takes biological neural recordings over lots of data and does a dictionary learning / sparse autoencoder expansion.
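To spell out what I mean (hypothetical sketch, glossing over all the real preprocessing a neuroscientist would do), something as simple as off-the-shelf dictionary learning on a timepoints-by-neurons matrix:

```python
# Hypothetical sketch: a sparse dictionary-learning expansion of neural
# recordings, analogous to the sparse autoencoder we ran on MLP activations.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# recordings: [n_timepoints, n_neurons] firing rates (placeholder data here)
recordings = np.random.rand(5_000, 100)

# Learn an overcomplete dictionary: more "features" than neurons, with each
# timepoint explained by a sparse combination of them.
dl = MiniBatchDictionaryLearning(n_components=1_000, alpha=1.0)
codes = dl.fit_transform(recordings)    # [n_timepoints, n_components], sparse
features = dl.components_               # [n_components, n_neurons] dictionary
```

The interesting question is whether the learned dictionary elements turn out to be more interpretable than the raw neurons, the way they are for the transformer's MLP activations.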
@SussilloDavid I think superposition is much more specific. You can think of it as a sub-type of "distributed representation". I tried to pin this down here: transformer-circuits.pub/2023/superposi…
One of the ideas I find most useful from @AnthropicAI's Core Views on AI Safety post (anthropic.com/index/core-vie…) is thinking in terms of a distribution over safety difficulty.
Here's a cartoon picture I like for thinking about it:
A lot of AI safety discourse focuses on very specific models of AI and AI safety. These are interesting, but I don't know how I could be confident in any one.
I prefer to accept that we're just very uncertain. One important axis of that uncertainty is roughly "difficulty".
In this lens, one can see a lot of safety research as "eating marginal probability" of things going well, progressively addressing harder and harder safety scenarios.
Very cool to see more parallels between neuroscience and mechanistic interpretability!
There's a long history of us discovering analogs of long-known results in neuroscience -- it's exciting to see a bit of the reverse! (Also neat to see use of interpretability methods!)
When we were working on vision models, we constantly found that features we were discovering, such as curve detectors (distill.pub/2020/circuits/…) and multimodal neurons (distill.pub/2021/multimoda…), had existing parallels in neuroscience.
These new results follow a recent trend of places where neuroscience results are starting to have parallels with results found in neural networks!
(Caveat that I'm not super well read in neuroscience. When working on high-low frequency detectors we asked a number of neuroscientists if they'd seen such neurons before, and they hadn't. But very possible something was missed!)
Our original paper on high-low frequency detectors in artificial neural networks. distill.pub/2020/circuits/…
I feel like these are the same camps I hear in discussion of AI (eg. open sourcing models).
I ultimately lean towards caution, but I empathize with all three of these stances.
Fermi advocated for being conservative -- for scientific humility. Knowing in hindsight that the atomic bomb did become real, this seems strange. But putting myself in his position, it's so easy to empathize with him.
Historically, I think I've often been tempted to think of overfit models as pathological things that need to be trained better in order for us to interpret them.
For example, in this paper we found that overfitting seemed to cause less interpretable features: distill.pub/2020/understan…
These results suggest that overfitting and memorization can be *beautiful*. There can be simple, elegant mechanisms: data points arranged as polygons. The heavy use of superposition just makes this hard to see.
Of course, this is only in a toy model. But it's very suggestive!
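For intuition, here's a rough sketch of the kind of toy experiment involved (hypothetical dimensions and hyperparameters, loosely following the toy-model setup): reconstruct a handful of fixed data points through a 2D bottleneck with tied weights, then look at the hidden vector each data point gets.

```python
# Hypothetical sketch of a memorizing toy model: a few fixed data points,
# reconstructed through a tiny bottleneck with tied weights.
import torch

n_points, n_features, d_hidden = 6, 1000, 2
data = torch.rand(n_points, n_features)        # the entire "training set"

W = torch.nn.Parameter(0.1 * torch.randn(d_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5_000):
    h = data @ W.T                             # hidden vector per data point
    x_hat = torch.relu(h @ W + b)              # tied-weight reconstruction
    loss = ((x_hat - data) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# With a 2D bottleneck, the per-datapoint hidden vectors tend to end up roughly
# equal norm and equally spaced in angle -- i.e. arranged as a polygon.
print((data @ W.T).detach())
```

The "polygon" claim is the thing to check by eye: plot the rows of the final hidden matrix and see how the points arrange themselves.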