Neel Nanda
Mechanistic Interpretability lead @DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!
Dec 23, 2023
My first @GoogleDeepMind project: How do LLMs recall facts?

Early MLP layers act as a lookup table, with significant superposition! They recognise entities and produce their attributes as directions. We suggest viewing fact recall as a black box making "multi-token embeddings".

Our hope was to understand a circuit in superposition at the parameter level, but we failed at this. We carefully falsify several naive hypotheses, but fact recall seems pretty cursed. We can black-box the lookup part, so this doesn't sink the mech interp agenda, but it's a blow.
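To make the "attributes as directions" framing concrete, here is a minimal linear-probe sketch. It is purely illustrative: the activations below are random stand-ins for the residual stream you would cache after the early MLP layers at the entity's final token (e.g. with TransformerLens), and all sizes are made up.

```python
# Minimal sketch (not the paper's code): if early MLP layers turn an entity's
# tokens into an "attribute-as-direction" embedding, a *linear* probe on the
# residual stream after those layers should recover the attribute.
import torch

d_model, n_attrs, n_entities = 512, 3, 600        # hypothetical sizes
resid = torch.randn(n_entities, d_model)          # stand-in "multi-token embeddings"
attr = torch.randint(0, n_attrs, (n_entities,))   # e.g. athlete -> sport label

probe = torch.nn.Linear(d_model, n_attrs)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    loss = torch.nn.functional.cross_entropy(probe(resid), attr)
    opt.zero_grad(); loss.backward(); opt.step()

# Each row of probe.weight is a candidate "attribute direction". This prints
# accuracy on the (random) training data; a real check needs held-out entities.
acc = (probe(resid).argmax(-1) == attr).float().mean()
print(f"train probe accuracy: {acc:.2f}")
```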
Oct 24, 2023
I recently did an open-source replication of @AnthropicAI's new dictionary learning paper, which was just published as a public comment! It's a great paper and I'm glad the results hold up.

Here's a tutorial to use my autoencoders and interpret a feature
colab.research.google.com/drive/1u8larhp…
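If you haven't seen the setup before, a sparse autoencoder here is just an overcomplete ReLU encoder/decoder trained with an L1 penalty, so each learned "feature" is a direction in activation space. A minimal sketch (my illustration, not the notebook's code; hyperparameters and the stand-in activations are made up):

```python
# Minimal sparse autoencoder sketch: reconstruct MLP activations through an
# overcomplete ReLU bottleneck with an L1 sparsity penalty on the features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act: int, d_feats: int):
        super().__init__()
        self.enc = nn.Linear(d_act, d_feats)
        self.dec = nn.Linear(d_feats, d_act)

    def forward(self, x):
        feats = torch.relu(self.enc(x))      # sparse feature activations
        return self.dec(feats), feats

d_act, d_feats, l1_coeff = 2048, 16384, 3e-4  # hypothetical hyperparameters
sae = SparseAutoencoder(d_act, d_feats)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(4096, d_act)               # stand-in for cached MLP activations
recon, feats = sae(acts)
loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().sum(-1).mean()
loss.backward(); opt.step()
```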
And you can read the blog post here:
lesswrong.com/posts/fKuugaxt…

I was particularly curious about how neuron-sparse the features are. Strikingly, I find that *most* (92%) of features are dense in the neuron basis!
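As a rough illustration of what that check looks like (my own operationalisation with stand-in weights; the post may slice this differently):

```python
# Call a feature "neuron-sparse" if a handful of neurons carry most of its
# decoder direction's norm, and "dense" otherwise. W_dec is a stand-in here.
import torch

W_dec = torch.randn(16384, 2048)                 # [n_features, n_neurons]
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True) # unit-norm feature directions

top_k = 10
top_frac = W_dec.abs().topk(top_k, dim=-1).values.pow(2).sum(-1)  # share of squared norm
dense = top_frac < 0.5                           # <50% of norm in top 10 neurons
print(f"{dense.float().mean():.0%} of features look dense in the neuron basis")
```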
Sep 24, 2023
This paper's been doing the rounds, so I thought I'd give a mechanistic interpretability take on what's going on here!

The core intuition is that "When you see 'A is', output B" is implemented as an asymmetric look-up table, with an entry for A->B.
B->A would be a separate entry.

The key question to ask with a mystery like this is what algorithms the model needs to get the correct answer, and how these can be implemented in transformer weights. These are what get reinforced when fine-tuning.
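One way to see why the lookup is asymmetric: a toy linear associative memory that stores each fact as a rank-1 entry value @ key^T. This is my illustration of the intuition, not anything from the paper:

```python
# Store facts in a linear associative memory W = sum_i value_i @ key_i^T.
# Recall works for key -> value but gives ~nothing for value -> key, which is
# the "asymmetric lookup table" intuition.
import torch

d, n_facts = 1024, 50
keys = torch.nn.functional.normalize(torch.randn(n_facts, d), dim=-1)    # "A is" directions
values = torch.nn.functional.normalize(torch.randn(n_facts, d), dim=-1)  # "B" directions
W = values.T @ keys                         # one rank-1 entry per A -> B fact

a, b = keys[0], values[0]
print("A -> B recall:", torch.cosine_similarity(W @ a, b, dim=0).item())  # close to 1
print("B -> A recall:", torch.cosine_similarity(W @ b, a, dim=0).item())  # close to 0
```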
Mar 31, 2023
In recent work @ke_li_2021 trained Othello-GPT to predict the next move in random legal games of Othello & found an emergent model of the board! Surprisingly, only non-linear probes worked. I found a linear model of which squares have the PLAYER'S colour!
lesswrong.com/s/nhGNHyJHbrof…

Why was it surprising that linear probes failed before? To do mech interp we must really understand how models represent thoughts, and we often assume linear representations - features as directions. But this could be wrong; we don't have enough data on circuits to be confident!
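For concreteness, the probe setup is roughly this (stand-in data rather than real Othello-GPT activations; all sizes are made up). The point is that the *linear* probe works once the labels are "mine/theirs" relative to the current player, rather than "black/white":

```python
# One linear classifier per board square, mapping the residual stream to
# {empty, current player's colour, opponent's colour}.
import torch

d_model, n_squares, n_classes, n_positions = 512, 64, 3, 10_000   # hypothetical sizes
resid = torch.randn(n_positions, d_model)                      # stand-in residual stream
board = torch.randint(0, n_classes, (n_positions, n_squares))  # per-square labels

probe = torch.nn.Linear(d_model, n_squares * n_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
logits = probe(resid).view(n_positions, n_squares, n_classes)
loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), board.flatten())
loss.backward(); opt.step()
```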
Mar 17, 2023
New blog post on an interpretability technique I helped develop when at @AnthropicAI: Attribution patching. This is a ludicrously fast gradient-based approximation to activation patching - in three passes, you could patch all 4.7M of GPT-3's neurons!
alignmentforum.org/posts/gtLLBhzQ…

Activation patching/causal tracing was introduced by @jesse_vig and used in @davidbau & @mengk20's ROME and @kevrowan's IOI work. The technique "diffs" two prompts, identical apart from a key detail - patching individual activations from one to the other isolates the key circuits.
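The approximation itself is just a first-order Taylor expansion around the corrupted run: the effect of patching an activation is estimated as (clean_act - corrupt_act) · d(metric)/d(act), so one forward and one backward pass score every activation at once. A toy sketch of the mechanics (my illustration, not the post's code):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
clean_x, corrupt_x = torch.randn(1, 16), torch.randn(1, 16)
acts, grads = {}, {}

def fwd_hook(name):
    def hook(mod, inp, out):
        acts[name] = out
        out.register_hook(lambda g: grads.__setitem__(name, g))  # grab d(metric)/d(act)
    return hook

for name, mod in model.named_children():
    mod.register_forward_hook(fwd_hook(name))

# Clean run: just record activations (we never backprop through this run).
model(clean_x)
clean_acts = {k: v.detach().clone() for k, v in acts.items()}

# Corrupted run: record activations and gradients of the metric.
metric = model(corrupt_x)[0, 0]        # in real use, e.g. a logit difference
metric.backward()

for name in clean_acts:
    # Linear estimate of "metric if this activation were patched from clean".
    attribution = ((clean_acts[name] - acts[name]) * grads[name]).sum().item()
    print(f"{name}: estimated effect of patching = {attribution:.3f}")
```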
Jan 21, 2023
Excited to announce that our work, Progress Measures for Grokking via Mechanistic Interpretability, has been accepted as a spotlight at ICLR 23! (despite being rejected from arXiv twice!)
This was significantly refined from my prior work, thoughts in 🧵
arxiv.org/abs/2301.05217

We trained a transformer to grok modular addition and reverse-engineered it. We found that it had learned an algorithm based on a Fourier transform and trig identities, so cleanly that we can read it off the weights!

I did not expect this algorithm! I found it by reverse-engineering.
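A tiny numerical check of the flavour of algorithm described: the logit for answer c is a sum of cos(w(a+b-c)) over a few key frequencies, which is maximised exactly at c = (a+b) mod p via the identity cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc). The specific frequencies below are made up; each trained model picks its own.

```python
import numpy as np

p = 113
freqs = 2 * np.pi * np.array([14, 35, 41, 52, 3]) / p   # hypothetical key frequencies
a, b = 47, 92
c = np.arange(p)
logits = np.cos(freqs[:, None] * (a + b - c)).sum(0)
assert logits.argmax() == (a + b) % p
print(logits.argmax(), (a + b) % p)   # both 26
```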
Dec 31, 2022
Great question! In all honesty, we just don't really know. I consider this an open scientific question, and really want to get more data on it! Some thoughts: 🧵

IMO, by far the strongest success here is the induction heads work - see my thread about how wild those results are & my explainer. We found a simple circuit in a 2L attn-only model, which recurs in every model studied (up to 13B), and plays a crucial role.
dynalist.io/d/n2ZWtnoYHrU1…
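To make "recurs in every model studied" concrete, here is roughly how the standard induction-head check works: feed a repeated random sequence and score each head by how much it attends from a token back to the token just after that token's previous occurrence. A sketch assuming TransformerLens's usual API; the 0.4 threshold and sizes are arbitrary.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
seq_len, batch = 50, 4
rand = torch.randint(0, model.cfg.d_vocab, (batch, seq_len))
tokens = torch.cat([rand, rand], dim=1)              # [A B C ... A B C ...]
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]                # [batch, head, query, key]
    # attention from second-half tokens back to (previous occurrence + 1),
    # i.e. the diagonal offset by -(seq_len - 1)
    diag = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)[..., 1:]
    score = diag.mean(dim=(0, -1))
    for head, s in enumerate(score.tolist()):
        if s > 0.4:
            print(f"L{layer}H{head}: induction score {s:.2f}")
```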
Dec 27, 2022
New walkthrough: Toy Models of Superposition! We read through the paper, discuss the high-level intuitions, limitations, and takeaways of this great @AnthropicAI paper. I'm extremely impressed by the rich insights from so simple a model! (why tetrahedra?!)
We mostly focused on the conceptual takeaways in this video; let me know if you want a part 2 where we finish going through the details of the paper!

If you haven't come across the work before, check out Anthropic's thread summarising the paper.
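To give a flavour of how simple the model is: n sparse features get squeezed through m < n dimensions and recovered with a ReLU, and superposition shows up in the geometry of W's columns (antipodal pairs, tetrahedra, ...). A sketch of the setup (my own, not Anthropic's code; sizes and sparsity are arbitrary):

```python
import torch

n_feats, d_hidden, sparsity = 5, 2, 0.9
importance = 0.9 ** torch.arange(n_feats)             # earlier features matter more

W = torch.nn.Parameter(torch.randn(d_hidden, n_feats) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_feats))
opt = torch.optim.Adam([W, b], lr=1e-2)

for _ in range(5_000):
    # sparse synthetic features: each is nonzero with probability 1 - sparsity
    x = torch.rand(1024, n_feats) * (torch.rand(1024, n_feats) > sparsity)
    x_hat = torch.relu(x @ W.T @ W + b)                # compress then reconstruct
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(W.data.T @ W.data)   # off-diagonal structure = features in superposition
```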
Nov 22, 2022
New paper walkthrough! @charles_irl and I read through, discuss and share intuitions for In-Context Learning and Induction Heads by @catherineols, @nelhage, me and @ch402. This is probably the most surprising paper I've been involved in; a 🧵 of takeaways:
This video should be friendly to people who haven't read the paper! Let me know if you want a more in-the-weeds part 2. And if you enjoy this, check out my walkthrough of the prequel paper, A Mathematical Framework for Transformer Circuits!
Nov 2, 2022
@kevrowan's paper is one of the coolest interp papers I've seen in a while, super excited about this line of work! They reverse-engineer a 26-head circuit in GPT-2 Small, with 7 types of head and 4 layers composing. And @kevrowan is still in high school!! A 🧵 of my takeaways:

1: Deepest takeaway: This is possible! This is another data point that language model behaviour follows an interpretable algorithm, and we can localise where it happens. If @ch402's circuits agenda is wrong, then much of my research is useless, so it's nice to get more evidence!
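For context, the indirect object identification (IOI) task and its metric look roughly like this (a sketch assuming TransformerLens's standard API; the prompt and names are just an example):

```python
# Given "When Mary and John went to the store, John gave a drink to", the model
# should prefer " Mary" over " John"; the circuit is found by tracking this
# logit difference under interventions.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When Mary and John went to the store, John gave a drink to"
logits = model(prompt)[0, -1]                  # logits at the final position
io = model.to_single_token(" Mary")
s = model.to_single_token(" John")
print("logit diff (IO - S):", (logits[io] - logits[s]).item())   # positive if correct
```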
Oct 14, 2022
I think A Mathematical Framework for Transformer Circuits is the most awesome paper I've been involved in, but it's very long and dense, so I think a lot of people struggle with it. As an experiment, you can watch me read through the paper and give takes!

I try to clarify points that are confusing, point to bits that are or are not worth the effort of engaging with deeply, explain bits I think are particularly exciting, and generally convey what I hope people take away from the paper!
Sep 15, 2022
Great post by @aslvrstn! My main takeaway: Floating Points make Linear Algebra a leaky abstraction for deep learning

Some context in 🧵 1/
aslvrstn.com/posts/transfor…

In @Tim_Dettmers' excellent work on running language models in int8, the part I found most confusing and surprising was that he found emergent features in the residual stream in the STANDARD basis (i.e. the representation of the residual stream as floating points). 2/
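A quick way to see those standard-basis features for yourself: look for residual stream *coordinates* whose magnitudes are consistently huge across tokens, which is what breaks naive int8 quantisation. My illustration, assuming TransformerLens's usual API; the ~6.0 cutoff is the rough outlier magnitude threshold from the int8 work, and the layer choice is arbitrary.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("The quick brown fox jumps over the lazy dog.")
resid = cache["resid_post", 6][0]                   # [seq, d_model] at a middle layer
per_dim_max = resid.abs().max(dim=0).values         # max |activation| per coordinate
outliers = (per_dim_max > 6.0).nonzero().flatten()  # candidate outlier dimensions
print("outlier dimensions:", outliers.tolist())
```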
Aug 15, 2022
I've spent the past few months exploring @OpenAI's grokking result through the lens of mechanistic interpretability. I fully reverse-engineered the modular addition model, and looked at what it does when training. So what's up with grokking? A 🧵... (1/17)
alignmentforum.org/posts/N6WM6hs7…

Takeaway 1: There's a deep relationship between grokking and phase changes. Phase changes are abrupt changes in capabilities during training, like we see when training a 2L attn-only transformer to predict repeated subsequences. (2/17)