Neel Nanda
Mechanistic Interpretability lead @DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!
Dec 23, 2023
My first @GoogleDeepMind project: How do LLMs recall facts?

Early MLP layers act as a lookup table, with significant superposition! They recognise entities and produce their attributes as directions. We suggest viewing fact recall as a black box making "multi-token embeddings".

Our hope was to understand a circuit in superposition at the parameter level, but we failed at this. We carefully falsify several naive hypotheses, but fact recall seems pretty cursed. We can black-box the lookup part, so this doesn't sink the mech interp agenda, but it's a blow.
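To make the "attributes as directions" framing concrete, here is a minimal linear-probe sketch. It is purely illustrative: the activations below are random stand-ins for the residual stream you would cache after the early MLP layers at the entity's final token (e.g. with TransformerLens), and all sizes are made up.

```python
# Minimal sketch (not the paper's code): if early MLP layers turn an entity's
# tokens into an "attribute-as-direction" embedding, a *linear* probe on the
# residual stream after those layers should recover the attribute.
import torch

d_model, n_attrs, n_entities = 512, 3, 600        # hypothetical sizes
resid = torch.randn(n_entities, d_model)          # stand-in "multi-token embeddings"
attr = torch.randint(0, n_attrs, (n_entities,))   # e.g. athlete -> sport label

probe = torch.nn.Linear(d_model, n_attrs)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    loss = torch.nn.functional.cross_entropy(probe(resid), attr)
    opt.zero_grad(); loss.backward(); opt.step()

# Each row of probe.weight is a candidate "attribute direction". This prints
# accuracy on the (random) training data; a real check needs held-out entities.
acc = (probe(resid).argmax(-1) == attr).float().mean()
print(f"train probe accuracy: {acc:.2f}")
```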
Oct 24, 2023
I recently did an open-source replication of @AnthropicAI's new dictionary learning paper, which was just published as a public comment! It's a great paper and I'm glad the results hold up.

Here's a tutorial to use my autoencoders and interpret a feature
colab.research.google.com/drive/1u8larhp…
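If you haven't seen the setup before, a sparse autoencoder here is just an overcomplete ReLU encoder/decoder trained with an L1 penalty, so each learned "feature" is a direction in activation space. A minimal sketch (my illustration, not the notebook's code; hyperparameters and the stand-in activations are made up):

```python
# Minimal sparse autoencoder sketch: reconstruct MLP activations through an
# overcomplete ReLU bottleneck with an L1 sparsity penalty on the features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act: int, d_feats: int):
        super().__init__()
        self.enc = nn.Linear(d_act, d_feats)
        self.dec = nn.Linear(d_feats, d_act)

    def forward(self, x):
        feats = torch.relu(self.enc(x))      # sparse feature activations
        return self.dec(feats), feats

d_act, d_feats, l1_coeff = 2048, 16384, 3e-4  # hypothetical hyperparameters
sae = SparseAutoencoder(d_act, d_feats)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(4096, d_act)               # stand-in for cached MLP activations
recon, feats = sae(acts)
loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().sum(-1).mean()
loss.backward(); opt.step()
```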
And you can read the blog post here:
lesswrong.com/posts/fKuugaxt…

I was particularly curious about how neuron-sparse the features are. Strikingly, I find that *most* (92%) of features are dense in the neuron basis!
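As a rough illustration of what that check looks like (my own operationalisation with stand-in weights; the post may slice this differently):

```python
# Call a feature "neuron-sparse" if a handful of neurons carry most of its
# decoder direction's norm, and "dense" otherwise. W_dec is a stand-in here.
import torch

W_dec = torch.randn(16384, 2048)                 # [n_features, n_neurons]
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True) # unit-norm feature directions

top_k = 10
top_frac = W_dec.abs().topk(top_k, dim=-1).values.pow(2).sum(-1)  # share of squared norm
dense = top_frac < 0.5                           # <50% of norm in top 10 neurons
print(f"{dense.float().mean():.0%} of features look dense in the neuron basis")
```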
Sep 24, 2023
This paper's been doing the rounds, so I thought I'd give a mechanistic interpretability take on what's going on here!

The core intuition is that "When you see 'A is', output B" is implemented as an asymmetric look-up table, with an entry for A->B.
B->A would be a separate entry.

The key question to ask with a mystery like this is what algorithms the model needs to get the correct answer, and how these can be implemented in transformer weights. These are what get reinforced when fine-tuning.
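One way to see why the lookup is asymmetric: a toy linear associative memory that stores each fact as a rank-1 entry value @ key^T. This is my illustration of the intuition, not anything from the paper:

```python
# Store facts in a linear associative memory W = sum_i value_i @ key_i^T.
# Recall works for key -> value but gives ~nothing for value -> key, which is
# the "asymmetric lookup table" intuition.
import torch

d, n_facts = 1024, 50
keys = torch.nn.functional.normalize(torch.randn(n_facts, d), dim=-1)    # "A is" directions
values = torch.nn.functional.normalize(torch.randn(n_facts, d), dim=-1)  # "B" directions
W = values.T @ keys                         # one rank-1 entry per A -> B fact

a, b = keys[0], values[0]
print("A -> B recall:", torch.cosine_similarity(W @ a, b, dim=0).item())  # close to 1
print("B -> A recall:", torch.cosine_similarity(W @ b, a, dim=0).item())  # close to 0
```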
Mar 31, 2023
In recent work @ke_li_2021 trained Othello-GPT to predict the next move in random legal games of Othello & found an emergent model of the board! Surprisingly, only non-linear probes worked. I found a linear model of which squares have the PLAYER'S colour!
lesswrong.com/s/nhGNHyJHbrof…

Why was it surprising that linear probes failed before? To do mech interp we must really understand how models represent thoughts, and we often assume linear representations - features as directions. But this could be wrong; we don't have enough data on circuits to be confident!
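For concreteness, the probe setup is roughly this (stand-in data rather than real Othello-GPT activations; all sizes are made up). The point is that the *linear* probe works once the labels are "mine/theirs" relative to the current player, rather than "black/white":

```python
# One linear classifier per board square, mapping the residual stream to
# {empty, current player's colour, opponent's colour}.
import torch

d_model, n_squares, n_classes, n_positions = 512, 64, 3, 10_000   # hypothetical sizes
resid = torch.randn(n_positions, d_model)                      # stand-in residual stream
board = torch.randint(0, n_classes, (n_positions, n_squares))  # per-square labels

probe = torch.nn.Linear(d_model, n_squares * n_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
logits = probe(resid).view(n_positions, n_squares, n_classes)
loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), board.flatten())
loss.backward(); opt.step()
```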
Mar 17, 2023
New blog post on an interpretability technique I helped develop when at @AnthropicAI: Attribution patching. This is a ludicrously fast gradient-based approximation to activation patching - in three passes, you could patch all 4.7M of GPT-3's neurons!
alignmentforum.org/posts/gtLLBhzQ…

Activation patching/causal tracing was introduced by @jesse_vig and used in @davidbau & @mengk20's ROME and @kevrowan's IOI work. The technique "diffs" two prompts, identical apart from a key detail - patching individual activations from one to the other isolates the key circuits.
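The approximation itself is just a first-order Taylor expansion around the corrupted run: the effect of patching an activation is estimated as (clean_act - corrupt_act) · d(metric)/d(act), so one forward and one backward pass score every activation at once. A toy sketch of the mechanics (my illustration, not the post's code):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
clean_x, corrupt_x = torch.randn(1, 16), torch.randn(1, 16)
acts, grads = {}, {}

def fwd_hook(name):
    def hook(mod, inp, out):
        acts[name] = out
        out.register_hook(lambda g: grads.__setitem__(name, g))  # grab d(metric)/d(act)
    return hook

for name, mod in model.named_children():
    mod.register_forward_hook(fwd_hook(name))

# Clean run: just record activations (we never backprop through this run).
model(clean_x)
clean_acts = {k: v.detach().clone() for k, v in acts.items()}

# Corrupted run: record activations and gradients of the metric.
metric = model(corrupt_x)[0, 0]        # in real use, e.g. a logit difference
metric.backward()

for name in clean_acts:
    # Linear estimate of "metric if this activation were patched from clean".
    attribution = ((clean_acts[name] - acts[name]) * grads[name]).sum().item()
    print(f"{name}: estimated effect of patching = {attribution:.3f}")
```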
Jan 21, 2023
Excited to announce that our work, Progress Measures for Grokking via Mechanistic Interpretability, has been accepted as a spotlight at ICLR 23! (despite being rejected from arXiv twice!)
This was significantly refined from my prior work, thoughts in 🧵
arxiv.org/abs/2301.05217

We trained a transformer to grok modular addition and reverse-engineered it. We found that it had learned an algorithm based on a Fourier transform and trig identities, so cleanly that we can read it off the weights!

I did not expect this algorithm! I found it by reverse-engineering.
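A tiny numerical check of the flavour of algorithm described: the logit for answer c is a sum of cos(w(a+b-c)) over a few key frequencies, which is maximised exactly at c = (a+b) mod p via the identity cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc). The specific frequencies below are made up; each trained model picks its own.

```python
import numpy as np

p = 113
freqs = 2 * np.pi * np.array([14, 35, 41, 52, 3]) / p   # hypothetical key frequencies
a, b = 47, 92
c = np.arange(p)
logits = np.cos(freqs[:, None] * (a + b - c)).sum(0)
assert logits.argmax() == (a + b) % p
print(logits.argmax(), (a + b) % p)   # both 26
```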
Dec 31, 2022
Great question! In all honesty, we just don't really know. I consider this an open scientific question, and really want to get more data on it! Some thoughts: 🧵

IMO, by far the strongest success here is the induction heads work - see my thread about how wild those results are & my explainer. We found a simple circuit in a 2L attn-only model, which recurs in every model studied (up to 13B), and plays a crucial role.
dynalist.io/d/n2ZWtnoYHrU1…
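To make "recurs in every model studied" concrete, here is roughly how the standard induction-head check works: feed a repeated random sequence and score each head by how much it attends from a token back to the token just after that token's previous occurrence. A sketch assuming TransformerLens's usual API; the 0.4 threshold and sizes are arbitrary.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
seq_len, batch = 50, 4
rand = torch.randint(0, model.cfg.d_vocab, (batch, seq_len))
tokens = torch.cat([rand, rand], dim=1)              # [A B C ... A B C ...]
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]                # [batch, head, query, key]
    # attention from second-half tokens back to (previous occurrence + 1),
    # i.e. the diagonal offset by -(seq_len - 1)
    diag = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)[..., 1:]
    score = diag.mean(dim=(0, -1))
    for head, s in enumerate(score.tolist()):
        if s > 0.4:
            print(f"L{layer}H{head}: induction score {s:.2f}")
```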
Dec 27, 2022
New walkthrough: Toy Models of Superposition! We read through the paper, discuss the high-level intuitions, limitations, and takeaways of this great @AnthropicAI paper. I'm extremely impressed by the rich insights from so simple a model! (why tetrahedra?!)
We mostly focused on the conceptual takeaways in this video; let me know if you want a part 2 where we finish going through the details of the paper!

If you haven't come across the work before, check out Anthropic's thread summarising the paper.
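To give a flavour of how simple the model is: n sparse features get squeezed through m < n dimensions and recovered with a ReLU, and superposition shows up in the geometry of W's columns (antipodal pairs, tetrahedra, ...). A sketch of the setup (my own, not Anthropic's code; sizes and sparsity are arbitrary):

```python
import torch

n_feats, d_hidden, sparsity = 5, 2, 0.9
importance = 0.9 ** torch.arange(n_feats)             # earlier features matter more

W = torch.nn.Parameter(torch.randn(d_hidden, n_feats) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_feats))
opt = torch.optim.Adam([W, b], lr=1e-2)

for _ in range(5_000):
    # sparse synthetic features: each is nonzero with probability 1 - sparsity
    x = torch.rand(1024, n_feats) * (torch.rand(1024, n_feats) > sparsity)
    x_hat = torch.relu(x @ W.T @ W + b)                # compress then reconstruct
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(W.data.T @ W.data)   # off-diagonal structure = features in superposition
```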
Nov 22, 2022
New paper walkthrough! @charles_irl and I read through, discuss and share intuitions for In-Context Learning and Induction Heads by @catherineols, @nelhage, me and @ch402. This is probably the most surprising paper I've been involved in; a 🧵 of takeaways:
This video should be friendly to people who haven't read the paper! Let me know if you want a more in-the-weeds part 2. And if you enjoy this, check out my walkthrough of the prequel paper, A Mathematical Framework for Transformer Circuits!
Nov 2, 2022
@kevrowan's paper is one of the coolest interp papers I've seen in a while, super excited about this line of work! They reverse-engineer a 26-head circuit in GPT-2 Small, with 7 types of head and 4 layers composing. And @kevrowan is still in high school!! A 🧵 of my takeaways:

1: Deepest takeaway: This is possible! This is another data point that language model behaviour follows an interpretable algorithm, and we can localise where it happens. If @ch402's circuits agenda is wrong, then much of my research is useless, so it's nice to get more evidence!
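For context, the indirect object identification (IOI) task and its metric look roughly like this (a sketch assuming TransformerLens's standard API; the prompt and names are just an example):

```python
# Given "When Mary and John went to the store, John gave a drink to", the model
# should prefer " Mary" over " John"; the circuit is found by tracking this
# logit difference under interventions.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When Mary and John went to the store, John gave a drink to"
logits = model(prompt)[0, -1]                  # logits at the final position
io = model.to_single_token(" Mary")
s = model.to_single_token(" John")
print("logit diff (IO - S):", (logits[io] - logits[s]).item())   # positive if correct
```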
Oct 14, 2022
I think A Mathematical Framework for Transformer Circuits is the most awesome paper I've been involved in, but it's very long and dense, so I think a lot of people struggle with it. As an experiment, you can watch me read through the paper and give takes!

I try to clarify points that are confusing, point to bits that are or are not worth the effort of engaging with deeply, explain bits I think are particularly exciting, and generally convey what I hope people take away from the paper!
Sep 15, 2022
Great post by @aslvrstn! My main takeaway: Floating Points make Linear Algebra a leaky abstraction for deep learning

Some context in 🧵 1/
aslvrstn.com/posts/transfor…

In @Tim_Dettmers' excellent work on running language models in int8, the part I found most confusing and surprising was that he found emergent features in the residual stream in the STANDARD basis (i.e. the representation of the residual stream as floating points). 2/
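A quick way to see those standard-basis features for yourself: look for residual stream *coordinates* whose magnitudes are consistently huge across tokens, which is what breaks naive int8 quantisation. My illustration, assuming TransformerLens's usual API; the ~6.0 cutoff is the rough outlier magnitude threshold from the int8 work, and the layer choice is arbitrary.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("The quick brown fox jumps over the lazy dog.")
resid = cache["resid_post", 6][0]                   # [seq, d_model] at a middle layer
per_dim_max = resid.abs().max(dim=0).values         # max |activation| per coordinate
outliers = (per_dim_max > 6.0).nonzero().flatten()  # candidate outlier dimensions
print("outlier dimensions:", outliers.tolist())
```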
Aug 15, 2022
I've spent the past few months exploring @OpenAI's grokking result through the lens of mechanistic interpretability. I fully reverse-engineered the modular addition model, and looked at what it does when training. So what's up with grokking? A 🧵... (1/17)
alignmentforum.org/posts/N6WM6hs7…

Takeaway 1: There's a deep relationship between grokking and phase changes. Phase changes are abrupt changes in capabilities during training, like we see when training a 2L attn-only transformer to predict repeated subsequences. (2/17)