Excited to announce that our work, Progress Measures for Grokking via Mechanistic Interpretability, has been accepted as a spotlight at ICLR 2023! (Despite being rejected from arXiv twice!)
This is a significantly refined version of my prior work; thoughts in 🧵 arxiv.org/abs/2301.05217
We trained a transformer to grok modular addition and reverse-engineered it. We found that it had learned an algorithm based on a Fourier transform and trig identities, so cleanly that we can read it off the weights!
I did not expect this algorithm at all! I only found it by reverse-engineering the weights.
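To give a flavour, here's a minimal sketch of the algorithm (illustrative code, not the trained model's actual weights; the key frequencies below are made up): embed a and b as cos/sin waves at a few key frequencies, combine them with the trig identity for cos(w(a+b)), and score each candidate answer c by cos(w(a+b-c)), which peaks exactly at c = (a+b) mod p.

```python
import numpy as np

# Illustrative sketch of the algorithm we read off the weights (not the real model):
# pick a few key frequencies, embed a and b as cos/sin waves, combine them with the
# trig identity cos(w(a+b)) = cos(wa)cos(wb) - sin(wa)sin(wb), and score each
# candidate answer c by cos(w(a+b-c)), which is maximised at c = (a+b) mod p.
p = 113
key_freqs = [17, 25, 32]  # hypothetical; the real model picks its own handful of frequencies

def mod_add_logits(a: int, b: int) -> np.ndarray:
    c = np.arange(p)
    logits = np.zeros(p)
    for k in key_freqs:
        w = 2 * np.pi * k / p
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)  # cos(w(a+b))
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)  # sin(w(a+b))
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)               # cos(w(a+b-c))
    return logits

assert mod_add_logits(47, 92).argmax() == (47 + 92) % p
```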
Grokking is hard to study because there are two solutions that both fit the training data: memorising and generalising. On the train set they look the same! But once we understand the model, they can be disentangled:
Restricted loss shows ONLY the generalising performance
Excluded loss shows ONLY the memorising performance
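Concretely, here's one way these could be operationalised (a hedged sketch, not the paper's exact implementation; key_freqs is whatever handful of frequencies you read off the trained model): project the logits onto the cos/sin directions of the key frequencies for the restricted loss, and subtract exactly that projection for the excluded loss.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of one way to compute the two losses (not the paper's exact code):
# keep only the logit components along the cos/sin directions of the key frequencies
# ("restricted"), or remove exactly those components ("excluded").
def key_freq_directions(p: int, key_freqs: list[int]) -> torch.Tensor:
    """Orthonormal cos/sin directions over the p output logits, one pair per key frequency."""
    c = torch.arange(p, dtype=torch.float32)
    dirs = []
    for k in key_freqs:
        w = 2 * torch.pi * k / p
        for wave in (torch.cos(w * c), torch.sin(w * c)):
            dirs.append(wave / wave.norm())
    return torch.stack(dirs)  # [2 * len(key_freqs), p]

def restricted_and_excluded_loss(logits, labels, p, key_freqs):
    dirs = key_freq_directions(p, key_freqs).to(logits.device)
    fourier_part = (logits @ dirs.T) @ dirs                    # projection onto key-frequency subspace
    restricted = F.cross_entropy(fourier_part, labels)         # generalising circuit only
    excluded = F.cross_entropy(logits - fourier_part, labels)  # memorisation only
    return restricted, excluded
```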
So what's behind grokking?
Three phases of training:
1. Memorisation
2. Circuit formation: the model smoothly TRANSITIONS from memorising to generalising
3. Cleanup: removing the memorised solution
Test performance needs a general circuit AND no memorisation, so grokking occurs at cleanup!
To check that these results are robust and not cherry-picked, we train a range of models across architectures, random seeds, and moduli, and see that our algorithm and the predicted phases of training are consistent.
Frontier models often show unexpected emergent behaviour, like arithmetic! Our work demonstrates a new approach: progress measures derived from mechanistic explanations. If we know what we're looking for, even in a toy setting, we can design metrics to track the hidden progress.
The broader vision of this work is to apply mech interp to the science of deep learning. Neural networks are full of mysteries, but CAN be understood if we try hard enough. What further questions can be demystified by distilling simple examples? I'd love this for lottery tickets!
To my knowledge, this is one of the first papers from the mech interp community at a top ML conference, and I hope it's the start of many more! There's a lot of work to be done, and I'd love to see more engagement with academia.
Check out @justanotherlaw's thread for what's new from the original version
And check out our website for interactive versions of the main figures (I will hopefully make the formatting less jank at some point) progress-measures-grokking.io
Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training costs limit research
Announcing Gemma Scope: An open suite of SAEs on every layer & sublayer of Gemma 2 2B & 9B! We hope to enable even more ambitious work
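For anyone new to SAEs, here's a toy sketch of the core idea (vanilla ReLU + L1; the actual Gemma Scope SAEs use a JumpReLU activation and are trained at far larger scale): an overcomplete autoencoder that reconstructs model activations through a sparse bottleneck.

```python
import torch
import torch.nn as nn

# Minimal toy SAE (ReLU + L1 sparsity penalty), purely to illustrate the idea;
# the actual Gemma Scope SAEs use JumpReLU and are trained at much larger scale.
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_sae)  # activations -> sparse feature activations
        self.W_dec = nn.Linear(d_sae, d_model)  # sparse features -> reconstructed activations

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.W_enc(acts))
        recon = self.W_dec(feats)
        return recon, feats

sae = SparseAutoencoder(d_model=2304, d_sae=16384)  # 2304 matches Gemma 2 2B's residual stream
acts = torch.randn(8, 2304)                          # stand-in for cached model activations
recon, feats = sae(acts)
loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + sparsity
loss.backward()
```

The point of Gemma Scope is that this training has already been done for you, on every layer & sublayer of the models.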
Want to learn more? @neuronpedia have made a gorgeous interactive demo walking you through what Sparse Autoencoders are, and what Gemma Scope can do.
If this demo could be built pre-launch, I'm excited to see what the community will do with Gemma Scope now! neuronpedia.org/gemma-scope
A challenge: What kind of rich, beautiful features can you find inside Gemma Scope, using the demo? Here's a feature we found that seems to have something to do with idioms:
My first @GoogleDeepMind project: How do LLMs recall facts?
Early MLP layers act as a lookup table, with significant superposition! They recognise entities and produce their attributes as directions. We suggest viewing fact recall as a black box making "multi-token embeddings"
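To make the framing concrete, here's a hedged sketch of what "use the early layers as a black box" could look like (the model, prompt, and layer cutoff are all illustrative choices, not our exact setup): grab the hidden state after the first few layers at the entity's final token and treat that vector as the entity's embedding.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged sketch of the "multi-token embedding" framing: run the model, then treat
# the hidden state after the first few layers, at the entity's final token, as a
# black-box embedding of that entity. Model, prompt, and layer cutoff are illustrative.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Fact: Michael Jordan plays the sport of"
entity = " Michael Jordan"
inputs = tok(prompt, return_tensors="pt")
# Index of the entity's final token: tokenise the prompt up to the end of the entity.
entity_end = len(tok(prompt[: prompt.index(entity) + len(entity)])["input_ids"]) - 1

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

layer_cutoff = 5  # "early layers"; illustrative
multi_token_embedding = out.hidden_states[layer_cutoff][0, entity_end]  # [d_model] vector
```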
Our hope was to understand a circuit in superposition at the parameter level, but we failed at this. We carefully falsify several naive hypotheses, but fact recall seems pretty cursed. We can black box the lookup part, so this doesn't sink the mech interp agenda, but it's a blow.
Importantly, though we failed to understand *how* MLP neurons map tokens to attributes, we think that *once* the attributes are looked up, they are interpretable, and there's important work to be done (e.g. with Sparse Autoencoders) decoding them.
I recently did an open-source replication of @AnthropicAI's new dictionary learning paper, which was just published as a public comment! It's a great paper and I'm glad the results hold up.
I was particularly curious about how neuron-sparse the features are. Strikingly, I find that *most* (92%) of features are dense in the neuron basis! lesswrong.com/posts/fKuugaxt…
I also replicated the bizarre result of a lot of features being uninterpretable and "ultra-low frequency". I further found that, in my model, these all had the same encoder direction! Further, different autoencoders learn the same direction. I'm confused about what's going on.
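For reference, the check was essentially this (a sketch; the tensor names and frequency threshold are stand-ins for my actual setup): measure how often each feature fires, pick out the ultra-low-frequency ones, and compare their encoder directions pairwise with cosine similarity.

```python
import torch

# Sketch of the check (names and threshold are stand-ins for my actual setup):
# measure how often each SAE feature fires, select the "ultra-low frequency" ones,
# and compare their encoder directions pairwise via cosine similarity.
def ultra_low_freq_encoder_sims(feature_acts: torch.Tensor, W_enc: torch.Tensor,
                                freq_threshold: float = 1e-4) -> torch.Tensor:
    """feature_acts: [n_tokens, d_sae] SAE activations; W_enc: [d_sae, d_model] encoder rows."""
    firing_freq = (feature_acts > 0).float().mean(dim=0)       # [d_sae]
    rare = firing_freq < freq_threshold
    dirs = torch.nn.functional.normalize(W_enc[rare], dim=-1)  # unit-norm encoder directions
    return dirs @ dirs.T                                       # pairwise cosine similarities
```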
This paper's been doing the rounds, so I thought I'd give a mechanistic interpretability take on what's going on here!
The core intuition is that "When you see 'A is', output B" is implemented as an asymmetric look-up table, with an entry for A->B.
B->A would be a separate entry
The key question to ask with a mystery like this is: what algorithms does the model need to get the correct answer, and how can these be implemented in transformer weights? These are what get reinforced during fine-tuning.
The two hard parts of "A is B" are recognising the input tokens A (out of all possible input tokens) and connecting this to the action of outputting the tokens B (out of all possible output tokens). These are both hard!
Further, the A -> B look-up must happen on a single token position
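A toy illustration of why the look-up is asymmetric (a sketch with random directions, not real transformer weights): store A -> B as a rank-one entry mapping A's direction to B's direction. Querying with A retrieves B, but querying with B tells you nothing about A unless a separate B -> A entry was also stored.

```python
import torch

# Toy illustration of an asymmetric look-up table stored in weights (not a real model):
# a single rank-one entry maps A's direction to B's direction, and nothing maps B back to A.
d = 256
torch.manual_seed(0)
a_dir, b_dir = torch.randn(d), torch.randn(d)   # directions standing in for the entities A and B
a_dir, b_dir = a_dir / a_dir.norm(), b_dir / b_dir.norm()

W = torch.outer(b_dir, a_dir)                   # the single stored entry: A -> B

forward = W @ a_dir                             # querying with A...
backward = W @ b_dir                            # ...vs querying with B
print(torch.cosine_similarity(forward, b_dir, dim=0))   # ~1.0: A retrieves B
print(torch.cosine_similarity(backward, a_dir, dim=0))  # ~0.0: B does not retrieve A
```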
In recent work, @ke_li_2021 trained Othello-GPT to predict the next move in random legal games of Othello & found an emergent model of the board! Surprisingly, only non-linear probes worked. I found a linear representation, once you probe for which squares have the CURRENT PLAYER'S colour! lesswrong.com/s/nhGNHyJHbrof…
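Roughly what such a probe looks like (a hedged sketch; the sizes, activations, and board labels are placeholders for the real Othello-GPT residual stream and game states): one linear map per square, classifying each square as mine/theirs/empty rather than black/white/empty.

```python
import torch
import torch.nn as nn

# Hedged sketch of the linear probe (activations and labels are placeholders for the
# real Othello-GPT residual stream and board states): instead of black/white/empty,
# each square is labelled relative to the player to move: mine/theirs/empty.
d_model, n_squares, n_classes = 512, 64, 3  # illustrative sizes
probe = nn.Linear(d_model, n_squares * n_classes)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(resid: torch.Tensor, board_labels: torch.Tensor) -> float:
    """resid: [batch, d_model] residual stream at some layer;
    board_labels: [batch, n_squares] long tensor with values in {0, 1, 2}."""
    logits = probe(resid).view(-1, n_squares, n_classes)
    loss = nn.functional.cross_entropy(logits.permute(0, 2, 1), board_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```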
Why was it surprising that linear probes failed? To do mech interp we must really understand how models represent thoughts, and we often assume linear representations: features as directions. But this could be wrong; we don't have enough data on circuits to be confident!
The paper seemed like evidence against it: a genuinely non-linear representation! My findings show the linear representation hypothesis has predictive power and survived an attempt at falsification.
This was genuinely in doubt! Independently, @ch402 and @wattenberg pre-registered predictions in different directions here
New blog post on an interpretability technique I helped develop when at @AnthropicAI: Attribution patching. This is a ludicrously fast gradient-based approximation to activation patching - in three passes, you could patch all 4.7M of GPT-3's neurons! alignmentforum.org/posts/gtLLBhzQ…
Activation patching/causal tracing was introduced by @jesse_vig and used in @davidbau & @mengk20's ROME and @kevrowan's IOI work. The technique "diffs" two prompts that are identical apart from one key detail: patching individual activations from one run into the other isolates the circuits responsible for that detail.
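The core of the approximation fits in a few lines (a sketch with a generic hooked module and metric, not the post's exact code; it assumes the module's output is a single tensor and gradients are enabled): cache the clean activations, do one forward + backward pass on the corrupted prompt, and estimate every patch's effect with a first-order Taylor expansion, (clean - corrupt) · gradient.

```python
import torch

# Sketch of attribution patching on one hooked module (not the post's exact code).
# Activation patching reruns the model once per patch; attribution patching
# approximates every patch at once with a first-order Taylor expansion:
#   effect of patching  ≈  (clean_act - corrupt_act) * d(metric)/d(corrupt_act)
def attribution_patch(model, module, clean_input, corrupt_input, metric):
    cache = {}

    def save(name):
        def hook(_module, _inputs, output):
            cache[name] = output
            if name == "corrupt":
                output.retain_grad()  # keep the gradient w.r.t. the corrupted activation
        return hook

    # Clean run: just cache the activation, no gradients needed.
    handle = module.register_forward_hook(save("clean"))
    with torch.no_grad():
        model(clean_input)
    handle.remove()

    # Corrupted run: a single forward + backward pass gives gradients for every unit at once.
    handle = module.register_forward_hook(save("corrupt"))
    metric(model(corrupt_input)).backward()
    handle.remove()

    # Estimated effect of patching each unit's clean activation into the corrupted run.
    return (cache["clean"] - cache["corrupt"]).detach() * cache["corrupt"].grad
```

Doing this for every activation at once is where the "three passes" claim comes from: one clean forward, one corrupted forward, one backward.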
The parts I'm proudest of are the two sections on my conceptual frameworks for thinking about patching in general: why and when does it work, what limitations do these techniques have, and what might other approaches to mech interp look like in contrast? alignmentforum.org/posts/gtLLBhzQ…