Excited to announce that our work, Progress Measures for Grokking via Mechanistic Interpretability, has been accepted as a spotlight at ICLR 2023! (Despite being rejected from arXiv twice!)
This was significantly refined from my prior work; thoughts in the 🧵 below. arxiv.org/abs/2301.05217
We trained a transformer to grok modular addition and reverse-engineered it. We found that it had learned an algorithm based on a Fourier transform and trig identities, so cleanly that we can read it off the weights!
I did not expect this algorithm! I found it by reverse-engineering.
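For intuition, here's a minimal numeric sketch of that algorithm (the frequencies below are made up for illustration; the trained model learns its own handful of key frequencies). Combine cos/sin of a and b via trig identities, and the logit on each candidate answer c is cos(w(a + b - c)), which peaks exactly at c = (a + b) mod p:

```python
import numpy as np

p = 113                    # the modulus used in the paper
key_freqs = [17, 25, 32]   # hypothetical key frequencies; the real ones are learned

def logits(a, b):
    cs = np.arange(p)
    out = np.zeros(p)
    for k in key_freqs:
        w = 2 * np.pi * k / p
        # trig identities give cos(w(a+b)) and sin(w(a+b)) from terms that
        # depend only on a or only on b
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # the logit on answer c is cos(w(a + b - c)), maximal at c = (a+b) mod p
        out += cos_ab * np.cos(w * cs) + sin_ab * np.sin(w * cs)
    return out

a, b = 40, 99
print(np.argmax(logits(a, b)), (a + b) % p)  # both print 26
```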
Grokking is hard to study because there are two valid solutions on train: memorising + generalising. They look the same! But once the algorithm is understood, they can be disentangled (rough sketch below):
Restricted loss shows ONLY the generalising performance
Excluded loss shows ONLY the memorising performance
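Roughly, the idea is to split the logits into the part explained by the key Fourier frequencies and everything else, then evaluate train loss on each part separately. A simplified sketch of that idea (NOT the paper's exact construction, which works in the Fourier basis over the inputs; here I project over the output classes for brevity):

```python
import torch
import torch.nn.functional as F

def key_freq_part(logits, key_freqs, p):
    """Project logits (batch, p) onto cos/sin of the key frequencies over the output classes."""
    cs = torch.arange(p, dtype=torch.float32)
    part = torch.zeros_like(logits)
    for k in key_freqs:
        w = 2 * torch.pi * k / p
        for basis in (torch.cos(w * cs), torch.sin(w * cs)):
            basis = basis / basis.norm()
            part += (logits @ basis)[:, None] * basis[None, :]
    return part

def restricted_and_excluded_loss(logits, labels, key_freqs, p):
    key_part = key_freq_part(logits, key_freqs, p)
    restricted = F.cross_entropy(key_part, labels)          # keep ONLY the generalising part
    excluded = F.cross_entropy(logits - key_part, labels)   # remove it: what's left is memorisation
    return restricted, excluded
```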
So what's behind grokking?
Three phases of training:
1. Memorisation
2. Circuit formation: the model smoothly TRANSITIONS from memorising to generalising
3. Cleanup: removing the memorised solution
Test performance needs a general circuit AND no memorisation, so grokking occurs at cleanup!
To check that these results are robust and not cherry-picked, we train a range of models across architectures, random seeds, and modular bases, and see that our algorithm and the predicted phases of training are consistent.
Frontier models often show unexpected emergent behaviour, like arithmetic! Our work demonstrates a new approach: progress measures derived from mechanistic explanations. If we know what we're looking for, even in a toy setting, we can design metrics to track the hidden progress.
The broader vision of this work is to apply mech interp to the science of deep learning. Neural networks are full of mysteries, but CAN be understood if we try hard enough. What further questions can be demystified by distilling simple examples? I'd love this for lottery tickets!
To my knowledge, this is one of the first papers from the mech interp community at a top ML conference, and I hope it's the start of many more! There's a lot of work to be done, and I'd love to see more engagement with academia.
Check out @justanotherlaw's thread for what's new from the original version
And check out our website for interactive versions of the main figures (I will hopefully make the formatting less jank at some point) progress-measures-grokking.io
New paper walkthrough! @charles_irl and I read through, discuss, and share intuitions for In-Context Learning and Induction Heads by @catherineols, @nelhage, me and @ch402. This is probably the most surprising paper I've been involved in, a 🧵 of takeaways:
This video should be friendly to people who haven't read the paper! Let me know if you want a more in-the-weeds part 2. And if you enjoy this, check out my walkthrough of the prequel paper, A Mathematical Framework for Transformer Circuits!
So what IS an induction head? It's a type of transformer attention head that detects and continues repeated subsequences. It takes a simple circuit of two heads in different layers working together, and we can fully reverse-engineer how the model does it.
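If it helps, here's a toy, plain-Python imitation of the behaviour (not the actual circuit!): look back for an earlier occurrence of the current token, then predict whatever followed it last time.

```python
def induction_prediction(tokens):
    """Toy imitation of an induction head: find the most recent earlier
    occurrence of the current token (prefix matching) and predict the token
    that followed it (copying)."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan earlier positions, latest first
        if tokens[i] == current:
            return tokens[i + 1]              # continue the repeated subsequence
    return None

print(induction_prediction(["A", "B", "C", "A"]))  # -> "B"
```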
I think A Mathematical Framework for Transformer Circuits is the most awesome paper I've been involved in, but it's very long and dense, so I think a lot of people struggle with it. As an experiment, you can watch me read through the paper and give takes!
I try to clarify points that are confusing, point to bits that are or are not worth the effort of engaging with deeply, explain bits I think are particularly exciting, and generally convey what I hope people take away from the paper!
Sadly, it turns out I have a LOT to say about Transformer Circuits, so this turned into me monologuing for 3 hours... But I hope this was still useful! I'd love to get feedback - this was a lot of fun to make, but I have no idea whether this format is actually useful!
I've spent the past few months exploring @OpenAI's grokking result through the lens of mechanistic interpretability. I fully reverse-engineered the modular addition model and looked at what it does during training. So what's up with grokking? A 🧵... (1/17) alignmentforum.org/posts/N6WM6hs7…
Takeaway 1: There's a deep relationship between grokking and phase changes. A phase change is an abrupt change in capabilities during training, like we see when training a 2L attn-only transformer to predict repeated subsequences (2/17)
Phase changes turn into grokking if we regularise and use JUST enough data that the model still generalises - here's the same problem with 512 training data points (3/17)
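For anyone who wants the flavour of that regime, here's a hedged sketch of the setup (a tiny MLP stands in for the transformer, and the hyperparameters are illustrative, not the exact ones used): heavy weight decay plus a training set only just big enough to generalise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

p = 113  # modular addition base, as in the thread

# Tiny stand-in model (the actual experiments use a small transformer)
model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))

# All (a, b) pairs; keep only a small training split, e.g. 512 points,
# the regime where the phase change shows up as grokking
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
train_pairs = pairs[torch.randperm(len(pairs))[:512]]

def encode(batch):
    a, b = batch[:, 0], batch[:, 1]
    return torch.cat([F.one_hot(a, p), F.one_hot(b, p)], dim=1).float()

x_train = encode(train_pairs)
y_train = (train_pairs[:, 0] + train_pairs[:, 1]) % p

# Heavy weight decay is the regularisation that eventually drives cleanup
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(50_000):
    opt.zero_grad()
    loss = F.cross_entropy(model(x_train), y_train)
    loss.backward()
    opt.step()
    # (track train vs held-out accuracy here to see the delayed generalisation)
```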