Our hope was to understand a circuit in superposition at the parameter level, but we failed at this. We carefully falsified several naive hypotheses, but fact recall seems pretty cursed. We can black-box the lookup part, so this doesn't sink the mech interp agenda, but it's a blow.
https://twitter.com/ch402/status/1709998674087227859
And you can read the blog post here:
https://twitter.com/OwainEvans_UK/status/1705285631520407821
The key question to ask with a mystery like this about models is what algorithms are needed to get the correct answer, and how these can be implemented in transformer weights. These are what get reinforced when fine-tuning.
Why was it surprising that linear probes failed before? To do mech interp we must really understand how models represent thoughts, and we often assume linear representations - features as directions. But this could be wrong; we don't have enough data on circuits to be confident!
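To make the linear-representation assumption concrete: a linear probe just fits a linear classifier on cached activations, and if it succeeds the feature is (approximately) a direction. A minimal sketch, where the activations and labels are hypothetical stand-ins for real cached data:

```python
# Minimal linear probe sketch. `acts` (cached activations from some layer)
# and `labels` are hypothetical stand-ins, not real data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 768))      # stand-in for cached residual stream activations
labels = rng.integers(0, 2, size=1000)   # stand-in for the binary feature being probed

probe = LogisticRegression(max_iter=1000).fit(acts[:800], labels[:800])
print("probe accuracy:", probe.score(acts[800:], labels[800:]))
# High held-out accuracy suggests the feature is linearly represented;
# chance-level accuracy is evidence against a simple direction.
```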
Activation patching/causal tracing was introduced by @jesse_vig and used in @davidbau & @mengk20's ROME and @kevrowan's IOI work. The technique "diffs" two prompts, identical apart from a key detail - patching individual activations from one to the other isolates the key circuits.
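A minimal sketch of that diffing loop using TransformerLens; the prompts, model choice, and logit metric are illustrative, and the API details are from memory rather than from the original papers:

```python
# Activation patching sketch: run the corrupted prompt, but splice in the
# clean run's residual stream at one (layer, position) at a time.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

# Two prompts identical apart from a key detail (the repeated name)
clean = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")

_, clean_cache = model.run_with_cache(clean)
answer = model.to_single_token(" Mary")

def patch(resid, hook, pos):
    # Overwrite one position of the residual stream with the clean activation
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

n_layers, n_pos = model.cfg.n_layers, clean.shape[1]
results = torch.zeros(n_layers, n_pos)
for layer in range(n_layers):
    name = utils.get_act_name("resid_pre", layer)
    for pos in range(n_pos):
        logits = model.run_with_hooks(
            corrupt, fwd_hooks=[(name, lambda r, hook, p=pos: patch(r, hook, p))]
        )
        # Where patching restores the clean answer, that activation matters
        results[layer, pos] = logits[0, -1, answer].item()
```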
https://twitter.com/NeelNanda5/status/1559060507524403200
We trained a transformer to grok modular addition and reverse-engineered it. We found that it had learned a Fourier transform and trig identity based algorithm, so cleanly that we can read it off the weights!
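For intuition, the learned algorithm can be reproduced directly in numpy: embed a and b as waves at a few key frequencies, compose them with trig identities to get cos(w(a+b)) and sin(w(a+b)), then score each candidate answer c by cos(w(a+b-c)), which peaks exactly at c = (a+b) mod p. The specific frequencies below are made up; the grokked network picks its own handful.

```python
# Sketch of the Fourier / trig-identity algorithm the grokked network learned.
import numpy as np

p = 113                  # the modulus used in the work
freqs = [17, 25, 32]     # illustrative; a real grokked network learns its own key frequencies

def mod_add_logits(a, b):
    c = np.arange(p)
    logits = np.zeros(p)
    for k in freqs:
        w = 2 * np.pi * k / p
        # cos(w(a+b)) and sin(w(a+b)) via the angle-addition trig identities
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # Logit for answer c is cos(w(a+b-c)), maximised when c = (a+b) mod p
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return logits

assert mod_add_logits(47, 92).argmax() == (47 + 92) % p
```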
https://twitter.com/soniajoseph_/status/1609033710363361280
IMO, by far the strongest success here is the induction heads work - see my thread about how wild those results are & my explainer. We found a simple circuit in a 2L attn-only model, which recurs in every model studied (up to 13B), and plays a crucial role in in-context learning.
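The behavioural signature is easy to check yourself: on a random token sequence repeated twice, per-token loss should drop sharply on the second half, because induction heads predict [B] after [A] by matching the earlier [A][B]. A sketch via TransformerLens; the model choice is illustrative and the per-token-loss flag is from memory:

```python
# Induction signature sketch: loss on the repeat of a random sequence should
# be far below loss on the first occurrence.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

torch.manual_seed(0)
seq = torch.randint(1000, model.cfg.d_vocab, (1, 50))
repeated = torch.cat([seq, seq], dim=1)  # [A B C ... A B C ...]

# If this flag differs in your TransformerLens version, compute
# cross-entropy from the logits instead.
loss = model(repeated, return_type="loss", loss_per_token=True)
print("first half:", loss[0, :49].mean().item())   # random tokens: high loss
print("second half:", loss[0, 50:].mean().item())  # induction heads firing: much lower
```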
https://twitter.com/NeelNanda5/status/1595098727026286593
We got good insights studying tiny language models in A Mathematical Framework, notably finding induction heads (so interesting we wrote a whole paper on them!). I expect there's still much to learn! I've open-sourced 12 toy models: 1-4L attn-only/GELU/SoLU
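These toy models are loadable through TransformerLens; a sketch below, noting the alias is from memory and may not match the current model names exactly:

```python
# Load one of the open-sourced toy models. The alias "attn-only-2l" is from
# memory of the release and may differ in current TransformerLens versions.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("attn-only-2l")  # 2-layer attention-only
logits, cache = model.run_with_cache(model.to_tokens("Mr and Mrs Dursley"))
print(model.cfg.n_layers, model.cfg.n_heads)
```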
https://twitter.com/AnthropicAI/status/1570087876053942272
https://twitter.com/charles_irl/status/1595097774634274817
This video should be friendly to people who haven't read the paper! Let me know if you want a more in-the-weeds part 2. And if you enjoy this, check out my walkthrough of the prequel paper, A Mathematical Framework for Transformer Circuits!
https://twitter.com/NeelNanda5/status/1580782930304978944
https://twitter.com/kevrowan/status/1587601532639494146
1: Deepest takeaway: This is possible! This is another data point that language model behaviour follows an interpretable algorithm, and we can localise where it happens. If @ch402's circuits agenda is wrong, then much of my research is useless, so it's nice to get more evidence!
https://twitter.com/Tim_Dettmers/status/1559892888326049792