https://twitter.com/ch402/status/1709998674087227859
And you can read the blog post here:
https://twitter.com/OwainEvans_UK/status/1705285631520407821
The key question to ask with a mystery like this about models is what algorithms are needed to get the correct answer, and how these can be implemented in transformer weights. These are what get reinforced when fine-tuning.
https://twitter.com/NeelNanda5/status/1559060507524403200
We trained a transformer to grok modular addition and reverse-engineered it. We found that it had learned a Fourier Transform and trig identity based algorithm, so cleanly that we can read it off the weights!
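The algorithm the tweet describes can be sketched directly. The core trig identity is cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc) = cos(w(a+b-c)), which peaks exactly when c = (a+b) mod p; summing this over several frequencies makes the correct answer the argmax. A minimal sketch, assuming a modulus of p = 113 (as in the grokking work) and an illustrative set of frequencies (the trained model's actual key frequencies differ):

```python
import numpy as np

p = 113           # modulus used in the grokking setup
freqs = [1, 2, 5] # hypothetical key frequencies, for illustration only

def logits_for(a, b):
    """Score every candidate answer c via the Fourier / trig-identity algorithm."""
    c = np.arange(p)
    total = np.zeros(p)
    for k in freqs:
        w = 2 * np.pi * k / p
        # cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc) = cos(w(a+b-c)),
        # which is maximised exactly when c == (a + b) mod p
        total += np.cos(w * (a + b)) * np.cos(w * c) \
               + np.sin(w * (a + b)) * np.sin(w * c)
    return total

# the argmax over logits recovers modular addition
assert logits_for(17, 40).argmax() == (17 + 40) % p
```

Each frequency term constructively interferes only at the correct residue, so even a handful of frequencies suffices to make the argmax unambiguous.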
https://twitter.com/soniajoseph_/status/1609033710363361280
IMO, by far the strongest success here is the induction heads work - see my thread about how wild those results are & my explainer. We found a simple circuit in a 2L attn-only model, which recurs in every model studied (up to 13B), and plays a crucial role
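The circuit the tweet refers to implements a simple in-context pattern: on seeing [A][B] ... [A], predict [B] by attending back to the token after the previous occurrence of [A]. A minimal sketch of that algorithm in plain Python (the function name and toy token list are illustrative, not from the paper):

```python
def induction_predict(tokens):
    """Induction-head algorithm: find the most recent earlier occurrence of the
    current token and predict the token that followed it ([A][B]...[A] -> [B])."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards over the prefix
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence: nothing to copy

# "The cat sat on the mat. The" -> predicts "cat"
assert induction_predict(["The", "cat", "sat", "on", "the", "mat", ".", "The"]) == "cat"
```

In the actual two-layer circuit this is split across heads: a previous-token head in layer 1 writes each token's predecessor into the residual stream, and the induction head in layer 2 matches on it and copies the successor.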
https://twitter.com/NeelNanda5/status/1595098727026286593
https://twitter.com/AnthropicAI/status/1570087876053942272
https://twitter.com/charles_irl/status/1595097774634274817
This video should be friendly to people who haven't read the paper! Let me know if you want a more in-the-weeds part 2. And if you enjoy this, check out my walkthrough of the prequel paper, A Mathematical Framework for Transformer Circuits!
https://twitter.com/NeelNanda5/status/1580782930304978944
https://twitter.com/kevrowan/status/1587601532639494146
1: Deepest takeaway: This is possible! This is another data point that language model behaviour follows an interpretable algorithm, and we can localise where it happens. If @ch402's circuits agenda is wrong, then much of my research is useless, so it's nice to get more evidence!
https://twitter.com/Tim_Dettmers/status/1559892888326049792