https://twitter.com/ch402/status/1709998674087227859
And you can read the blog post here:
https://twitter.com/OwainEvans_UK/status/1705285631520407821
The key question to ask with a mystery like this about models is what algorithms are needed to get the correct answer, and how these can be implemented in transformer weights. These are what get reinforced when fine-tuning.
https://twitter.com/NeelNanda5/status/1559060507524403200
We trained a transformer to grok modular addition and reverse-engineered it. We found that it had learned a Fourier Transform and trig identity based algorithm, so cleanly that we can read it off the weights!
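The Fourier/trig-identity algorithm that thread describes can be sketched in plain NumPy. This is an illustrative reconstruction, not the model's weights: the frequencies `ks` here are arbitrary choices, whereas the trained network learns its own sparse set of key frequencies.

```python
import numpy as np

p = 113           # modulus used in the grokking setup
ks = [1, 5, 7]    # illustrative frequencies; the real model learns its own sparse set

def mod_add_fourier(a, b):
    """Compute (a + b) mod p via Fourier features and trig identities."""
    c = np.arange(p)          # all candidate answers
    logits = np.zeros(p)
    for k in ks:
        w = 2 * np.pi * k / p
        # Trig identities recover cos/sin of w*(a+b) from features of a and b:
        # cos(w(a+b)) = cos(wa)cos(wb) - sin(wa)sin(wb)
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # cos(w(a+b-c)) peaks (=1) exactly when c == (a+b) mod p
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    # Summing over frequencies makes the peak at the correct answer dominate
    return int(np.argmax(logits))

assert mod_add_fourier(50, 90) == (50 + 90) % p
```

The point of the construction is that every step is built from sums of products of sines and cosines, which is why the algorithm is readable directly off the learned embedding and unembedding weights.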
https://twitter.com/soniajoseph_/status/1609033710363361280
IMO, by far the strongest success here is the induction heads work - see my thread about how wild those results are & my explainer. We found a simple circuit in a 2L attn-only model, which recurs in every model studied (up to 13B), and plays a crucial role.
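The induction-head circuit mentioned above implements a simple pattern-completion rule: on a sequence containing `... A B ... A`, attend back to the earlier occurrence of `A` and copy the token that followed it, predicting `B`. A minimal sketch of that rule as plain code (the helper name is my own, not from the thread):

```python
def induction_predict(tokens):
    """Sketch of the induction-head rule: find the most recent earlier
    occurrence of the current token and predict the token that followed it
    ("A B ... A" -> predict "B"). Returns None if there is no earlier match."""
    cur = tokens[-1]
    # Scan backwards over earlier positions for a previous occurrence of cur
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == cur:
            return tokens[i + 1]
    return None

assert induction_predict(["A", "B", "C", "A"]) == "B"
```

In the actual 2-layer circuit this is done by a "previous token" head composing with the induction head's attention pattern, rather than an explicit scan, but the input-output behavior matches this rule.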
https://twitter.com/NeelNanda5/status/1595098727026286593
https://twitter.com/AnthropicAI/status/1570087876053942272
https://twitter.com/charles_irl/status/1595097774634274817
This video should be friendly to people who haven't read the paper! Let me know if you want a more in-the-weeds part 2. And if you enjoy this, check out my walkthrough of the prequel paper, A Mathematical Framework for Transformer Circuits!
https://twitter.com/NeelNanda5/status/1580782930304978944
https://twitter.com/kevrowan/status/1587601532639494146
https://twitter.com/Tim_Dettmers/status/1559892888326049792