https://twitter.com/AnthropicAI/status/1792935506587656625
I've been working on interpretability for more than a decade, significantly motivated by concerns about safety. But it's always been this aspirational goal – I could tell a story for how this work might someday help, but it was far off.

https://twitter.com/AnthropicAI/status/1709986949711200722
Well trained sparse autoencoders (scale matters!) can decompose a one-layer model into very nice, interpretable features.

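To make the idea concrete, here is a minimal sketch of the general sparse-autoencoder technique the tweet refers to: an overcomplete encoder/decoder trained to reconstruct model activations with an L1 penalty that encourages sparse, interpretable feature activations. The layer sizes, penalty coefficient, and single-step training loop below are illustrative assumptions, not the setup used in the linked work.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with non-negative, sparsity-penalized features."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)
        return x_hat, f

# Hypothetical sizes and hyperparameters, chosen only for illustration.
d_model, d_features = 512, 4096          # dictionary is deliberately overcomplete
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

acts = torch.randn(1024, d_model)        # stand-in for real MLP/residual-stream activations
x_hat, f = sae(acts)
loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()

opt.zero_grad()
loss.backward()
opt.step()
```

In practice one would train on large batches of activations collected from the model being studied; the reconstruction term keeps the dictionary faithful while the L1 term pushes each input to be explained by only a few active features.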
https://twitter.com/AToliasLab/status/1659236747618680836
When we were working on vision models, we constantly found that features we discovered, such as curve detectors (distill.pub/2020/circuits/…) and multimodal neurons (distill.pub/2021/multimoda…), had existing parallels in neuroscience.

https://twitter.com/zeeweeding/status/1638257954398011394
(Caveat that I'm not super well read in neuroscience. When working on high-low frequency detectors we asked a number of neuroscientists if they'd seen such neurons before, and they hadn't. But it's very possible something was missed!)

https://twitter.com/AnthropicAI/status/1611045993516249088
Historically, I think I've often been tempted to think of overfit models as pathological things that need to be trained better in order for us to interpret them.

https://twitter.com/NeelNanda5/status/1609283649119322114
For background, there's a famous series of books in math titled "Counterexamples in X" (e.g. the famous "Counterexamples in Topology") which offer examples of mathematical objects with unusual properties.

https://twitter.com/AnthropicAI/status/1570087876053942272
Firstly: Superposition by itself is utterly wild if you take it seriously.

https://twitter.com/ch402/status/1564631228166201345
(More on this below, but please don't let this thread discourage you from reaching out if you think we might be a good fit! I intend to respond to every sincere message I receive, and if we aren't a fit, I'll try to think about whether I know someone who's a better match for you.)

https://twitter.com/AnthropicAI/status/1541468008249364481
Our team's goal is to reverse engineer neural networks into human understandable computer programs – see our work at transformer-circuits.pub and some previous work we're inspired by at distill.pub/2020/circuits/.

https://twitter.com/banburismus_/status/1532747777280593920
I used to really want ML to be about complex math and clever proofs. But I've gradually come to think this is really the wrong aesthetic to bring.

https://twitter.com/saykay/status/1480022995393413120
From the related work section of Habbema et al (ncbi.nlm.nih.gov/pmc/articles/P…):