Nora Belrose
AI, philosophy, spirituality. Head of interpretability research at @AiEleuther, but tweets are my own views, not Eleuther's.
Feb 4
MLPs and GLUs are hard to interpret, but they make up most transformer parameters.

Linear and quadratic functions are easier to interpret.

We show how to convert MLPs & GLUs into polynomials in closed form, allowing you to use SVD and direct inspection for interpretability 🧵

We use SVD on linearized MLPs to generate adversarial examples, which transfer back to the original MLP!

This shows that our approximants are capturing the out-of-distribution behavior of the original network.

How is that possible?
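A minimal sketch of the pipeline, assuming a toy GELU MLP and an empirical least-squares fit over Gaussian inputs standing in for the paper's closed-form conversion; the model, sample count, and perturbation sizes below are illustrative choices, not the paper's code:

```python
# Sketch: linearize an MLP by least squares, then use SVD of that linear
# map to pick an adversarial input direction and check it transfers back
# to the original MLP. (Toy model; empirical fit in place of the closed form.)
import torch

torch.manual_seed(0)
d_in, d_hidden, d_out = 32, 128, 32
mlp = torch.nn.Sequential(
    torch.nn.Linear(d_in, d_hidden),
    torch.nn.GELU(),
    torch.nn.Linear(d_hidden, d_out),
)

with torch.no_grad():
    # Degree-1 approximant: solve min_A ||X A - mlp(X)||^2 over Gaussian inputs.
    X = torch.randn(50_000, d_in)
    A = torch.linalg.lstsq(X, mlp(X)).solution  # (d_in, d_out)

    # SVD of the linearization: the top *left* singular vector of A is the
    # unit input direction that moves the linear model's output the most.
    U, S, Vh = torch.linalg.svd(A)
    delta = U[:, 0]

    # Transfer check: the same direction also moves the *original* MLP,
    # typically far more than a random unit direction does.
    x0 = torch.randn(1, d_in)
    y0 = mlp(x0)
    for eps in (0.5, 2.0, 8.0):
        svd_gap = (mlp(x0 + eps * delta) - y0).norm().item()
        rand_dir = torch.nn.functional.normalize(torch.randn(1, d_in), dim=-1)
        rand_gap = (mlp(x0 + eps * rand_dir) - y0).norm().item()
        print(f"eps={eps:4.1f}  svd dir: {svd_gap:.3f}  random dir: {rand_gap:.3f}")
```

Running this prints the output displacement along the SVD direction versus a random direction at several perturbation sizes; the SVD direction should dominate at every scale, which is the transfer effect the thread describes.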
Jun 7, 2023
Ever wanted to mindwipe an LLM?

Our method, LEAst-squares Concept Erasure (LEACE), provably erases all linearly-encoded information about a concept from neural net activations. It does so surgically, inflicting minimal damage to other concepts. 🧵
arxiv.org/abs/2306.03819

Concept erasure is an important tool for fairness, letting us prevent features like race or gender from being used by classifiers when inappropriate, and for interpretability, letting us study the causal impact of features on a model's behavior.
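The eraser itself has a closed form, r(x) = x - W^+ P W (x - E[X]), where W whitens X and P orthogonally projects onto the whitened X-Z cross-covariance. Below is a minimal NumPy sketch of that formula; the helper name and the synthetic demo are mine, and the official, numerically careful implementation is EleutherAI's concept-erasure repo (github.com/EleutherAI/concept-erasure):

```python
# Minimal sketch of the LEACE closed form:
#   r(x) = x - W^+ P W (x - E[X])
# where W = Sigma_XX^{-1/2} whitens X and P projects onto colspace(W @ Sigma_XZ).
import numpy as np

def fit_leace(X, Z, tol=1e-8):
    """Fit an eraser r(.) that removes all linearly decodable info about Z."""
    X = np.asarray(X, dtype=np.float64)
    Z = np.asarray(Z, dtype=np.float64).reshape(len(X), -1)
    mu = X.mean(axis=0)
    Xc, Zc = X - mu, Z - Z.mean(axis=0)

    # Whitening W = Sigma_XX^{-1/2} and its pseudo-inverse, via eigendecomposition.
    vals, vecs = np.linalg.eigh(Xc.T @ Xc / len(X))
    keep = vals > tol
    W      = (vecs[:, keep] * vals[keep] ** -0.5) @ vecs[:, keep].T
    W_pinv = (vecs[:, keep] * vals[keep] **  0.5) @ vecs[:, keep].T

    # Orthogonal projector onto the whitened X-Z cross-covariance directions.
    U, s, _ = np.linalg.svd(W @ (Xc.T @ Zc / len(X)), full_matrices=False)
    U = U[:, s > tol]
    M = W_pinv @ U @ U.T @ W  # the low-rank map that gets subtracted

    return lambda x: np.asarray(x) - (np.asarray(x) - mu) @ M.T

# Demo: a binary concept linearly leaked into every feature of X.
rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(2000, 1))
X = rng.normal(size=(2000, 16)) + 2.0 * Z
Xe = fit_leace(X, Z)(X)

# After erasure, the cross-covariance with Z is zero up to float error,
# so no linear probe can beat a constant predictor.
cov = (Xe - Xe.mean(0)).T @ (Z - Z.mean(0)) / len(Z)
print(np.abs(cov).max())  # ~1e-15
```

The demo checks the cross-covariance between the erased activations and Z, which the closed form drives to (numerically) zero; zero cross-covariance is the condition underlying the linear-guardedness guarantee the thread refers to.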