Nora Belrose
AI, philosophy, spirituality. Head of interpretability research at @AiEleuther, but tweets are my own views, not Eleuther's.
Feb 4
MLPs and GLUs are hard to interpret, but they make up most transformer parameters.

Linear and quadratic functions are easier to interpret.

We show how to convert MLPs & GLUs into polynomials in closed form, allowing you to use SVD and direct inspection for interpretability 🧵

We use SVD on linearized MLPs to generate adversarial examples, which transfer back to the original MLP!

This shows that our approximants are capturing the out-of-distribution behavior of the original network.

How is that possible?
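A minimal sketch of the pipeline, assuming a toy GELU MLP and an empirical least-squares fit over Gaussian inputs standing in for the paper's closed-form conversion; the model, sample count, and perturbation sizes below are illustrative choices, not the paper's code:

```python
# Sketch: linearize an MLP by least squares, then use SVD of that linear
# map to pick an adversarial input direction and check it transfers back
# to the original MLP. (Toy model; empirical fit in place of the closed form.)
import torch

torch.manual_seed(0)
d_in, d_hidden, d_out = 32, 128, 32
mlp = torch.nn.Sequential(
    torch.nn.Linear(d_in, d_hidden),
    torch.nn.GELU(),
    torch.nn.Linear(d_hidden, d_out),
)

with torch.no_grad():
    # Degree-1 approximant: solve min_A ||X A - mlp(X)||^2 over Gaussian inputs.
    X = torch.randn(50_000, d_in)
    A = torch.linalg.lstsq(X, mlp(X)).solution  # (d_in, d_out)

    # SVD of the linearization: the top *left* singular vector of A is the
    # unit input direction that moves the linear model's output the most.
    U, S, Vh = torch.linalg.svd(A)
    delta = U[:, 0]

    # Transfer check: the same direction also moves the *original* MLP,
    # typically far more than a random unit direction does.
    x0 = torch.randn(1, d_in)
    y0 = mlp(x0)
    for eps in (0.5, 2.0, 8.0):
        svd_gap = (mlp(x0 + eps * delta) - y0).norm().item()
        rand_dir = torch.nn.functional.normalize(torch.randn(1, d_in), dim=-1)
        rand_gap = (mlp(x0 + eps * rand_dir) - y0).norm().item()
        print(f"eps={eps:4.1f}  svd dir: {svd_gap:.3f}  random dir: {rand_gap:.3f}")
```

Running this prints the output displacement along the SVD direction versus a random direction at several perturbation sizes; the SVD direction should dominate at every scale, which is the transfer effect the thread describes.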
Jun 7, 2023
Ever wanted to mindwipe an LLM?

Our method, LEAst-squares Concept Erasure (LEACE), provably erases all linearly-encoded information about a concept from neural net activations. It does so surgically, inflicting minimal damage to other concepts. 🧵
arxiv.org/abs/2306.03819

Concept erasure is an important tool for fairness, letting us prevent features like race or gender from being used by classifiers when inappropriate, and for interpretability, letting us study the causal impact of features on a model's behavior.
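The eraser itself has a closed form, r(x) = x - W^+ P W (x - E[X]), where W whitens X and P orthogonally projects onto the whitened X-Z cross-covariance. Below is a minimal NumPy sketch of that formula; the helper name and the synthetic demo are mine, and the official, numerically careful implementation is EleutherAI's concept-erasure repo (github.com/EleutherAI/concept-erasure):

```python
# Minimal sketch of the LEACE closed form:
#   r(x) = x - W^+ P W (x - E[X])
# where W = Sigma_XX^{-1/2} whitens X and P projects onto colspace(W @ Sigma_XZ).
import numpy as np

def fit_leace(X, Z, tol=1e-8):
    """Fit an eraser r(.) that removes all linearly decodable info about Z."""
    X = np.asarray(X, dtype=np.float64)
    Z = np.asarray(Z, dtype=np.float64).reshape(len(X), -1)
    mu = X.mean(axis=0)
    Xc, Zc = X - mu, Z - Z.mean(axis=0)

    # Whitening W = Sigma_XX^{-1/2} and its pseudo-inverse, via eigendecomposition.
    vals, vecs = np.linalg.eigh(Xc.T @ Xc / len(X))
    keep = vals > tol
    W      = (vecs[:, keep] * vals[keep] ** -0.5) @ vecs[:, keep].T
    W_pinv = (vecs[:, keep] * vals[keep] **  0.5) @ vecs[:, keep].T

    # Orthogonal projector onto the whitened X-Z cross-covariance directions.
    U, s, _ = np.linalg.svd(W @ (Xc.T @ Zc / len(X)), full_matrices=False)
    U = U[:, s > tol]
    M = W_pinv @ U @ U.T @ W  # the low-rank map that gets subtracted

    return lambda x: np.asarray(x) - (np.asarray(x) - mu) @ M.T

# Demo: a binary concept linearly leaked into every feature of X.
rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(2000, 1))
X = rng.normal(size=(2000, 16)) + 2.0 * Z
Xe = fit_leace(X, Z)(X)

# After erasure, the cross-covariance with Z is zero up to float error,
# so no linear probe can beat a constant predictor.
cov = (Xe - Xe.mean(0)).T @ (Z - Z.mean(0)) / len(Z)
print(np.abs(cov).max())  # ~1e-15
```

The demo checks the cross-covariance between the erased activations and Z, which the closed form drives to (numerically) zero; zero cross-covariance is the condition underlying the linear-guardedness guarantee the thread refers to.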