Our method, LEAst-squares Concept Erasure (LEACE), provably erases all linearly-encoded information about a concept from neural net activations. It does so surgically, inflicting minimal damage to other concepts. 🧵 arxiv.org/abs/2306.03819
(2/7) Concept erasure is an important tool for fairness, where it prevents features like race or gender from being used by classifiers when that's inappropriate, and for interpretability, where it lets us study the causal impact of a feature on a model's behavior.
(3/7) We also introduce a procedure called “concept scrubbing,” which applies LEACE to all layers of a deep network simultaneously. We find that LLM performance depends heavily on linearly-encoded part-of-speech information, while erasing a random feature has little to no effect.
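For intuition, here's a minimal sketch of the scrubbing idea using plain PyTorch forward hooks: apply a pre-fitted eraser to every layer's hidden states during the forward pass. The `erasers` dict, the GPT-2 layer path, and the identity placeholders are my own illustrative assumptions, not the library's actual scrubbing API.

```python
# Sketch: hook an eraser into every transformer layer so the concept is
# removed from the residual stream at each depth (illustrative, not the
# concept-erasure library's scrubbing implementation).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder identity erasers; in practice you'd fit one LEACE eraser per
# layer on that layer's activations.
erasers = {i: (lambda h: h) for i in range(len(model.transformer.h))}

def make_hook(eraser):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        scrubbed = eraser(hidden)  # remove the concept from this layer's output
        if isinstance(output, tuple):
            return (scrubbed,) + output[1:]
        return scrubbed
    return hook

handles = [
    layer.register_forward_hook(make_hook(erasers[i]))
    for i, layer in enumerate(model.transformer.h)
]
# ... evaluate the scrubbed model here ...
for h in handles:
    h.remove()
```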
(4/7) We prove that LEACE is the smallest possible linear edit, in the least-squares sense, needed to erase a concept; all previous concept erasure methods have been suboptimal. We also show empirically that it’s less destructive to model performance than previous methods.
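Concretely, the optimality claim is about the following problem (paraphrasing from memory; the paper states it for a general norm induced by a positive semi-definite matrix, not just the plain Euclidean one):

$$
\min_{\mathbf{P},\,\mathbf{b}} \;\; \mathbb{E}\,\bigl\|(\mathbf{P}\mathbf{x} + \mathbf{b}) - \mathbf{x}\bigr\|^{2}
\quad \text{s.t.} \quad \operatorname{Cov}\!\bigl(\mathbf{P}\mathbf{x} + \mathbf{b},\, \mathbf{z}\bigr) = \mathbf{0},
$$

where the zero cross-covariance constraint is, roughly, what makes the edited representation linearly uninformative about the concept $\mathbf{z}$: no linear classifier can beat a constant predictor.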
(5/7) LEACE has a closed-form solution that fits on a T-shirt. This makes it orders of magnitude faster than popular concept erasure methods like INLP and R-LACE, which require gradient-based optimization. And the solution can be efficiently updated to accommodate new data.
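For intuition, here's a minimal numpy sketch of the whiten → project → unwhiten recipe behind that closed form. I'm paraphrasing the formula from memory, so treat the variable names, tolerances, and pseudoinverse handling as my own choices rather than the canonical implementation.

```python
import numpy as np

def fit_leace_style_eraser(X, Z):
    """Fit an affine eraser from features X (n, d) and one-hot concepts Z (n, k).

    Idea: whiten X, orthogonally project out the subspace correlated with Z,
    then unwhiten. Only the mean, Cov(X), and Cov(X, Z) are needed.
    """
    X = np.asarray(X, dtype=np.float64)
    Z = np.asarray(Z, dtype=np.float64)
    mu_x = X.mean(0)
    Xc, Zc = X - mu_x, Z - Z.mean(0)

    sigma_xx = Xc.T @ Xc / len(X)   # (d, d) covariance of X
    sigma_xz = Xc.T @ Zc / len(X)   # (d, k) cross-covariance with Z

    # Symmetric (pseudo-)inverse square root of sigma_xx: the whitening map W.
    eigval, eigvec = np.linalg.eigh(sigma_xx)
    keep = eigval > 1e-10
    V = eigvec[:, keep]
    W = (V * eigval[keep] ** -0.5) @ V.T
    W_inv = (V * eigval[keep] ** 0.5) @ V.T

    # Orthogonal projection onto the column space of W @ sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > 1e-10]
    P = U @ U.T

    erase = W_inv @ P @ W  # component of (x - mu) to subtract, column convention

    def eraser(x):
        # Remove the concept subspace, measured in whitened coordinates.
        return x - (x - mu_x) @ erase.T

    return eraser
```

Note that fitting only touches a mean, a covariance, and a cross-covariance, which is also why the solution can be updated cheaply as new data streams in.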
(6/7) We’ve released all code needed to reproduce our results at github.com/EleutherAI/con…! You can also `pip install concept-erasure` to get the PyPI package.
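If I recall the package's quickstart correctly, it looks roughly like the snippet below; the class and method names are from memory, so treat them as assumptions and check the README for the current API before copying.

```python
# Rough shape of the concept-erasure quickstart (names from memory; see the
# repo README for the authoritative example). Assumed: LeaceEraser.fit takes
# features x (n, d) and concept labels z (n,) and returns a callable eraser.
import torch
from concept_erasure import LeaceEraser

n, d, k = 2048, 128, 2
x = torch.randn(n, d)
z = torch.randint(0, k, (n,))

eraser = LeaceEraser.fit(x, z)
x_erased = eraser(x)  # linear probes should no longer recover z from x_erased
```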
MLPs and GLUs are hard to interpret, but they make up most transformer parameters.
Linear and quadratic functions are easier to interpret.
We show how to convert MLPs & GLUs into polynomials in closed form, allowing you to use SVD and direct inspection for interpretability 🧵
We use SVD on linearized MLPs to generate adversarial examples, which transfer back to the original MLP!
This shows that our approximants are capturing the out-of-distribution behavior of the original network.
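As a toy version of the SVD-based attack (not the paper's exact recipe): given a least-squares linear approximation `W` of the network, perturb an input along its top right singular vector, i.e. the input direction the linear model amplifies the most.

```python
import numpy as np

def top_amplified_direction(W):
    """Top right singular vector of the linearized network W (d_out, d_in):
    the input direction the linear approximation amplifies the most."""
    _, _, Vt = np.linalg.svd(W)
    return Vt[0]

def adversarial_example(x, W, eps=0.5):
    """Perturb x along the most-amplified direction (or its negation).
    If the approximant captures the MLP's off-distribution behavior, the
    perturbation tends to transfer back to the original network."""
    v = top_amplified_direction(W)
    return x + eps * v
```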
How is that possible?
We compute the least-squares approximation to the original MLP, assuming the inputs follow a Gaussian mixture distribution.
Gaussianity means we only need a few summary statistics of the inputs (mean & covariance), without overfitting to any particular dataset. We capture features of the network, not the data.
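The paper derives the approximation analytically; purely for intuition, here's what the least-squares linear fit looks like if you estimate the needed moments by sampling from an assumed Gaussian input distribution (the function name and Monte Carlo estimation are mine, not the paper's closed form).

```python
import numpy as np

def linearize_under_gaussian(f, mu, sigma, n_samples=100_000, seed=0):
    """Least-squares linear approximation f(x) ~ A x + b when x ~ N(mu, sigma).

    The optimal coefficients depend only on the input mean/covariance and the
    cross-covariance between x and f(x):
        A = Cov(f(x), x) @ sigma^{-1},   b = E[f(x)] - A @ mu.
    The paper computes these moments in closed form; this sketch estimates them
    by Monte Carlo purely for illustration. f must accept a batch (n, d_in)
    and return (n, d_out).
    """
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(mu, sigma, size=n_samples)  # (n, d_in)
    y = f(x)                                                # (n, d_out)

    xc = x - x.mean(0)
    yc = y - y.mean(0)
    cov_xy = xc.T @ yc / n_samples                          # (d_in, d_out)

    A = np.linalg.solve(sigma, cov_xy).T                    # (d_out, d_in)
    b = y.mean(0) - A @ mu
    return A, b
```

You can then run SVD on `A` and inspect its singular directions, as in the adversarial-example sketch above.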
What are the chances you'd get a fully functional language model by randomly guessing the weights?
We crunched the numbers and here's the answer:
My colleague @ascherlis and I developed a method for estimating the probability of sampling a neural network in a behaviorally-defined region from a Gaussian or uniform prior.
You can think of this as a measure of complexity: less probable means more complex.
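One natural way to make "less probable means more complex" precise (my formalization, not necessarily the paper's exact definition) is description length under the prior:

$$
C(R) \;=\; -\log_2 \Pr_{\theta \sim p}\!\bigl[\theta \in R\bigr],
$$

where $p$ is the Gaussian or uniform prior over weights and $R$ is the behaviorally-defined region, e.g. all weight settings whose loss on some task falls below a threshold.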
It works by exploring random directions in weight space, starting from an "anchor" network.
The distance from the anchor to the edge of the region, along the random direction, gives us an estimate of how big (or how probable) the region is as a whole.
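Here's a minimal sketch of that exploration step, under my own simplifying assumptions (a loss threshold defines the region; the boundary is found by doubling the step and then bisecting). How the per-direction distances get aggregated into a probability under the prior is the method's actual contribution and isn't reproduced here.

```python
import numpy as np

def edge_distance(loss_fn, anchor, threshold, direction, t_max=1e3, n_bisect=30):
    """Distance from the anchor weights to the edge of the behavioral region
    {theta : loss_fn(theta) <= threshold}, along one random direction."""
    direction = direction / np.linalg.norm(direction)

    # Exponentially grow the step until the behavior breaks (or we hit the cap).
    t_hi = 1.0
    while loss_fn(anchor + t_hi * direction) <= threshold and t_hi < t_max:
        t_hi *= 2.0
    if loss_fn(anchor + t_hi * direction) <= threshold:
        return t_hi  # region extends at least this far along this direction
    t_lo = t_hi / 2.0 if t_hi > 1.0 else 0.0

    # Bisect between the last in-region and first out-of-region step sizes.
    for _ in range(n_bisect):
        t_mid = 0.5 * (t_lo + t_hi)
        if loss_fn(anchor + t_mid * direction) <= threshold:
            t_lo = t_mid
        else:
            t_hi = t_mid
    return t_lo

# Usage sketch: repeat over many random directions and aggregate.
# rng = np.random.default_rng(0)
# dists = [edge_distance(loss_fn, theta_star, eps,
#                        rng.standard_normal(theta_star.shape))
#          for _ in range(256)]
```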