Hadas Orgad Profile picture
Research Fellow @ Kempner Institute, Harvard | Interested in AI interpretability, robustness & safety
Apr 13 11 tweets 4 min read
New paper: LLMs encode harmful content generation in a distinct, unified mechanism

Using weight pruning, we find that harmful generation depends on a tiny subset of the weights that are shared across harm types and separate from benign capabilities.

🧵 Image Paper >>

We use weight pruning to probe model internals.

Result: pruning ~0.0005% of model parameters, harmful generation drops dramatically, while general capabilities remain largely intact. arxiv.org/abs/2604.09544Image
Image
Image