Latest Twitter Threads by @OrgadHadas on Thread Reader App

Apr 13 • 11 tweets • 4 min read

New paper: LLMs encode harmful content generation in a distinct, unified mechanism

Using weight pruning, we find that harmful generation depends on a tiny subset of the weights that are shared across harm types and separate from benign capabilities.

🧵

Paper >>

We use weight pruning to probe model internals.

Result: pruning ~0.0005% of model parameters, harmful generation drops dramatically, while general capabilities remain largely intact. arxiv.org/abs/2604.09544

Share this page!

Enter URL or ID to Unroll