Or Shafran
Feb 5
It's time to look past dictionary learning for decomposing LM activations.

What happens when we instead leverage local geometry?

We find a natural region-based decomposition that yields better steering and localization 🧵 1/

Our shift: From directions (Global) ➡️ regions (Local)

Using Mixtures of Factor Analyzers (MFA) by @ZoubinGhahrama1 we model activation space as a collection of low-rank Gaussians:

📍 Centroid (μ) = region (where you are)

📐 Low-rank subspace (W) = local variation 2/
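
To make the centroid + subspace picture concrete, here is a minimal NumPy sketch of the textbook MFA model, where each component is a Gaussian with mean μ_k and covariance W_k W_kᵀ + Ψ_k. It is an illustration under assumptions, not the thread's actual code: the diagonal noise term `psis`, the mixing weights `pis`, and both function names are hypothetical and do not appear in the thread.

```python
import numpy as np

def mfa_log_responsibilities(x, pis, mus, Ws, psis):
    """Log posterior over components (regions) for one activation x.

    Assumed shapes (hypothetical, for illustration only):
      x: (d,), pis: (K,), mus: (K, d), Ws: (K, d, r),
      psis: (K, d) diagonal noise variances.
    """
    d = x.shape[0]
    log_joint = np.empty(len(pis))
    for k in range(len(pis)):
        # Low-rank-plus-diagonal covariance of component k.
        cov = Ws[k] @ Ws[k].T + np.diag(psis[k])
        diff = x - mus[k]
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)
        log_joint[k] = np.log(pis[k]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    # Normalize with a log-sum-exp shift for numerical stability.
    shifted = log_joint - np.max(log_joint)
    return shifted - np.log(np.sum(np.exp(shifted)))

def local_coordinates(x, mu, W, psi):
    """Posterior mean of the latent factors: W^T (W W^T + Psi)^{-1} (x - mu)."""
    cov = W @ W.T + np.diag(psi)
    return W.T @ np.linalg.solve(cov, x - mu)
```

Under these assumptions, the argmax of the responsibilities says which region an activation falls in (the centroid), and `local_coordinates` gives its position within that region's low-rank subspace (the local variation).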