Goodfire
Jun 25 · 7 tweets · 3 min read
We removed an LM's ability to speak German by fine-tuning on only 4 German tokens.

As part of a 1-day hackathon with our product Silico, we removed a 67M-parameter language model's ability to predict German text by tuning only a scalar factor on one subcomponent of the weights. (1/6)
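To make "tuning only a scalar factor on one subcomponent" concrete, here is a minimal sketch. It assumes the decomposition writes a weight matrix as a sum of rank-one components, which the thread does not spell out. The names `reconstruct`, `scale_component`, `U`, `V`, and `alpha` are hypothetical and are not Silico's API; the matrices are random toy data.

```python
import numpy as np

def reconstruct(U, V, scales):
    """Rebuild W = sum_c scales[c] * outer(U[c], V[c]).

    U: (C, d_out), V: (C, d_in), scales: (C,). Assumed rank-one form.
    """
    return np.einsum("c,co,ci->oi", scales, U, V)

def scale_component(scales, index, alpha):
    """Return a copy of `scales` with one component's factor set to `alpha`."""
    new_scales = scales.copy()
    new_scales[index] = alpha
    return new_scales

# Toy decomposition with 4 components (random, for illustration only).
rng = np.random.default_rng(0)
C, d_out, d_in = 4, 3, 5
U = rng.normal(size=(C, d_out))
V = rng.normal(size=(C, d_in))
scales = np.ones(C)

W = reconstruct(U, V, scales)

# "Edit" the model by changing a single scalar: switch component 2 off.
edited_scales = scale_component(scales, index=2, alpha=0.0)
W_edited = reconstruct(U, V, edited_scales)

# The only thing that changed is that one component's contribution.
removed = np.outer(U[2], V[2])
```

In an actual fine-tune the scalar would be learned by gradient descent rather than set to zero by hand; the sketch only shows that the edit touches one degree of freedom.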
This was an early exploration in fine-tuning with *parameter decomposition* (see quote), our method that divides a model's weight matrices into interpretable, sparsely-activating components.

We picked German as it seemed to be the model's strongest non-English language. (2/6)
We benchmarked vs LoRA fine-tuning. Our edit matched its German removal with far fewer tokens.

Strikingly, it also left other languages almost untouched.

The LoRAs often wrecked French, Spanish, Italian, and sometimes English, while our edit mostly left them alone. (3/6)
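One way to quantify on-target versus off-target effects is the change in mean next-token loss per language, measured in nats. The sketch below assumes you already have the probability each model assigned to the correct next token; the function names and all numbers are made up for illustration and are not results from the experiment.

```python
import math

def mean_loss_nats(correct_token_probs):
    """Mean negative log-likelihood (natural log, so the unit is nats)."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)

def loss_deltas(base, edited):
    """Per-language loss increase of the edited model over the base model."""
    return {
        lang: mean_loss_nats(edited[lang]) - mean_loss_nats(base[lang])
        for lang in base
    }

# Hypothetical per-token probabilities of the correct next token.
base = {"de": [0.5, 0.25], "fr": [0.5, 0.5]}
edited = {"de": [0.125, 0.0625], "fr": [0.5, 0.5]}

deltas = loss_deltas(base, edited)
# A large delta on "de" with a near-zero delta on "fr" is the pattern
# a targeted edit aims for.
```

A positive delta on the target language means the edit removed capability there, and deltas near zero elsewhere mean other languages were left alone.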
In a sense this is cheating: we're indirectly exploiting the tokens from when we did the parameter decomposition and interpreted the resulting subcomponents.

But if our decomposition is good, that cost can be amortized over arbitrarily many tasks & component edits. (4/6)
Plus, that interpretability lets us notice and fix problems.

E.g.: initially we tuned the top 16 German-related components, but their labels showed most were about foreign languages in general.

So we narrowed to the single component for German alone, improving precision. (5/6)
This is an early demo of how parameter decomposition could enable targeted, predictable model editing.

Details on this experiment: lesswrong.com/posts/ieoWstub…

If you want to run experiments on your model too, learn more and request access to Silico: goodfire.ai/silico
Correction: a plotting error caused the bars in the plot of off-target effects to display at 0.01 nats above the true means. The corrected plot is below: [Image]
