Goodfire

@GoodfireAI

Jun 25 • 7 tweets • 3 min read • Read on X

We removed an LM's ability to speak German by fine-tuning on only 4 German tokens.

As part of a 1-day hackathon with our product Silico, we removed a 67M-parameter language model's ability to predict German text by tuning only a scalar factor on one subcomponent of its weights. (1/6)

This was an early exploration in fine-tuning with *parameter decomposition* (see quote), our method which divides a model's weight matrices into interpretable, sparsely-activating components.

We picked German as it seemed to be the model's strongest non-English language. (2/6)

https://twitter.com/3079387847/status/2051717264286609516
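
A minimal sketch of what "tuning only a scalar factor on one subcomponent" could look like. It assumes, and the thread does not confirm, that a weight matrix is written as a sum of rank-one components and that the edit rescales a single one of them. The names `outer`, `compose`, and `alpha` are illustrative and are not Silico's API.

```python
# Illustrative only: assumes W = sum_c outer(u_c, v_c). The thread does not
# specify the actual form of the decomposition.

def outer(u, v):
    """Rank-one matrix u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def compose(components, scales):
    """Rebuild the weight matrix as a scaled sum of rank-one components."""
    rows, cols = len(components[0][0]), len(components[0][1])
    W = [[0.0] * cols for _ in range(rows)]
    for (u, v), s in zip(components, scales):
        for i, row in enumerate(outer(u, v)):
            for j, x in enumerate(row):
                W[i][j] += s * x
    return W

# Two hypothetical components; pretend index 1 was labeled "German".
components = [([1.0, 0.0], [1.0, 2.0]), ([0.0, 1.0], [3.0, 4.0])]
original = compose(components, [1.0, 1.0])

alpha = 0.0  # the single tunable scalar; 0.0 switches the component off
edited = compose(components, [1.0, alpha])
```

Here `alpha` is set by hand. In a setup like the one described, it would presumably be learned from a handful of tokens while every other parameter stays frozen.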

We benchmarked vs LoRA fine-tuning. Our edit matched its German removal with far fewer tokens.

Strikingly, it also left other languages almost untouched.

The LoRAs often wrecked French, Spanish, Italian, and sometimes English, while our edit mostly left them alone. (3/6)
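
One way to quantify "left other languages almost untouched" is sketched below. It assumes the effect is measured as the change in mean next-token cross-entropy in nats, the unit mentioned in the correction at the end of the thread. The probabilities are made-up placeholders, not results from the experiment.

```python
import math

def mean_nll(probs):
    """Mean negative log-likelihood in nats of the correct next tokens."""
    return -sum(math.log(p) for p in probs) / len(probs)

def loss_increase(probs_before, probs_after):
    """Change in mean loss caused by an edit; positive means worse."""
    return mean_nll(probs_after) - mean_nll(probs_before)

# Placeholder probabilities assigned to the correct token (not real data).
before = {"German": [0.5, 0.25], "French": [0.5, 0.25]}
after = {"German": [0.05, 0.025], "French": [0.5, 0.25]}

deltas = {lang: loss_increase(before[lang], after[lang]) for lang in before}
```

A targeted edit would show a large positive delta for German and deltas near zero for the other languages.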

In a sense this is cheating: we're indirectly exploiting the tokens used when we ran the parameter decomposition and interpreted the resulting subcomponents.

But if our decomposition is good, that cost can be amortized over arbitrarily many tasks and component edits. (4/6)

Plus, that interpretability lets us notice and fix problems.

For example, we initially tuned the top 16 German-related components, but their labels showed that most were about foreign languages in general.

So we narrowed to the single component for German alone, improving precision. (5/6)

This is an early demo of how parameter decomposition could enable targeted, predictable model editing.

Details on this experiment: lesswrong.com/posts/ieoWstub…

If you want to run experiments on your model too, learn more and request access to Silico: goodfire.ai/silico

Correction: a plotting error caused the bars in the plot of off-target effects to appear 0.01 nats above the true means. The corrected plot is below:

More from @GoodfireAI

Goodfire

@GoodfireAI

Jun 23

https://twitter.com/1818809548037062656/status/2052420446910644616

Stories have shapes: a comedy rises toward joy; a tragedy falls into loss.

Inside an LLM, that’s visible more literally: as an LLM reads a story, its internal activations trace a wandering path that reflects the model’s sense of what kind of story it is reading. (1/5)

A story's emotions shift sentence to sentence. To see if the model keeps up, we stop after each sentence and:

- Harvest the internal activations from the last token.
- Ask it to rate surprise, disgust, anger, happiness, sadness, and fear so far, each on a 0–10 scale.

(2/5)
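
The per-sentence procedure above can be sketched as a simple loop. `get_activation` and `get_ratings` are hypothetical callables standing in for a real model; the thread does not describe the actual tooling, so this only illustrates the control flow.

```python
def trace_story(sentences, get_activation, get_ratings):
    """Collect one activation and one set of ratings per sentence prefix.

    get_activation and get_ratings are hypothetical stand-ins for a real
    model; each takes the story text read so far.
    """
    records = []
    for i in range(1, len(sentences) + 1):
        prefix = " ".join(sentences[:i])
        records.append({
            "prefix": prefix,
            "activation": get_activation(prefix),  # last-token activations
            "ratings": get_ratings(prefix),        # emotion -> score, 0-10
        })
    return records

EMOTIONS = ["surprise", "disgust", "anger", "happiness", "sadness", "fear"]

# Dummy stand-ins so the sketch runs without a model.
story = ["The dog was lost.", "Then it came home."]
trace = trace_story(
    story,
    get_activation=lambda text: [float(len(text))],
    get_ratings=lambda text: {e: 0 for e in EMOTIONS},
)
```

Each record pairs what the model says (the ratings) with what it computes internally (the activation), so the two can be compared sentence by sentence.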

Both approaches show that the model tracks the story's emotional arc: in its stated ratings, and in its internal geometry, where the activations wander along a curved manifold of emotions. (3/5)

Goodfire

@GoodfireAI

Jun 11

Have you debugged your training data? You might not like what you find.

Introducing predictive data debugging: reveal and shape what your model will learn before training.

In DPO datasets, we found broken guardrails, hallucinations, and fish fart fan fiction (seriously). (1/9)

Predictive data debugging reveals which behaviors DPO will amplify or suppress before you train (R² = 0.9 against what the model actually learns).

It then traces behaviors to responsible data, and modulates learning to prevent undesired effects. (2/9)
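
For readers unfamiliar with the R² figure quoted above, here is a small sketch. It assumes R² is the ordinary coefficient of determination between predicted and observed behavior changes; the thread does not say how it was computed, and the numbers are placeholders.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Placeholder numbers: predicted vs. observed change in some behavior score.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
score = r_squared(observed, predicted)
```

A value of 1.0 means the predictions match the observations exactly; values near 0 mean they do no better than always predicting the mean.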

The key idea: interpreting a model also lets us interpret a dataset.

Passing data through an interpreted model reveals what the model computes when processing each example.

Those concepts predict what the model will move toward, or away from, if you train on that data. (3/9)

Goodfire

@GoodfireAI

May 7

Neural networks might speak English, but they think in shapes.

Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision.

Starting today, we’re releasing a series of posts on this research agenda. 🧵

Just as the real world is highly structured, neural networks are full of rich geometric structure: time, space, numbers, color, the tree of life, new biomarkers, and more are represented along curved paths and surfaces.

This is true across models, modalities, and domains! (2/8)

New methods to understand this “neural geometry” are a crucial frontier in understanding, improving, and controlling models. (3/8)

Goodfire

@GoodfireAI

Apr 30

Introducing Silico: the platform for building AI models with the precision of written software.

Silico lets researchers and engineers see inside their models, debug failures, and intentionally design them from the ground up.

Early access is open now. 🧵(1/10)

We’ve used interpretability to discover a novel class of Alzheimer’s biomarkers, teach a language model to correct its own hallucinations, and diagnose performance bottlenecks in a robotics model.

Silico brings those frontier techniques to everyone. (2/10)

Silico introduces our model neuroscientist: an autonomous agent that plans and runs concurrent experiments on your model.

It works with your team in our model design environment, where you can organize research threads, replicate and extend papers, and collaborate on findings.

Here are 5 things you can do with Silico:
(3/10)

Goodfire

@GoodfireAI

Apr 14

We achieved state-of-the-art performance in predicting which of 4.2 million genetic variants cause diseases by interpreting a genomics model, in a new preprint with @MayoClinic.

We're now releasing an open-source database for all variants in the NIH's ClinVar database. 🧵 (1/8)

The core challenge of genomic medicine is figuring out what genetic variants actually do, so we can diagnose and treat the resulting diseases. But many of the millions of variants in clinical databases are still “variants of uncertain significance” (VUS). (2/8)

By training our new covariance probes on @arcinstitute’s Evo 2 (a genomic foundation model trained on massive DNA data), we achieve state-of-the-art prediction of whether variants cause disease, with strong generalization across variant types. (3/8)

Goodfire

@GoodfireAI

Jan 28

We've identified a novel class of biomarkers for Alzheimer's detection, using interpretability, with @PrimaMente.

How we did it, and how interpretability can power scientific discovery in the age of digital biology: (1/6)

Bio foundation models (e.g. AlphaFold) can achieve superhuman performance, so they must contain novel scientific knowledge. @PrimaMente's Pleiades epigenetics model is one such case: it's SOTA on early Alzheimer's detection.

But that knowledge is locked inside a black box. (2/6)

Interpretability is the key to unlock that knowledge, extracting what the Pleiades model knows about epigenetics and Alzheimer's (or anything else!)

It's the missing step between black-box predictive power and true scientific understanding. (3/6)
