We removed an LM's ability to speak German by fine-tuning on only 4 German tokens.
As part of a 1-day hackathon with our product Silico, we removed a 67M-parameter language model's ability to predict German text, by tuning only a scalar factor on one subcomponent of the weights. (1/6)
This was an early exploration in fine-tuning with *parameter decomposition* (see quote), our method which divides a model's weight matrices into interpretable, sparsely-activating components.
We picked German as it seemed to be the model's strongest non-English language. (2/6)
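To make "tuning only a scalar factor on one subcomponent" concrete, here is a minimal sketch rather than our actual implementation. It assumes, purely for illustration, that a weight matrix is the sum of rank-one components and that the edit rescales one of them. The names `U`, `V`, `compose`, and `target`, the toy sizes, and the choice to set the scale to zero instead of tuning it are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: the weight matrix decomposes into rank-one components,
# W = sum_c scales[c] * outer(U[c], V[c]). Sizes here are toy values.
d_out, d_in, n_components = 8, 6, 4
U = rng.normal(size=(n_components, d_out))
V = rng.normal(size=(n_components, d_in))


def compose(scales):
    """Rebuild the weight matrix from per-component scalar factors."""
    return np.einsum("c,co,ci->oi", scales, U, V)


scales = np.ones(n_components)
W_original = compose(scales)

# The edit touches a single scalar: the factor on one component.
# In practice that scalar would be tuned on data; here it is simply
# set to zero to show the effect on the weights.
target = 2  # hypothetical index of the component to suppress
edited_scales = scales.copy()
edited_scales[target] = 0.0
W_edited = compose(edited_scales)

# The change to the weights is exactly the removed component.
delta = W_original - W_edited
```

Because every other component keeps its scale of 1, the weight change is confined to one rank-one direction, which is one way to picture why such an edit can be narrow.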
We benchmarked vs LoRA fine-tuning. Our edit matched its German removal with far fewer tokens.
Strikingly, it also left other languages almost untouched.
The LoRAs often wrecked French, Spanish, Italian, and sometimes English, while our edit mostly left them alone. (3/6)
In a sense this is cheating: we're indirectly exploiting the tokens from when we did the parameter decomposition and interpreted the resulting subcomponents.
But if our decomposition is good, that cost can be amortized over arbitrarily many tasks & component edits. (4/6)
Plus, that interpretability lets us notice and fix problems.
E.g.: initially we tuned the top 16 German-related components, but their labels showed most were about foreign languages in general.
So we narrowed to the single component for German alone, improving precision. (5/6)
This is an early demo of how parameter decomposition could enable targeted, predictable model editing.
If you want to run experiments on your model too, learn more and request access to Silico: goodfire.ai/silico
Correction: a plotting error caused the bars in the plot of off-target effects to display at 0.01 nats above the true means. The corrected plot is below:
Stories have shapes: a comedy rises toward joy; a tragedy falls into loss.
Inside an LLM, that’s visible more literally: as an LLM reads a story, its internal activations trace a wandering path that reflects the model’s sense of what kind of story it is reading. (1/5)
A story's emotions shift sentence to sentence. To see if the model keeps up, we stop after each sentence and:
- Harvest the internal activations from the last token.
- Ask it to rate surprise, disgust, anger, happiness, sadness, and fear so far, 0–10.
(2/5)
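The stop-after-each-sentence loop above can be sketched as follows. This is an illustrative outline, not the code we ran: the sentence splitter, the prompt wording, and the names `sentence_prefixes` and `rating_prompt` are assumptions, and the activation-harvesting step is left out because it depends on the model and hooking library in use.

```python
import re

EMOTIONS = ["surprise", "disgust", "anger", "happiness", "sadness", "fear"]


def sentence_prefixes(story):
    """Return the story truncated after each sentence, shortest first."""
    # Naive split on whitespace that follows ., ! or ? (an assumption;
    # a real pipeline may use a proper sentence tokenizer).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", story.strip()) if s]
    return [" ".join(sentences[: i + 1]) for i in range(len(sentences))]


def rating_prompt(prefix):
    """Build a hypothetical prompt asking for 0-10 emotion ratings so far."""
    names = ", ".join(EMOTIONS)
    return f"{prefix}\n\nRate the story so far for {names}, each from 0 to 10."


story = "The sun rose. Mara lost her dog! She found him by dusk."
prefixes = sentence_prefixes(story)
prompts = [rating_prompt(p) for p in prefixes]

# For each prefix, one would also run the model on the prefix alone and
# record the activations at its last token (model-specific, omitted here).
```

Each prefix yields one rating prompt and one last-token activation vector, so a story with N sentences gives N points along its arc.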
Both approaches show that the model tracks the story's emotional arc — in its stated ratings, and in its internal geometry, where the activations wander along a curved manifold of emotions. (3/5)
Neural networks might speak English, but they think in shapes.
Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision.
Starting today, we’re releasing a series of posts on this research agenda. 🧵
Just as the real world is highly structured, neural networks are full of rich geometric structure: time, space, numbers, color, the tree of life, new biomarkers, and more are represented along curved paths and surfaces.
This is true across models, modalities, and domains! (2/8)
New methods to understand this “neural geometry” are a crucial frontier in understanding, improving, and controlling models. (3/8)
Introducing Silico: the platform for building AI models with the precision of written software.
Silico lets researchers and engineers see inside their models, debug failures, and intentionally design them from the ground up.
Early access is open now. 🧵(1/10)
We’ve used interpretability to discover a novel class of Alzheimer’s biomarkers, teach a language model to correct its own hallucinations, and diagnose performance bottlenecks in a robotics model.
Silico brings those frontier techniques to everyone. (2/10)
Silico introduces our model neuroscientist: an autonomous agent that plans and runs concurrent experiments on your model.
It works with your team in our model design environment, where you can organize research threads, replicate and extend papers, and collaborate on findings.
We achieved state-of-the-art performance in predicting which of 4.2 million genetic variants cause diseases by interpreting a genomics model, in a new preprint with @MayoClinic.
We're now releasing an open-source database for all variants in the NIH's ClinVar database. 🧵(1/8)
The core challenge of genomic medicine is figuring out what genetic variants actually do - so we can diagnose and treat resulting diseases. But many of the millions of variants in clinical databases are still “variants of uncertain significance” (VUS). (2/8)
By training our new covariance probes on @arcinstitute’s Evo 2 - a genomic foundation model trained on massive DNA data - we achieve state-of-the-art prediction of whether variants cause disease, with strong generalization across variant types. (3/8)
We've identified a novel class of biomarkers for Alzheimer's detection - using interpretability - with @PrimaMente.
How we did it, and how interpretability can power scientific discovery in the age of digital biology: (1/6)
Bio foundation models (e.g. AlphaFold) can achieve superhuman performance, so they must contain novel scientific knowledge. @PrimaMente's Pleiades epigenetics model is one such case - it's SOTA on early Alzheimer's detection.
But that knowledge is locked inside a black box. (2/6)
Interpretability is the key to unlock that knowledge, extracting what the Pleiades model knows about epigenetics and Alzheimer's (or anything else!)
It's the missing step between black-box predictive power and true scientific understanding. (3/6)