Goodfire
Using interpretability to understand, learn from, and design AI.
Jan 28 6 tweets 2 min read
We've identified a novel class of biomarkers for Alzheimer's detection - using interpretability - with @PrimaMente.

How we did it, and how interpretability can power scientific discovery in the age of digital biology: (1/6)

Bio foundation models (e.g. AlphaFold) can achieve superhuman performance, so they must contain novel scientific knowledge. @PrimaMente's Pleiades epigenetics model is one such case - it's SOTA on early Alzheimer's detection.

But that knowledge is locked inside a black box (2/6)
Nov 11, 2025 4 tweets 2 min read
New research: are prompting and activation steering just two sides of the same coin?

@EricBigelow @danielwurgaft @EkdeepL and coauthors argue they are: ICL and steering have formally equivalent effects. (1/4)

The paper formalizes a Bayesian framework for model control: altering a model's "beliefs" over which persona or data source it's emulating.

Context (prompting) and internal representations (steering) offer dual mechanisms to alter those "beliefs". (2/4)
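To make the comparison concrete, here is a minimal sketch of the two control knobs in question (our own illustration in PyTorch, not code from the paper; the GPT-2 checkpoint, the layer index, and the toy "persona" vector are placeholder choices): conditioning on extra context tokens vs. adding a vector to a hidden layer.

```python
# Minimal sketch (illustrative, not the paper's code): two ways to shift a
# model's behavior -- prepending context tokens ("prompting") vs. adding a
# fixed vector to a hidden layer ("activation steering").
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # hypothetical choice of layer to steer

def hidden_at_layer(text):
    ids = tok(text, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]  # last-token activation at that layer

# Toy "persona" direction: difference of activations on two short prompts.
steer = (hidden_at_layer("You are a pirate. Arr, matey!")
         - hidden_at_layer("You are a helpful assistant.")).detach()

def generate(prompt, steering=None, scale=4.0):
    handle = None
    if steering is not None:
        def hook(_module, _inp, output):
            # Add the steering vector to the block's output at every position.
            return (output[0] + scale * steering,) + output[1:]
        handle = model.transformer.h[LAYER].register_forward_hook(hook)
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=30, do_sample=False)
    if handle:
        handle.remove()
    return tok.decode(out[0])

# "Prompting": put the persona in context.  "Steering": put it in activations.
print(generate("You are a pirate. Tell me about the sea."))
print(generate("Tell me about the sea.", steering=steer))
```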
Nov 6, 2025 7 tweets 3 min read
LLMs memorize a lot of training data, but memorization is poorly understood.

Where does it live inside models? How is it stored? How much is it involved in different tasks?

@jack_merullo_ & @srihita_raju's new paper examines all of these questions using loss curvature! (1/7)

The method is like PCA, but for loss curvature instead of variance: it decomposes weight matrices into components ordered by curvature, and removes the long tail of low-curvature ones.

What's left are the weights that most affect loss across the training set. (2/7)
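For intuition, a rough sketch of that kind of decomposition (our own toy illustration using a K-FAC-style Kronecker curvature approximation; not necessarily the paper's exact procedure, and all sizes here are made up): rotate a layer's weights into a curvature-ordered basis and zero out the low-curvature tail.

```python
# Illustrative sketch only (not the paper's implementation): decompose a linear
# layer's weights in a curvature-ordered basis and drop low-curvature components.
# Curvature is approximated K-FAC style as a Kronecker product of A (input
# second moments) and G (second moments of gradients w.r.t. the layer outputs).
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 64, 32, 1000
W = rng.normal(size=(d_out, d_in))           # toy layer weights
X = rng.normal(size=(n, d_in))               # layer inputs seen in training
Gy = rng.normal(size=(n, d_out))             # grads of loss w.r.t. layer outputs

A = X.T @ X / n                              # input second-moment matrix
G = Gy.T @ Gy / n                            # output-gradient second moments
sa, Ua = np.linalg.eigh(A)
sg, Ug = np.linalg.eigh(G)

# In the rotated basis W' = Ug^T W Ua, the curvature of entry (i, j) under the
# Kronecker approximation is sg[i] * sa[j]. Keep only high-curvature entries.
W_rot = Ug.T @ W @ Ua
curvature = np.outer(sg, sa)
threshold = np.quantile(curvature, 0.90)     # keep top 10% by curvature (toy choice)
W_kept = np.where(curvature >= threshold, W_rot, 0.0)

W_edited = Ug @ W_kept @ Ua.T                # rotate back: the weights minus
                                             # the long low-curvature tail
print("fraction of components kept:", (curvature >= threshold).mean())
```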
Oct 29, 2025 6 tweets 2 min read
Why use LLM-as-a-judge when you can get the same performance for 15–500x cheaper?

Our new research with @RakutenGroup on PII detection finds that SAE probes:
- transfer from synthetic to real data better than normal probes
- match GPT-5 Mini performance at 1/15 the cost

(1/6)

PII detection in production AI systems requires methods which are very lightweight, have high recall, and perform well after training on only synthetic data (can't train on customer PII!)

These constraints mean many approaches don't work well. (2/6)
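For concreteness, a minimal sketch of what an "SAE probe" looks like in code (our illustration with synthetic stand-in data; the SAE weights, dimensions, and train/test split are placeholders, not the actual setup from this work):

```python
# Minimal sketch with synthetic data (not the paper's code): an "SAE probe" is
# a lightweight linear classifier trained on sparse-autoencoder features of a
# model's hidden activations, rather than a per-example call to an LLM judge.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, d_sae, n = 256, 1024, 2000

# Stand-ins: hidden activations for text spans, plus labels (1 = contains PII).
acts = rng.normal(size=(n, d_model)).astype(np.float32)
labels = rng.integers(0, 2, size=n)

# Stand-in SAE encoder weights; in practice these come from a pretrained SAE.
W_enc = rng.normal(size=(d_model, d_sae)).astype(np.float32) / np.sqrt(d_model)
b_enc = np.zeros(d_sae, dtype=np.float32)

def sae_features(x):
    # ReLU encoder: sparse, nonnegative feature activations.
    return np.maximum(x @ W_enc + b_enc, 0.0)

probe = LogisticRegression(max_iter=1000, C=0.1)
probe.fit(sae_features(acts[:1500]), labels[:1500])   # train on a "synthetic" split
print("held-out accuracy:", probe.score(sae_features(acts[1500:]), labels[1500:]))
```

The probe itself is just a linear classifier over precomputed features, which is what makes it so much cheaper to run than an LLM judge per example.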
Jun 28, 2025 7 tweets 2 min read
(1/7) New research: how can we understand how an AI model actually works? Our method, SPD, decomposes the *parameters* of neural networks, rather than their activations - akin to understanding a program by reverse-engineering the source code vs. inspecting runtime behavior.

(2/7) Most interp methods study activations. But we want causal, mechanistic stories: "Activations tell you a lot about the dataset, but not as much about the computations themselves. … they're shadows cast by the computations."
Parameters are where computations actually live.
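As a loose analogy for what "decomposing the parameters" can mean (this toy uses an SVD split into rank-1 pieces purely for illustration; it is not the SPD method itself):

```python
# Loose illustration (not SPD itself): write a layer's weight matrix as a sum
# of rank-1 "parameter components" via SVD, then attribute an input's output to
# each component by ablating it. SPD learns its decomposition differently; this
# only conveys the shape of the idea -- analyzing parameters, not activations.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))                 # toy layer weights
x = rng.normal(size=32)                       # one input

U, S, Vt = np.linalg.svd(W, full_matrices=False)
components = [S[k] * np.outer(U[:, k], Vt[k]) for k in range(len(S))]
assert np.allclose(sum(components), W)        # components sum back to the weights

y = W @ x
for k, C in enumerate(components[:5]):
    # How much does removing this parameter component change the output?
    y_ablated = (W - C) @ x
    print(f"component {k}: effect norm = {np.linalg.norm(y - y_ablated):.3f}")
```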
May 27, 2025 9 tweets 4 min read
We created a canvas that plugs into an image model’s brain.

You can use it to generate images in real-time by painting with the latent concepts the model has learned.

Try out Paint with Ember for yourself 👇


(2/)
Technical blog post: goodfire.ai/blog/painting-…

Try it here: paint.goodfire.ai
Apr 15, 2025 5 tweets 2 min read
What goes on inside the mind of a reasoning model? Today we're releasing the first open-source sparse autoencoders (SAEs) trained on DeepSeek's 671B parameter reasoning model, R1—giving us new tools to understand and steer model thinking.

Why does this matter?

(2/) Reasoning models like DeepSeek R1, OpenAI's o3, and Anthropic's Claude 3.7 are changing how we use AI, providing more reliable and coherent responses for complex problems. But understanding their internal mechanisms remains challenging.
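For readers new to SAEs, here is the generic shape of one in code (a sketch with placeholder sizes; not the architecture or training setup of the released R1 SAEs):

```python
# Generic sparse autoencoder sketch (sizes and details are placeholders): an SAE
# reconstructs a model's hidden activations through a wide, sparse bottleneck,
# so each latent tends to align with an interpretable "feature" that can also
# be used to steer the model.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=4096, d_sae=65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))    # sparse, nonnegative features
        recon = self.decoder(feats)               # reconstruction of the activations
        return recon, feats

sae = SparseAutoencoder()
acts = torch.randn(8, 4096)                       # stand-in hidden activations
recon, feats = sae(acts)

# Training objective: reconstruct well while keeping features sparse (L1 penalty).
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
print(loss.item(), (feats > 0).float().mean().item())
```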
Feb 19, 2025 8 tweets 3 min read
We are excited to announce our collaboration with @arcinstitute on their state-of-the-art biological foundation model, Evo 2. Our work reveals how models like Evo 2 process biological information - from DNA to proteins - in ways we can now decode.

(2/) Today, Arc announced Evo 2, a next-generation biological model that processes million-base-pair sequences at nucleotide resolution. Trained across all life forms, it predicts and generates complex biological sequences. You can preview our interpretability work in the Evo 2 preprint and our blog post.

Announcement: x.com/arcinstitute/s…
Blog post: goodfire.ai/blog/interpret…
Preprint: arcinstitute.org/manuscripts/Ev…