Introducing Silico: the platform for building AI models with the precision of written software.
Silico lets researchers and engineers see inside their models, debug failures, and intentionally design them from the ground up.
Early access is open now. 🧵(1/10)
We’ve used interpretability to discover a novel class of Alzheimer’s biomarkers, teach a language model to correct its own hallucinations, and diagnose performance bottlenecks in a robotics model.
Silico brings those frontier techniques to everyone. (2/10)
Silico introduces our model neuroscientist: an autonomous agent that plans and runs concurrent experiments on your model.
It works with your team in our model design environment, where you can organize research threads, replicate and extend papers, and collaborate on findings.
Here are 5 things you can do with Silico:
(3/10)
See inside your model.
Decompose your model into interpretable features and tell the difference between real understanding and spurious correlation. (4/10)
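A standard tool for this kind of decomposition is a sparse autoencoder (SAE) trained on a layer's activations. A minimal sketch (illustrative only; not Silico's implementation):

```python
import torch
import torch.nn as nn

class SAE(nn.Module):
    def __init__(self, d_model, d_feats):
        super().__init__()
        self.enc = nn.Linear(d_model, d_feats)
        self.dec = nn.Linear(d_feats, d_model)

    def forward(self, acts):
        feats = torch.relu(self.enc(acts))   # sparse, interpretable features
        recon = self.dec(feats)
        # L1 penalty encourages each activation to be explained by few features.
        loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()
        return feats, loss
```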
Check your model's health.
Run comprehensive diagnostics on your model's internal representations to catch issues like undertraining, information bottlenecks, and feature collapse before they impact downstream performance. (5/10)
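As one example of what such a diagnostic can look like (a hedged sketch, not Silico's actual checks): the effective rank of a layer's activations is a quick signal for feature collapse or an information bottleneck.

```python
import torch

def effective_rank(acts):
    # acts: (n_samples, d_model) hidden states from one layer.
    acts = acts - acts.mean(dim=0)
    s = torch.linalg.svdvals(acts)
    p = s / s.sum()                                    # normalized spectrum
    return torch.exp(-(p * (p + 1e-12).log()).sum())   # exp of spectral entropy

# An effective rank far below d_model across many layers can signal
# collapsed or bottlenecked representations.
```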
Debug failures.
Precisely debug issues with model behavior, identify and remove confounders, and diagnose failures before they occur in production. (6/10)
Shape model behavior.
Use internal features to extract stronger predictors, steer generation, and target forms of generalization that standard training can't reach. (7/10)
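Activation steering is one common mechanism behind this kind of control. A minimal sketch with PyTorch hooks (illustrative only; `model.layers[12]` and the direction `d` are hypothetical):

```python
import torch

def steering_hook(direction, alpha):
    # Adds alpha * direction to a module's output (assumes the module
    # returns a plain tensor; transformer blocks often return tuples).
    def hook(module, inputs, output):
        return output + alpha * direction
    return hook

# Hypothetical usage: shift layer 12's residual stream along a feature
# direction found via interpretability.
# handle = model.layers[12].register_forward_hook(steering_hook(d, 4.0))
```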
Generalize from less data.
Target the specific learned structures driving behavior - and shift the training distribution, objective, or architecture to generalize further with the same or less data. (8/10)
MIT Tech Review’s @strwbilly spoke with our CEO/co-founder @ericho_goodfire about Silico and what it means for model builders: technologyreview.com/2026/04/30/113… (9/10)
We achieved state-of-the-art performance in predicting which of 4.2 million genetic variants cause disease by interpreting a genomics model, in a new preprint with @MayoClinic.
We're now releasing an open-source database of our predictions for all variants in the NIH's ClinVar database. 🧵(1/8)
The core challenge of genomic medicine is figuring out what genetic variants actually do - so we can diagnose and treat resulting diseases. But many of the millions of variants in clinical databases are still “variants of uncertain significance” (VUS). (2/8)
By training our new covariance probes on @arcinstitute’s Evo 2 - a genomic foundation model trained on massive DNA data - we achieve state-of-the-art prediction of whether variants cause disease, with strong generalization across variant types. (3/8)
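The covariance probes themselves aren't detailed in this thread; as a generic illustration of the probing setup, here's a sketch that scores variants from pooled embedding differences (all names - `evo2.embed`, `variant_pairs`, `pathogenic_labels` - are hypothetical placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def variant_features(seq_ref, seq_alt, embed):
    # Mean-pool embeddings and use the ref/alt difference as the feature.
    return embed(seq_alt).mean(axis=0) - embed(seq_ref).mean(axis=0)

X = np.stack([variant_features(r, a, evo2.embed) for r, a in variant_pairs])
clf = LogisticRegression(max_iter=1000).fit(X, pathogenic_labels)
```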
We've identified a novel class of biomarkers for Alzheimer's detection - using interpretability - with @PrimaMente.
How we did it, and how interpretability can power scientific discovery in the age of digital biology: (1/6)
Bio foundation models (e.g. AlphaFold) can achieve superhuman performance, so they must contain novel scientific knowledge. @PrimaMente's Pleiades epigenetics model is one such case - it's SOTA on early Alzheimer's detection.
But that knowledge is locked inside a black box. (2/6)
Interpretability is the key to unlocking that knowledge - extracting what the Pleiades model knows about epigenetics and Alzheimer's (or anything else!).
It's the missing step between black-box predictive power and true scientific understanding. (3/6)
LLMs memorize a lot of training data, but memorization is poorly understood.
Where does it live inside models? How is it stored? How much is it involved in different tasks?
@jack_merullo_ & @srihita_raju's new paper examines all of these questions using loss curvature! (1/7)
The method is like PCA, but for loss curvature instead of variance: it decomposes weight matrices into components ordered by curvature, and removes the long tail of low-curvature ones.
What's left are the weights that most affect loss across the training set. (2/7)
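One standard way to approximate loss curvature layer by layer is a K-FAC-style Kronecker factorization. A minimal sketch of the keep-the-high-curvature-components idea for a single weight matrix (simplified; not the paper's exact procedure):

```python
import torch

def keep_high_curvature(W, acts, grads, k):
    # W: (d_out, d_in) weight matrix.
    # acts: (n, d_in) layer inputs; grads: (n, d_out) grads w.r.t. outputs.
    A = acts.T @ acts / len(acts)        # K-FAC input factor
    G = grads.T @ grads / len(grads)     # K-FAC output-gradient factor
    eva, Ua = torch.linalg.eigh(A)
    evg, Ug = torch.linalg.eigh(G)
    C = Ug.T @ W @ Ua                    # W in the curvature eigenbasis
    curv = evg[:, None] * eva[None, :]   # curvature of each component
    thresh = curv.flatten().topk(k).values[-1]
    C = torch.where(curv >= thresh, C, torch.zeros_like(C))  # drop the tail
    return Ug @ C @ Ua.T                 # keep only high-curvature weights
```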
Why use LLM-as-a-judge when you can get the same performance 15–500x cheaper?
Our new research with @RakutenGroup on PII detection finds that SAE probes:
- transfer from synthetic to real data better than normal probes
- match GPT-5 Mini performance at 1/15 the cost
(1/6)
PII detection in production AI systems requires methods that are lightweight, have high recall, and perform well after training only on synthetic data (you can't train on customer PII!).
These constraints rule out many approaches. (2/6)
But probes on a small sidecar model are perfect candidates, and we tested several probe variants trained on synthetic data.
Surprisingly, a random forest SAE probe performs the best by far on prod test data - leading Rakuten to deploy it to their agent platform. (3/6)
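A hedged sketch of the winning setup, assuming you already have a small sidecar model and an SAE for it (`model.get_activations`, `sae.encode`, and the datasets are hypothetical placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sae_features(texts, model, sae):
    # Mean-pool the SAE's sparse feature activations over tokens.
    return np.stack(
        [sae.encode(model.get_activations(t)).mean(axis=0) for t in texts]
    )

# Train on synthetic PII data; evaluate on production-style data.
clf = RandomForestClassifier(n_estimators=300)
clf.fit(sae_features(synthetic_texts, model, sae), synthetic_labels)
pii_scores = clf.predict_proba(sae_features(prod_texts, model, sae))[:, 1]
```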
(1/7) New research: how can we understand how an AI model actually works? Our method, SPD, decomposes the *parameters* of neural networks, rather than their activations - akin to understanding a program by reverse-engineering the source code vs. inspecting runtime behavior.
(2/7) Most interp methods study activations. But we want causal, mechanistic stories: "Activations tell you a lot about the dataset, but not as much about the computations themselves. … they’re shadows cast by the computations."
Parameters are where computations actually live.
(3/7) Our new method, Stochastic Parameter Decomposition (SPD), breaks down a model’s parameters - finding minimal, simple subcomponents that faithfully recreate the original model's behavior.
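A heavily simplified sketch of the core idea for one linear layer (illustrative only; the real method uses input-dependent importances and additional losses):

```python
import torch

d_out, d_in, C = 64, 32, 128                  # C rank-one subcomponents
U = torch.nn.Parameter(0.1 * torch.randn(C, d_out))
V = torch.nn.Parameter(0.1 * torch.randn(C, d_in))
g = torch.nn.Parameter(torch.zeros(C))        # causal-importance logits

def spd_loss(x, W_orig):
    W_full = torch.einsum("co,ci->oi", U, V)  # sum of all subcomponents
    faithful = (W_full - W_orig).pow(2).sum() # must recreate original weights
    imp = torch.sigmoid(g)
    # Sample masks uniformly in [importance, 1]: a subcomponent can only be
    # ablated to the extent it is unimportant.
    m = imp + (1 - imp) * torch.rand(C)
    W_masked = torch.einsum("c,co,ci->oi", m, U, V)
    match = ((x @ W_masked.T) - (x @ W_orig.T)).pow(2).mean()
    return faithful + match + 1e-3 * imp.sum()  # sparsity on importances
```

Sampling masks between each importance value and 1 means ablating a subcomponent is penalized only when the model relies on it, which pushes the decomposition toward minimal, simple pieces.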