Jack Merullo
Interpretability @GoodfireAI; was a PhD @BrownUniversity
Nov 6, 2025 12 tweets 4 min read
How is memorized data stored in a model? We disentangle MLP weights in LMs and ViTs into rank-1 components based on their curvature in the loss, and find representational signatures of both generalizing structure and memorized training data.

The paper is here: arxiv.org/abs/2510.24256. Our weight decomposition lets us suppress memorized data and find regularities in weights that connect to tasks like logical reasoning, fact retrieval, and math.
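For intuition, here is a rough illustration of the idea in code. This is a minimal sketch, not the paper's actual method: it splits a toy weight matrix into rank-1 components with a plain SVD, scores each component by the directional curvature d^T H d of a toy loss via a Hessian-vector product, and subtracts the sharpest components (the ones the thread associates with memorized data). The toy loss, the dimensions, and the SVD-based split are all assumptions.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins for an MLP weight and a training loss (the paper works on real LM/ViT MLPs).
W = torch.randn(16, 16, requires_grad=True)
X, Y = torch.randn(64, 16), torch.randn(64, 16)
loss_fn = lambda W: ((X @ W.T - Y) ** 2).mean()

# 1) Split W into rank-1 components. Plain SVD here; the paper's decomposition
#    is curvature-based, so treat this as an assumed stand-in.
U, S, Vh = torch.linalg.svd(W.detach(), full_matrices=False)
components = [S[i] * torch.outer(U[:, i], Vh[i]) for i in range(S.numel())]

# 2) Score each component by the curvature of the loss along its direction,
#    d^T H d, computed with a Hessian-vector product (double backward).
def directional_curvature(d):
    d = d / d.norm()
    loss = loss_fn(W)
    (g,) = torch.autograd.grad(loss, W, create_graph=True)
    (hvp,) = torch.autograd.grad((g * d).sum(), W)
    return (hvp * d).sum().item()

scores = [directional_curvature(c) for c in components]

# 3) Suppress the sharpest components -- the candidates for memorized data --
#    by removing them from the weight.
k = 2
sharpest = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
W_edited = W.detach() - sum(components[i] for i in sharpest)
```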
Aug 8, 2025 9 tweets 3 min read
Could we tell if gpt-oss was memorizing its training data? I.e., points where it's reasoning vs. reciting? We took a quick look at the curvature of the loss landscape of the 20B model to understand memorization and what's happening internally during reasoning.

The curvature of the loss w.r.t. an input embedding tells you how fast the loss would move as you changed the input. "Sharp" areas where the loss changes quickly suggest that the input may be memorized.
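One way to estimate that curvature in code, as a minimal sketch: a HuggingFace causal LM (gpt2 as a small stand-in for the 20B gpt-oss model) and a Hutchinson estimate of the Hessian trace of the loss with respect to the input embeddings. The thread's exact curvature measure may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in model; the thread looks at the 20B gpt-oss model.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "The quick brown fox jumps over the lazy dog"
ids = tok(text, return_tensors="pt").input_ids
emb = model.get_input_embeddings()(ids).detach().requires_grad_(True)

# Compute the loss from the input embeddings so we can differentiate w.r.t. them.
loss = model(inputs_embeds=emb, labels=ids).loss
(g,) = torch.autograd.grad(loss, emb, create_graph=True)

# Hutchinson estimate of the Hessian trace: average of v^T H v over random +/-1
# probes. Large values = sharp loss around this input, a possible sign of
# memorization (assumed proxy for the curvature measure in the thread).
trace_est, n_probes = 0.0, 8
for _ in range(n_probes):
    v = torch.randint_like(emb, high=2) * 2 - 1          # Rademacher probe
    (hv,) = torch.autograd.grad((g * v).sum(), emb, retain_graph=True)
    trace_est += (hv * v).sum().item() / n_probes

print(f"sharpness proxy (estimated Hessian trace): {trace_est:.3f}")
```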
Oct 3, 2022 4 tweets 2 min read
Are language models (LMs) good models of the visual world? We show that without explicit grounding, LMs can directly use linear projections of image representations as soft prompts for vision-language (VL) tasks. This can be done without tuning the LM or the image encoder!

But we also find that performance on VL tasks depends on how much linguistic supervision the image encoder received during pretraining. E.g., CLIP is pretrained with full language descriptions, NF-ResNet is trained with lexical category information (ImageNet labels), and BEiT is vision-only.
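A minimal sketch of the mechanism, with illustrative dimensions and names rather than the paper's exact code: frozen image-encoder features are mapped by a single trainable linear layer into the frozen LM's embedding space and prepended as soft-prompt tokens.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions): CLIP-like image features and a GPT-2-sized LM.
img_dim, lm_dim, n_prompt_tokens = 512, 768, 4

class LinearMapper(nn.Module):
    def __init__(self):
        super().__init__()
        # Only this projection is trained; the LM and image encoder stay frozen.
        self.proj = nn.Linear(img_dim, lm_dim * n_prompt_tokens)

    def forward(self, image_features):                 # (batch, img_dim)
        out = self.proj(image_features)
        return out.view(-1, n_prompt_tokens, lm_dim)   # (batch, n_prompt_tokens, lm_dim)

mapper = LinearMapper()
image_features = torch.randn(2, img_dim)               # e.g. frozen image-encoder outputs
soft_prompt = mapper(image_features)                   # projected "visual" prompt tokens
text_embeds = torch.randn(2, 10, lm_dim)               # embedded caption/question tokens
lm_inputs = torch.cat([soft_prompt, text_embeds], dim=1)
# lm_inputs would then be fed to the frozen LM (e.g. via inputs_embeds=...).
```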