Wes Gurnee
Trying to read Claude’s mind. Interpretability at @AnthropicAI. Prev: Optimizer @MIT, Byte-counter @Google
Oct 21 · 15 tweets · 6 min read
New paper! We reverse-engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!

The task is simply deciding when to break a line in fixed-width text. This requires the model to in-context learn the line-width constraint, state-track the characters in the current line, compute the characters remaining, and determine whether the next word fits!
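To make the task concrete, here is a minimal sketch of the linebreaking problem itself (not Haiku’s internal circuit): a greedy fixed-width wrapper that tracks the characters used on the current line, computes the characters remaining, and checks whether the next word fits. The function and parameter names are illustrative, not from the paper.

```python
# Sketch of the fixed-width linebreaking task (illustrative, not the paper's code):
# before emitting each word, decide whether it still fits on the current line
# or whether a newline is needed.

def break_lines(words, line_width):
    """Greedy fixed-width linebreaking: start a new line when the next word won't fit."""
    lines, current = [], ""
    for word in words:
        # characters the word would occupy, including a separating space if needed
        needed = len(word) + (1 if current else 0)
        remaining = line_width - len(current)  # characters left on the current line
        if needed <= remaining:
            current = f"{current} {word}" if current else word
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return "\n".join(lines)

print(break_lines("the quick brown fox jumps over the lazy dog".split(), 15))
```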
Oct 4, 2023 · 10 tweets · 4 min read
Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales?

In a new paper with @tegmark, we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!

For spatial representations, we run Llama-2 models on the names of tens of thousands of cities, structures, and natural landmarks around the world, the USA, and NYC. We then train linear probes on the last-token activations to predict the real latitude and longitude of each place.
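As a rough illustration of the probing setup, the sketch below fits a single linear map from last-token activations to (latitude, longitude). The arrays here are random stand-ins for the real activations and coordinates, and details such as the regressor and regularization are assumptions rather than the paper’s exact recipe.

```python
# Hypothetical sketch of a linear probe for spatial coordinates, assuming
# `activations` holds last-token activations for each place name (n_places x d_model)
# and `coords` holds the true (latitude, longitude) pairs.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(5000, 4096))                    # stand-in for real model activations
coords = rng.uniform([-90, -180], [90, 180], size=(5000, 2))   # stand-in (lat, long) targets

X_tr, X_te, y_tr, y_te = train_test_split(activations, coords, test_size=0.2, random_state=0)

probe = Ridge(alpha=1.0)   # one linear map from activation space to (lat, long)
probe.fit(X_tr, y_tr)
print("held-out R^2:", probe.score(X_te, y_te))
```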
May 3, 2023 · 10 tweets · 6 min read
Neural nets are often thought of as feature extractors. But what features are neurons in LLMs actually extracting? In our new paper, we leverage sparse probing to find out: arxiv.org/abs/2305.01610. A 🧵:

One large family of neurons we find is “context” neurons, which activate only for tokens in a particular context (French, Python code, US patent documents, etc.). When these neurons are deleted, the loss increases in the relevant context while other contexts are left unaffected!
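For intuition, here is a hedged sketch of a k-sparse probe for one such context (e.g. French text): rank neurons by how differently they activate inside versus outside the context, then fit a classifier that only sees the top-k neurons. The data, selection criterion, and value of k are illustrative stand-ins, not the paper’s exact method.

```python
# Sketch of sparse probing for a "French text" context, assuming `acts` holds
# per-token neuron activations (n_tokens x n_neurons) and `is_french` labels
# whether each token sits in French text. Stand-in data, illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(10000, 2048))        # stand-in neuron activations
is_french = rng.integers(0, 2, size=10000)   # stand-in context labels
k = 5                                        # probe is restricted to k neurons

# rank neurons by mean activation difference between the two contexts
diff = acts[is_french == 1].mean(0) - acts[is_french == 0].mean(0)
top_k = np.argsort(np.abs(diff))[-k:]

# fit a probe that only sees those k neurons; high accuracy with a tiny k
# suggests individual neurons dedicated to the context
probe = LogisticRegression(max_iter=1000).fit(acts[:, top_k], is_french)
print("top-k neurons:", top_k, "train acc:", probe.score(acts[:, top_k], is_french))
```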