Wes Gurnee
Oct 21 · 15 tweets · 6 min read
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
The task is simply deciding when to break a line in fixed-width text. This requires the model to learn the line-width constraint in context, track the characters in the current line, compute the characters remaining, and determine whether the next word fits!
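To make the task concrete, here is a minimal Python sketch of the computation the model has to perform implicitly (the function and variable names are mine, not the paper’s):

```python
def should_break(current_line_chars: int, line_width: int, next_word: str) -> bool:
    """Decide whether to emit a newline before the next word.

    current_line_chars: characters already placed on the current line
    line_width: the fixed-width constraint, inferred in context from earlier lines
    next_word: the word about to be produced
    """
    chars_remaining = line_width - current_line_chars
    # Account for the space that would precede the word on a non-empty line.
    chars_needed = len(next_word) + (1 if current_line_chars > 0 else 0)
    return chars_needed > chars_remaining


# Example: 5 characters remain on the line, but " hello" needs 6.
print(should_break(current_line_chars=35, line_width=40, next_word="hello"))  # True
```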
To get oriented, we made an attribution graph, which surfaces features for each of the important parts of the computation: different counting features for each type of count, the natural next word, and a feature for detecting that the next word exceeds the characters remaining.
Looking at a broader dataset, we see larger feature families, like character-counting features for the current character position in a line! These features are universal across dictionaries and resemble place cells in brains and curve detectors in vision networks.
These counting features discretize a continuous counting manifold. The “rippled” representations are optimal with respect to balancing capacity and resolution, and have elegant connections to Fourier features and space-filling curves.
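As a rough illustration of what discretizing a count with a family of features can look like (a toy construction of mine, not the paper’s), here is a bank of bump-like tuning curves tiling a character count, so that the population activity traces out a one-dimensional manifold embedded in feature space:

```python
import numpy as np

# Toy model: a character count c in [0, 100] is represented by a bank of bump
# features, each tuned to a preferred count. Widths and counts are arbitrary.
preferred_counts = np.linspace(0, 100, 16)   # 16 hypothetical counting features
width = 8.0

def count_features(c: float) -> np.ndarray:
    """Activation of each counting feature for a given character count."""
    return np.exp(-0.5 * ((c - preferred_counts) / width) ** 2)

# Sweep the count: each row is the population activity for one count value.
activity = np.stack([count_features(c) for c in range(101)])

# Nearby counts have similar activity patterns; distant counts are nearly
# orthogonal -- the population traces a curved 1-D manifold in 16 dimensions.
print(activity.shape)                                # (101, 16)
print(round(float(activity[30] @ activity[31]), 3))  # large overlap for neighbors
print(round(float(activity[30] @ activity[90]), 3))  # ~0 for far-apart counts
```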
How are these representations used? The character-count manifold and line-width manifold are aligned in the residual stream, but get “twisted” by an attention head so that it starts attending to the previous newline just *before* the end of the current line.
Multiple “boundary heads” twist the manifolds with different offsets. The collective action of these heads computes the number of characters remaining in the line with high precision, loosely akin to stereoscopic vision.
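A cartoon of the multiple-offsets idea (my toy construction, not the actual circuit): several detectors each fire once the count comes within a different fixed distance of the line width, and the pattern of which detectors have fired brackets the characters remaining more tightly than any single detector could.

```python
# Toy sketch: each "boundary detector" fires once the character count is within
# its offset of the line width. The subset of detectors that have fired brackets
# the characters remaining, loosely like combining several coarse rangefinders.
offsets = [2, 4, 8, 16]   # hypothetical per-detector offsets

def boundary_signals(char_count: int, line_width: int) -> list[bool]:
    remaining = line_width - char_count
    return [remaining <= off for off in offsets]

for char_count in (20, 34, 37, 39):
    print(char_count, boundary_signals(char_count, line_width=40))
# 20 -> [False, False, False, False]  far from the boundary
# 34 -> [False, False, True,  True ]  6 chars left: between offsets 4 and 8
# 37 -> [False, True,  True,  True ]  3 chars left: between offsets 2 and 4
# 39 -> [True,  True,  True,  True ]  1 char left
```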
At the end of the model, the characters-remaining path and the next-word path merge to determine whether the word will fit.
In addition to boundary detector features (which activate irrespective of the next token), there are families of break prediction and break suppressor features which activate depending on whether the next word will fit!
How is this implemented? The model arranges the count of characters remaining and the count of characters in the next word in orthogonal subspaces. This makes the decision to break linearly separable!
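A minimal sketch of why orthogonal subspaces make the decision linear (my illustration, with invented dimensions): if characters remaining and next-word length are written along orthogonal directions of the residual stream, a single linear readout recovers their difference, and the break decision is just the sign of that readout.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64   # toy residual-stream dimension

# Two orthonormal directions: one carries "characters remaining on the line",
# the other carries "length of the next word".
u, v = np.linalg.qr(rng.normal(size=(d, 2)))[0].T

def residual(chars_remaining: float, next_word_len: float) -> np.ndarray:
    return chars_remaining * u + next_word_len * v

# A linear readout that computes (next_word_len - chars_remaining):
# break exactly when the score is positive.
readout = v - u

for rem, wlen in [(10, 4), (5, 5), (3, 7)]:
    score = residual(rem, wlen) @ readout
    print(rem, wlen, "break" if score > 0 else "no break")
# 10 4 no break / 5 5 no break / 3 7 break
```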
But how did the model compute the character count in the first place? There is a distributed attention algorithm, spanning at least 11 heads over 2 layers, that accumulates the count. Each head “specializes” to contribute a particular part of the count.
Individually, each head implements a heuristic that converts a token count into a character count, followed by a length adjustment. As in boundary detection, heads use variable-size attention sinks to tile the space, with matching OV weights.
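A toy sketch of the token-count-to-character-count heuristic (the splitting, average length, and adjustment here are all invented for illustration): each head handles a slice of the line’s tokens, estimates that slice as token count times an assumed average token length, then applies a per-token length adjustment; the head contributions sum to the exact count.

```python
# Toy sketch: the line is split among several "heads", each attending to its
# own slice of tokens. Each head estimates its slice as (token count) * (assumed
# average token length), then corrects for how much each token deviates from
# that average. The contributions sum to the exact character count so far.
AVG_LEN = 5   # assumed average token length, in characters

tokens = ["The", " quick", " brown", " fox", " jumps", " over"]
slices = [tokens[0:2], tokens[2:4], tokens[4:6]]   # 3 hypothetical heads

total = 0
for i, sl in enumerate(slices):
    rough = len(sl) * AVG_LEN                        # token count -> characters
    adjustment = sum(len(t) - AVG_LEN for t in sl)   # per-token length correction
    contribution = rough + adjustment
    total += contribution
    print(f"head {i}: rough={rough}, adjustment={adjustment:+d}, contributes {contribution}")

print("distributed total:", total)                        # 30
print("true char count:  ", sum(len(t) for t in tokens))  # 30
```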
We learned a lot from this project: empirical evidence of previously hypothesized rippled manifolds, the duality of features and geometry, and how features sometimes incur a “complexity tax”. We would be excited to see deep case studies of other naturalistic tasks!
One of the hardest parts of finishing this paper was deciding what to call it! The memes really write themselves… We included some of our favorite rejected titles in the appendix.
For full details, you can read the paper here: transformer-circuits.pub/2025/linebreak…

Amazing to work with @mlpowered, @ikauvar, @tarngerine, @adamrpearce, @ch402, @thebasepoint!

• • •

More from @wesg52

Oct 4, 2023
Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales?

In a new paper with @tegmark we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!
For spatial representations, we run Llama-2 models on the names of tens of thousands of cities, structures, and natural landmarks around the world, the USA, and NYC. We then train linear probes on the last-token activations to predict the real latitude and longitude of each place.
For temporal representations, we run the models on the names of famous figures from the past 3000 years, the names of songs, movies, and books from 1950 onward, and NYT headlines from the 2010s, and train linear probes to predict the year of death, release date, and publication date.
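A minimal sketch of the linear-probe setup (synthetic data stands in for everything here; in the paper, X would be last-token activations for each place name and Y its true latitude and longitude):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Stand-ins: X plays the role of last-token activations for each place name,
# Y the role of its true (latitude, longitude). Both are synthetic here.
rng = np.random.default_rng(0)
n, d = 5000, 512
true_proj = rng.normal(size=(d, 2))
X = rng.normal(size=(n, d))
Y = X @ true_proj + rng.normal(scale=0.1, size=(n, 2))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# The probe is just a regularized linear regression from activations to coordinates.
probe = Ridge(alpha=1.0).fit(X_tr, Y_tr)
print("held-out R^2:", probe.score(X_te, Y_te))
```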
May 3, 2023
Neural nets are often thought of as feature extractors. But what features are neurons in LLMs actually extracting? In our new paper, we leverage sparse probing to find out: arxiv.org/abs/2305.01610. A 🧵:
One large family of neurons we find is “context” neurons, which activate only for tokens in a particular context (French, Python code, US patent documents, etc.). When these neurons are deleted, the loss increases in the relevant context while other contexts are left unaffected!
But what if there are more features than there are neurons? This results in polysemantic neurons, which fire for a large set of unrelated features. Here we show a single early-layer neuron that activates for a large collection of unrelated n-grams.
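A rough sketch of sparse probing for context neurons (synthetic data, not the paper’s exact recipe): fit an L1-regularized classifier from neuron activations to a binary context label, and read off the few neurons left with nonzero weight.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins: rows are tokens, columns are neuron activations; the label says
# whether each token came from a particular context (e.g. French text).
rng = np.random.default_rng(0)
n_tokens, n_neurons = 4000, 2000
acts = rng.normal(size=(n_tokens, n_neurons))
labels = rng.integers(0, 2, size=n_tokens)

# Plant a few synthetic "context neurons" that fire more in the target context.
context_neurons = [7, 123, 900]
acts[np.ix_(labels == 1, context_neurons)] += 3.0

# L1-regularized probe: the penalty drives most neuron weights to exactly zero,
# so the surviving nonzero weights name the candidate context neurons.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
probe.fit(acts, labels)

selected = np.nonzero(probe.coef_[0])[0]
print("neurons selected by the sparse probe:", selected)  # should include 7, 123, 900
```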