PhD student at the University of Amsterdam / ILLC, interested in computational linguistics and (mechanistic) interpretability. Current Anthropic Fellow.
May 29 • 11 tweets • 4 min read
@mntssys and I are excited to announce circuit-tracer, a library that makes circuit-finding simple!
Just type in a sentence, and get out a circuit showing (some of) the features your model uses to predict the next token. Try it on @neuronpedia: shorturl.at/SUX2A x.com/AnthropicAI/st…
Circuit-tracer works by taking in a model and a set of transcoders, which break down the model's internal activations into interpretable features.
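To make "breaking activations down into features" concrete, here is a toy sketch of the encoding step. It is not circuit-tracer's API: the `encode` helper, the ReLU encoder, and all weights are made up for illustration, and real transcoders are trained and far larger.

```python
# Toy sketch only: hypothetical weights and helper names, not circuit-tracer's API.

def relu(x):
    return max(0.0, x)

def encode(activation, enc_weights, enc_bias):
    """Map a dense activation vector to non-negative feature activations."""
    return [
        relu(sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]

# A made-up 3-dimensional activation and a made-up 3-feature encoder.
activation = [1.0, -2.0, 0.5]
enc_weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 1.0],
]
enc_bias = [0.0, 0.0, -0.5]

features = encode(activation, enc_weights, enc_bias)
print(features)  # only the first feature is active for this input
```

Most features end up at zero for any given input, which is what makes the few active ones readable.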
We determine the meaning of each feature by looking at the text that makes it activate most strongly - check out the Texas feature below!
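A minimal sketch of that labeling step, assuming we already have one feature's activation on a handful of texts. The texts and numbers are invented for illustration, and `top_activating` is a hypothetical helper, not a library function.

```python
# Toy sketch only: invented texts and activation values for a single feature.

def top_activating(examples, k=2):
    """Return the k texts on which the feature activates most strongly."""
    ranked = sorted(examples, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:k]]

# (text, activation of the feature on that text)
examples = [
    ("Austin is the capital of", 9.1),
    ("The weather in Paris", 0.2),
    ("Dallas Cowboys fans", 7.4),
    ("I like pasta", 0.0),
]

top = top_activating(examples)
print(top)
```

Reading the top-ranked texts side by side is what suggests a label such as "Texas" for the feature.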