Anthropic
Oct 5 · 11 tweets · 4 min read
The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
We hope this will eventually enable us to diagnose failure modes, design fixes, and certify that models are safe for adoption by enterprises and society. It's much easier to tell if something is safe if you can understand how it works!
Most neurons in language models are "polysemantic" – they respond to multiple unrelated things. For example, one neuron in a small language model activates strongly on academic citations, English dialogue, HTTP requests, Korean text, and others.

[Image: twenty short text snippets on which Neuron #83 fires, with each token shaded by how strongly the neuron activates; the snippets are labeled Korean, Japanese, citations, dialogue, and HTTP request code.]
Last year, we conjectured that polysemanticity is caused by "superposition" – models compressing many rare concepts into a small number of neurons. We also conjectured that "dictionary learning" might be able to undo superposition.

x.com/AnthropicAI/st…
Dictionary learning works! Using a "sparse autoencoder", we can extract features that represent purer concepts than neurons do. For example, turning ~500 neurons into ~4000 features uncovers things like DNA sequences, HTTP requests, and legal text.

📄 transformer-circuits.pub/2023/monoseman…
[Image: scatterplot of the 4,096 features learned by the autoencoder; most have consistent, interpretable responses to input data. Labeled examples include legal language, DNA sequences, HTTP requests and responses, nutrition statements, title-case sports competition names, comma-separated numbers (CSV), and funding acknowledgements.]
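A minimal sketch of what such a sparse autoencoder can look like (an illustration, not the paper's exact architecture, loss coefficients, or hyperparameters): a single overcomplete hidden layer that expands ~500 MLP activations into ~4,000 features, trained to reconstruct the activations with an L1 penalty so that only a few features fire on any given input.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 sparsity penalty on the hidden code.

    Illustrative sketch: d_model ~ 512 MLP activations expanded into
    n_features ~ 4096 dictionary features. Details (tied weights, bias
    handling, normalisation) differ in the actual paper.
    """
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # feature activations (sparse, non-negative)
        x_hat = self.decoder(f)           # reconstruction of the MLP activations
        return x_hat, f

def loss_fn(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    recon = ((x - x_hat) ** 2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity

# Usage: collect MLP activations from the language model over a large text
# corpus, then train the autoencoder on those activation vectors.
sae = SparseAutoencoder()
acts = torch.randn(64, 512)              # stand-in for a batch of real activations
x_hat, f = sae(acts)
loss = loss_fn(acts, x_hat, f)
loss.backward()
```

Each column of the decoder weight matrix then acts as a dictionary element: a direction in activation space corresponding to one learned feature.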
Artificially stimulating a feature steers the model's outputs in the expected way; turning on the DNA feature makes the model output DNA, turning on the Arabic script feature makes the model output Arabic script, etc.

📄 transformer-circuits.pub/2023/monoseman…
[Image: example outputs when individual features are pinned to a high value – e.g. a base64 feature produces "29VHA98Z1Y9Z1" and an uppercase feature produces "USING IN THE UNITED STATES".]
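Conceptually, "stimulating" a feature amounts to clamping its activation to a high value and letting the decoder map that back into the model's activation space during generation. A rough sketch of that intervention (the hook point, layer, and clamp value are illustrative assumptions, not the paper's exact procedure):

```python
import torch

def steer_with_feature(mlp_acts, sae, feature_idx, value=10.0):
    """Pin one dictionary feature to a fixed value and decode back
    into MLP-activation space.

    mlp_acts:    tensor of shape (seq_len, d_model) from the hooked layer
    sae:         a trained sparse autoencoder (encoder/decoder as above)
    feature_idx: which learned feature to stimulate
    value:       activation to clamp the feature to (illustrative scale)
    """
    f = torch.relu(sae.encoder(mlp_acts))   # current feature activations
    f[:, feature_idx] = value               # clamp the chosen feature high
    return sae.decoder(f)                   # substituted activations

# During sampling, replace the layer's MLP activations with the steered
# version (e.g. via a forward hook) and continue decoding; pinning a
# DNA-like feature should push the model toward DNA-looking output.
```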
Features connect in "finite-state automata"-like systems that implement complex behaviors. For example, we find features that work together to generate valid HTML.

📄 transformer-circuits.pub/2023/monoseman…
[Image: flow diagram of four closely connected features involved in modelling HTML, each shown with examples of where it fires. Features are connected if, empirically, one increases the probability of a token the next fires on: a feature firing on opening brackets of HTML tags predicts valid tag names, which in turn predict closing brackets.]
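One simplified way to read the "finite-state automaton" picture: feature A links to feature B if, when A is active at one position, B tends to be active at the next. The sketch below estimates such a transition graph from consecutive co-activation over a corpus; this is a proxy for the link definition above (which goes through the model's token probabilities), and the threshold and counting details are illustrative.

```python
import numpy as np

def feature_transition_matrix(feature_acts, threshold=1.0):
    """Estimate how often feature j is active right after feature i.

    feature_acts: array of shape (n_tokens, n_features) of feature
                  activations over a corpus, in document order.
    Returns an (n_features, n_features) matrix whose rows are empirical
    'transition probabilities' between features at consecutive positions.
    """
    active = feature_acts > threshold              # boolean (n_tokens, n_features)
    prev, nxt = active[:-1], active[1:]
    counts = prev.astype(np.float64).T @ nxt.astype(np.float64)
    row_totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_totals, 1.0)    # rows ~ outgoing transitions
```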
We also systematically show that the features we find are more interpretable than the neurons, using both a blinded human evaluator and a large language model (autointerpretability).

📄 transformer-circuits.pub/2023/monoseman…
[Image: two histograms comparing features (purple) and neurons (teal). Under "Manual Interpretability" (x-axis: rubric value), features cluster at rubric values of 10–14 while neurons spike at 0; under "Automated Interpretability – Activation" (x-axis: Spearman correlation), features are again shifted toward higher values.]
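The automated ("autointerpretability") score works roughly like this: ask an LLM to write a short explanation of a feature from examples where it fires, then have the LLM predict activations on held-out text and compare those predictions to the real activations. A sketch of the scoring step, using Spearman correlation as in the plot above (the `llm_predict_activations` helper is a hypothetical stand-in for the LLM call):

```python
from scipy.stats import spearmanr

def autointerp_score(explanation, tokens, true_acts, llm_predict_activations):
    """Score an explanation by how well an LLM can use it to predict
    the feature's activations on held-out tokens.

    explanation:              natural-language description of the feature
    tokens:                   list of held-out tokens
    true_acts:                actual feature activations on those tokens
    llm_predict_activations:  hypothetical helper that asks an LLM to guess
                              an activation for each token given the explanation
    """
    predicted = llm_predict_activations(explanation, tokens)
    corr, _ = spearmanr(predicted, true_acts)
    return corr   # higher = the explanation better predicts the feature's behaviour
```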
Our research builds on work on sparse coding and distributed representations in neuroscience, disentanglement and dictionary learning in machine learning, and compressed sensing in mathematics.
There's a lot more in the paper if you're interested, including universality, "feature splitting", more evidence for the superposition hypothesis, and tips for training a sparse autoencoder to better understand your own network!

📄transformer-circuits.pub/2023/monoseman…
If this work excites you, our interpretability team is hiring! See our job descriptions below.

Research Scientist, Interpretability: jobs.lever.co/Anthropic/3dbd…

Research Engineer, Interpretability: jobs.lever.co/Anthropic/33dc…

More from @AnthropicAI

Aug 8
Large language models have demonstrated a surprising range of skills and behaviors. How can we trace their source? In our new paper, we use influence functions to find training examples that contribute to a given model output.

📄 Studying Large Language Model Generalization using Influence Functions. Grosse, Bae, Anil, et al.
Influence functions are a classic technique from statistics. They are formulated as a counterfactual: if a copy of a given training sequence were added to the dataset, how would that change the trained parameters (and, by extension, the model’s outputs)?
Directly evaluating this counterfactual by re-training the model would be prohibitively expensive, so we’ve developed efficient algorithms that let us approximate influence functions for LLMs with up to 52 billion parameters: arxiv.org/abs/2308.03296
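For reference, the classic first-order form of this counterfactual (the paper scales it up with approximations such as EK-FAC rather than computing the Hessian exactly): upweighting a training sequence z by ε shifts the optimum θ*, and the induced change in a measurement f (for example, the log-probability of the completion of interest) is

```latex
\mathcal{I}_f(z)
  \;=\; \left.\frac{d\, f(\theta^\star_\epsilon)}{d\epsilon}\right|_{\epsilon=0}
  \;=\; -\,\nabla_\theta f(\theta^\star)^\top H^{-1}\, \nabla_\theta \mathcal{L}(z, \theta^\star),
\qquad
H \;=\; \nabla^2_\theta \, \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(z_i, \theta^\star).
```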
Jul 18
When language models "reason out loud," it's hard to know if their stated reasoning is faithful to the process the model actually used to make its prediction. In two new papers, we measure and improve the faithfulness of language models' stated reasoning.

📄 Measuring Faithfulness in Chain-of-Thought Reasoning. Lanham et al.
📄 Question Decomposition Improves the Faithfulness of Model-Generated Reasoning. Radhakrishnan et al.
We make edits to the model's chain of thought (CoT) reasoning to test hypotheses about how CoT reasoning may be unfaithful. For example, the model's final answer should change when we introduce a mistake during CoT generation.

[Image: the four interventions – early answering (truncate the CoT and force the model to answer early, testing whether it relies on all of its stated reasoning), adding mistakes (insert an error into one CoT step and regenerate the rest), paraphrasing (swap the CoT for a paraphrase and check whether the answer changes), and filler tokens (test whether the extra test-time computation of generating a CoT alone explains the effect).]
For some tasks, forcing the model to answer with only a truncated version of its chain of thought often causes it to come to a different answer, indicating that the CoT isn't just a rationalization. The same is true when we introduce mistakes into the CoT.

[Image: line plots of "% same answer as original CoT" across eight benchmark tasks. In the "Early Answering" plot (x-axis: % of CoT provided), AQuA changes answers most often, followed by LogiQA; OpenBookQA and ARC (Challenge) change least.]
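A sketch of the "early answering" measurement in that plot: truncate the chain of thought at increasing fractions, force an answer after each truncation, and record how often it matches the answer given with the full chain of thought (`answer_with_cot` is a hypothetical helper that queries the model with a fixed prompt format):

```python
def early_answering_curve(question, full_cot, answer_with_cot,
                          fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Agreement with the full-CoT answer as more of the CoT is provided.

    question:        the task input
    full_cot:        the model's complete chain-of-thought text
    answer_with_cot: hypothetical helper returning the model's final answer
                     given the question and a (possibly truncated) CoT
    """
    reference = answer_with_cot(question, full_cot)
    curve = {}
    for frac in fractions:
        cutoff = int(len(full_cot) * frac)
        truncated = full_cot[:cutoff]
        curve[frac] = (answer_with_cot(question, truncated) == reference)
    return curve   # averaged over many questions, this gives "% same answer as original CoT"
```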
Jul 11
Introducing Claude 2! Our latest model has improved performance in coding, math and reasoning. It can produce longer responses, and is available in a new public-facing beta website, claude.ai, in the US and UK.
Claude 2 has improved from our previous models on evaluations including Codex HumanEval, GSM8K, and MMLU. You can see the full suite of evaluations in our model card: www-files.anthropic.com/production/ima…
Although no model is immune to jailbreaks, we’ve used Constitutional AI and automated red-teaming to make Claude 2 more harmless and harder to prompt to produce offensive or dangerous output.
Jun 29
We develop a method to test which global opinions are represented in language models. We find the opinions represented by the models are most similar to those of participants in the USA, Canada, and some European countries. In separate experiments, we also show that the responses are steerable.
We administer these survey questions to our model and compare its responses to those of human participants across different countries. We release our evaluation dataset at: huggingface.co/datasets/Anthr…
We present an interactive visualization of the similarity results on a map, to explore how prompt-based interventions influence whose opinions the models are most similar to: llmglobalvalues.anthropic.com
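The country-by-country comparison boils down to a similarity between two answer distributions per question: the model's distribution over the multiple-choice options and a country's aggregate human distribution. A sketch of one reasonable similarity measure (1 minus Jensen-Shannon distance, averaged over questions; the exact metric and aggregation used in the paper may differ):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def opinion_similarity(model_dists, country_dists):
    """Average similarity between model and human answer distributions.

    model_dists, country_dists: lists of probability vectors, one per
    survey question, over the same multiple-choice options.
    """
    sims = []
    for p, q in zip(model_dists, country_dists):
        sims.append(1.0 - jensenshannon(p, q, base=2))  # 1 = identical distributions
    return float(np.mean(sims))
```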
Jun 22
We collaborated with @compdem to research the opportunities and risks of augmenting the Pol.is platform with language models (LMs) to facilitate open and constructive dialogue between people with diverse viewpoints.
We analyzed a 2018 conversation run in Bowling Green, Kentucky, when the city was deeply divided on national issues. @compdem, academics, local media, and expert facilitators used Pol.is to identify areas of consensus. compdemocracy.org/Case-studies/2…
We find evidence that LMs have promising potential to help human facilitators and moderators synthesize the outcomes of online digital town halls—a role that requires significant expertise in quantitative & qualitative data analysis, the topic of debate, and writing skills.
May 11
Introducing 100K Context Windows! We’ve expanded Claude’s context window to 100,000 tokens of text, corresponding to around 75K words. Submit hundreds of pages of materials for Claude to digest and analyze. Conversations with Claude can go on for hours or days.
We fed Claude-Instant The Great Gatsby (72K tokens), except we modified one line to say that Mr. Carraway was "a software engineer that works on machine learning tooling at Anthropic." We asked the model to spot what was added - it responded with the right answer in 22 seconds.
Claude can help retrieve information from business documents. Drop multiple documents or even a book into the prompt and ask Claude questions that require synthesis of knowledge across many parts of the text.
