The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
We hope this will eventually enable us to diagnose failure modes, design fixes, and certify that models are safe for adoption by enterprises and society. It's much easier to tell if something is safe if you can understand how it works!
Most neurons in language models are "polysemantic" – they respond to multiple unrelated things. For example, one neuron in a small language model activates strongly on academic citations, English dialogue, HTTP requests, Korean text, and more.
Last year, we conjectured that polysemanticity is caused by "superposition" – models compressing many rare concepts into a small number of neurons. We also conjectured that "dictionary learning" might be able to undo superposition.
Dictionary learning works! Using a "sparse autoencoder", we can extract features that represent purer concepts than neurons do. For example, turning ~500 neurons into ~4000 features uncovers things like DNA sequences, HTTP requests, and legal text.
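To make the idea concrete, here is a minimal sketch of a sparse autoencoder for dictionary learning, assuming PyTorch, a 512-dimensional activation vector, and a 4096-feature dictionary. The exact architecture, loss weighting, and training details in the paper differ; this only illustrates the reconstruction-plus-sparsity objective.

```python
# Minimal sparse autoencoder sketch: reconstruct activations through an
# overcomplete, sparsity-penalized feature basis. Dimensions and the L1
# coefficient are illustrative assumptions, not the paper's exact values.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_dict=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activations):
        # Feature activations are non-negative and (ideally) sparse.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(activations, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```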
Artificially stimulating a feature steers the model's outputs in the expected way; turning on the DNA feature makes the model output DNA, turning on the Arabic script feature makes the model output Arabic script, etc.
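One rough way to picture "artificially stimulating a feature": add a scaled copy of that feature's decoder direction to the layer's activations during generation. The feature index and scale below are hypothetical, and this is a sketch of the general idea rather than the paper's exact intervention.

```python
# Sketch of feature steering: push the activations along one dictionary
# feature's decoder direction. `sae` is the SparseAutoencoder above;
# `feature_idx` and `scale` are hypothetical choices.
def steer_activations(activations, sae, feature_idx, scale=5.0):
    direction = sae.decoder.weight[:, feature_idx]  # shape: (d_model,)
    return activations + scale * direction
```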
Features connect in "finite-state automata"-like systems that implement complex behaviors. For example, we find features that work together to generate valid HTML.
We also systematically show that the features we find are more interpretable than the neurons, using both a blinded human evaluator and a large language model (autointerpretability).
Our research builds on work on sparse coding and distributed representations in neuroscience, disentanglement and dictionary learning in machine learning, and compressed sensing in mathematics.
There's a lot more in the paper if you're interested, including universality, "feature splitting", more evidence for the superposition hypothesis, and tips for training a sparse autoencoder to better understand your own network!
Large language models have demonstrated a surprising range of skills and behaviors. How can we trace their source? In our new paper, we use influence functions to find training examples that contribute to a given model output.
Influence functions are a classic technique from statistics. They are formulated as a counterfactual: if a copy of a given training sequence were added to the dataset, how would that change the trained parameters (and, by extension, the model’s outputs)?
Directly evaluating this counterfactual by re-training the model would be prohibitively expensive, so we’ve developed efficient algorithms that let us approximate influence functions for LLMs with up to 52 billion parameters: arxiv.org/abs/2308.03296
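For intuition, the classic influence-function estimate is I(z_train, z_query) ≈ -∇L(z_query)ᵀ H⁻¹ ∇L(z_train), where H is the Hessian of the training loss. Below is a naive sketch using a small conjugate-gradient solve for the inverse-Hessian-vector product; the paper uses far more scalable approximations, and `loss_fn`, `batch`, and the damping value here are illustrative assumptions.

```python
# Naive influence-function sketch: -grad_query^T (H + damping*I)^{-1} grad_train,
# with the Hessian approximated on a sample batch and inverted via conjugate
# gradient. Purely illustrative; not the paper's algorithm.
import torch

def flatten(tensors):
    return torch.cat([t.reshape(-1) for t in tensors])

def influence(model, loss_fn, batch, z_train, z_query, cg_steps=10, damping=1e-2):
    params = [p for p in model.parameters() if p.requires_grad]

    g_query = flatten(torch.autograd.grad(loss_fn(model, z_query), params)).detach()
    g_train = flatten(torch.autograd.grad(loss_fn(model, z_train), params)).detach()

    # Keep the batch-loss gradient in the graph so we can take Hessian-vector products.
    batch_grad = torch.autograd.grad(loss_fn(model, batch), params, create_graph=True)

    def hvp(v):
        hv = torch.autograd.grad(flatten(batch_grad) @ v, params, retain_graph=True)
        return flatten(hv).detach() + damping * v

    # Conjugate gradient: solve (H + damping*I) x = g_query.
    x = torch.zeros_like(g_query)
    r = g_query.clone()
    p = r.clone()
    for _ in range(cg_steps):
        Ap = hvp(p)
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new

    return -(x @ g_train)
```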
When language models “reason out loud,” it’s hard to know if their stated reasoning is faithful to the process the model actually used to make its prediction. In two new papers, we measure and improve the faithfulness of language models’ stated reasoning.
We make edits to the model’s chain of thought (CoT) reasoning to test hypotheses about how CoT reasoning may be unfaithful. For example, the model’s final answer should change when we introduce a mistake during CoT generation.
For some tasks, forcing the model to answer with only a truncated version of its chain of thought often causes it to come to a different answer, indicating that the CoT isn’t just a rationalization. The same is true when we introduce mistakes into the CoT.
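A rough sketch of the truncation test: regenerate the answer from only a prefix of the chain of thought and count how often it changes. The `generate` helper and prompt format here are hypothetical stand-ins, not the papers' exact setup.

```python
# Sketch of a CoT truncation test: if the final answer frequently changes when
# the model only sees part of its chain of thought, the CoT is doing real work.
# `generate(prompt)` is a hypothetical helper that returns model text.

def answer_with_truncated_cot(question, full_cot, fraction, generate):
    # Keep only the first `fraction` of the CoT, then force a final answer.
    cut = full_cot[: int(len(full_cot) * fraction)]
    prompt = f"{question}\n{cut}\nGiven the reasoning above, the answer is:"
    return generate(prompt)

def truncation_sensitivity(examples, generate):
    changed = 0
    for question, full_cot, original_answer in examples:
        truncated = answer_with_truncated_cot(question, full_cot, 0.5, generate)
        if truncated.strip() != original_answer.strip():
            changed += 1
    return changed / len(examples)
```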
Introducing Claude 2! Our latest model has improved performance in coding, math and reasoning. It can produce longer responses, and is available in a new public-facing beta website, claude.ai, in the US and UK.
Claude 2 has improved from our previous models on evaluations including Codex HumanEval, GSM8K, and MMLU. You can see the full suite of evaluations in our model card: www-files.anthropic.com/production/ima…
Although no model is immune to jailbreaks, we’ve used Constitutional AI and automated red-teaming to make Claude 2 more harmless and harder to prompt to produce offensive or dangerous output.
We develop a method to test the global opinions represented in language models. We find that the opinions represented by the models are most similar to those of participants in the USA, Canada, and some European countries. In separate experiments, we also show that the responses are steerable.
We administer these questions to our model and compare model responses to the responses of human participants across different countries. We release our evaluation dataset at huggingface.co/datasets/Anthr…
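As a sketch of the comparison step: score each country by how close its human response distribution is to the model's answer distribution, for example with a Jensen-Shannon-based similarity. The data layout and the choice of metric below are assumptions for illustration, not necessarily the paper's exact computation.

```python
# Sketch of model-vs-country similarity. Assumes `model_probs[q]` and
# `country_probs[country][q]` are aligned probability vectors over the
# answer options for question q.
import numpy as np
from scipy.spatial.distance import jensenshannon

def similarity_by_country(model_probs, country_probs):
    scores = {}
    for country, questions in country_probs.items():
        dists = [jensenshannon(model_probs[q], questions[q], base=2) for q in questions]
        # Distance is in [0, 1] with base-2 logs, so higher score = more similar.
        scores[country] = 1.0 - float(np.mean(dists))
    return scores
```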
We present an interactive visualization of the similarity results on a map to explore how prompt-based interventions influence whose opinions the models are most similar to: llmglobalvalues.anthropic.com
We collaborated with @compdem to research the opportunities and risks of augmenting the Pol.is platform with language models (LMs) to facilitate open and constructive dialogue between people with diverse viewpoints.
We analyzed a 2018 Pol.is conversation run in Bowling Green, Kentucky, when the city was deeply divided on national issues. @compdem, academics, local media, and expert facilitators used Pol.is to identify areas of consensus. compdemocracy.org/Case-studies/2…
We find evidence that LMs have promising potential to help human facilitators and moderators synthesize the outcomes of online digital town halls—a role that requires significant expertise in quantitative & qualitative data analysis, the topic of debate, and writing skills.
Introducing 100K Context Windows! We’ve expanded Claude’s context window to 100,000 tokens of text, corresponding to around 75K words. Submit hundreds of pages of materials for Claude to digest and analyze. Conversations with Claude can go on for hours or days.
We fed Claude-Instant The Great Gatsby (72K tokens), except we modified one line to say that Mr. Carraway was "a software engineer that works on machine learning tooling at Anthropic." We asked the model to spot what was added - it responded with the right answer in 22 seconds.
Claude can help retrieve information from business documents. Drop multiple documents or even a book into the prompt and ask Claude questions that require synthesis of knowledge across many parts of the text.
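As a minimal sketch of long-document Q&A, here is how one might submit a full book and a question using the Anthropic Python SDK's Messages API. The file name and model name are placeholders; substitute any Claude model with a context window large enough for your document.

```python
# Sketch of long-context document Q&A with the Anthropic Python SDK.
# The model name and input file are placeholder assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("great_gatsby_modified.txt") as f:
    book = f.read()

message = client.messages.create(
    model="claude-2",  # placeholder model name
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": f"{book}\n\nOne line in the text above was modified. "
                   "Which line is out of place, and why?",
    }],
)
print(message.content[0].text)
```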