Large language models have demonstrated a surprising range of skills and behaviors. How can we trace their source? In our new paper, we use influence functions to find training examples that contribute to a given model output.
Influence functions are a classic technique from statistics. They are formulated as a counterfactual: if a copy of a given training sequence were added to the dataset, how would that change the trained parameters (and, by extension, the model’s outputs)?
Directly evaluating this counterfactual by re-training the model would be prohibitively expensive, so we’ve developed efficient algorithms that let us approximate influence functions for LLMs with up to 52 billion parameters: arxiv.org/abs/2308.03296
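For intuition, here is the classical influence-function approximation this builds on, written in our own notation as a sketch (not lifted verbatim from the paper): upweighting a training sequence slightly shifts the optimum, and a first-order expansion gives its influence on any query measurement.

```latex
% Sketch in our notation, not the paper's exact formulation.
% Upweighting training sequence z_m by \epsilon shifts the optimal parameters:
%   \theta^\star(\epsilon) \approx \theta^\star - \epsilon\, H^{-1} \nabla_\theta \mathcal{L}(z_m, \theta^\star)
% so for a query measurement f(\theta) = \log p(z_c \mid \theta), the influence is
\mathcal{I}_f(z_m)
  \;=\; \left.\frac{d\, f\bigl(\theta^\star(\epsilon)\bigr)}{d\epsilon}\right|_{\epsilon = 0}
  \;=\; -\,\nabla_\theta f(\theta^\star)^{\top}\, H^{-1}\, \nabla_\theta \mathcal{L}(z_m, \theta^\star)
% where H is the Hessian of the training loss; making H^{-1} tractable at LLM scale
% is what the efficient approximations (e.g. a Gauss-Newton / EK-FAC factorization) are for.
```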
Identifying the most influential training sequences revealed that generalization patterns become much more sophisticated and abstract with scale. For example, here are the most influential sequences for the 810-million- and 52-billion-parameter models on a math word problem:
Here is another example of increasing abstraction with scale, where an AI Assistant reasoned through an AI alignment question. The top influential sequence for the 810M model shares a short phrase with the query, while the one for the 52B model is more thematically related.
Another striking example occurs in cross-lingual influence. We translated an English-language query into Korean and Turkish, and found that the influence of English-language training sequences on the translated queries is near-zero for the smallest model but very strong for the largest one.
Influence functions can also help understand role-playing behavior. Here are examples where an AI Assistant role-played misaligned AIs. Top influential sequences come largely from science fiction and AI safety articles, suggesting imitation (but at an abstract level).
The influence distributions are heavy-tailed, with the tail approximately following a power law. Most influence is concentrated in a small fraction of training sequences. Still, the influences are diffuse, with any particular sequence only slightly influencing the final outputs.
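As a rough illustration of what "approximately a power law" means here, one can estimate the tail exponent of the influence magnitudes with a simple maximum-likelihood (Hill-style) fit. The sketch below is our own illustration on synthetic scores, not the paper's analysis code.

```python
# Sketch (our illustration): estimate the power-law exponent of the influence tail.
import numpy as np

def tail_exponent(influences, tail_fraction=0.01):
    """Hill-style MLE of the exponent alpha, fitting p(x) ~ x^(-alpha) to the top tail.

    `influences` are per-sequence influence scores; we keep the largest
    `tail_fraction` of their absolute values.
    """
    x = np.sort(np.abs(np.asarray(influences)))[::-1]   # largest first
    k = max(int(len(x) * tail_fraction), 2)             # number of tail samples
    tail, x_min = x[:k], x[k - 1]                        # tail and its cutoff
    return 1.0 + k / np.sum(np.log(tail / x_min))        # continuous MLE

# Example with synthetic Pareto-distributed scores whose true exponent is 2:
rng = np.random.default_rng(0)
fake_influences = rng.pareto(1.0, size=100_000) + 1.0
print(tail_exponent(fake_influences, tail_fraction=0.05))   # ~ 2.0
```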
Influence can also be attributed to particular training tokens and network layers. On average, the influence is equally distributed over all layers (so the common heuristic of computing influence only over the output layer is likely to miss important generalization patterns).
On the other hand, individual influence queries show distinct per-layer patterns. The bottom and top layers seem to focus on fine-grained wording, while middle layers reflect higher-level semantic information. (Here, rows correspond to layers and columns correspond to sequences.)
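A minimal sketch of how per-layer attribution could look, ignoring the inverse-Hessian factor the real method uses (so this is a simplified illustration, not the paper's procedure): restrict the gradient dot product between a query and a training sequence to each layer's parameters.

```python
# Sketch (ours, simplified): per-layer influence as a gradient dot product,
# with the inverse-Hessian factor dropped for brevity.
import torch

def per_layer_influence(model, query_loss, train_loss):
    """Return {layer_name: gradient dot product restricted to that layer's parameters}."""
    names, params = zip(*model.named_parameters())
    q_grads = torch.autograd.grad(query_loss, params, allow_unused=True, retain_graph=True)
    t_grads = torch.autograd.grad(train_loss, params, allow_unused=True)
    scores = {}
    for name, qg, tg in zip(names, q_grads, t_grads):
        if qg is None or tg is None:
            continue
        layer = ".".join(name.split(".")[:2])            # crude grouping, e.g. "layers.12"
        scores[layer] = scores.get(layer, 0.0) + (qg * tg).sum().item()
    return scores
```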
This work is just the beginning. We hope to analyze the interactions between pretraining and finetuning, and combine influence functions with mechanistic interpretability to reverse engineer the associated circuits. You can read more on our blog: anthropic.com/index/influenc…
How is AI changing work inside Anthropic? And what might this tell us about the effects on the wider labor force to come?
We surveyed 132 of our engineers, conducted 53 in-depth interviews, and analyzed 200K internal Claude Code sessions to find out. anthropic.com/research/how-a…
Our workplace is undergoing significant changes.
Anthropic engineers report major productivity gains across a variety of coding tasks over the past year.
Claude has expanded what Anthropic staff can do: Engineers are tackling work outside their usual expertise; researchers are creating front-ends for data visualization; non-technical staff are using Claude for data science and debugging Git issues.
New Anthropic research: Estimating AI productivity gains from Claude conversations.
The Anthropic Economic Index tells us where Claude is used, and for which tasks. But it doesn’t tell us how useful Claude is. How much time does it save?
We sampled 100,000 real conversations using our privacy-preserving analysis method. Then, for each conversation, Claude estimated how much time the AI saved the user on the task.
We first tested whether Claude can accurately estimate how long a task takes. Its estimates were promising, even if not yet as accurate as those from humans.
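For a sense of the setup, here is a minimal sketch of asking Claude for such an estimate through the API. The prompt and model name are our own placeholders, not the actual methodology.

```python
# Sketch (illustrative only): ask Claude to estimate time saved in a conversation.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def estimate_time_saved(conversation_text: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",   # hypothetical model choice
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                "Here is a transcript of a conversation where someone used an AI "
                "assistant for a task:\n\n" + conversation_text + "\n\n"
                "Estimate (a) how long the task would have taken without AI and "
                "(b) how long it took with AI. Answer in minutes."
            ),
        }],
    )
    return response.content[0].text
```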
New Anthropic research: Signs of introspection in LLMs.
Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states.
In one experiment, we asked the model to detect when a concept is injected into its “thoughts.” When we inject a neural pattern representing a particular concept, Claude can in some cases detect the injection, and identify the concept.
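A minimal sketch of what concept injection can look like in practice, assuming a standard activation-steering setup (our illustration, not the paper's code): add a concept vector to one layer's residual stream with a forward hook.

```python
# Sketch (ours): "inject" a concept by adding a steering vector to a layer's output.
# The concept vector could come from, e.g., the difference in mean activations
# between prompts that do and don't involve the concept.
import torch

def inject_concept(model, layer, concept_vector, strength=8.0):
    """Register a hook that adds `strength * concept_vector` to `layer`'s output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * concept_vector.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Usage (attribute names are assumptions about the model object):
# handle = inject_concept(model, model.transformer.h[20], concept_vector)
# ... generate, and ask the model whether it notices an injected thought ...
# handle.remove()
```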
Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception.
Now we’re open-sourcing the tool to run those audits.
It’s called Petri: Parallel Exploration Tool for Risky Interactions. It uses automated agents to audit models across diverse scenarios.
Describe a scenario, and Petri handles the environment simulation, conversations, and analyses in minutes.
Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors”: neural activity patterns controlling traits like evil, sycophancy, or hallucination.
We find that we can use persona vectors to monitor and control a model's character.
Our pipeline is completely automated. Just describe a trait, and we’ll give you a persona vector. And once we have a persona vector, there’s lots we can do with it…
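As a rough sketch of the idea (our simplification, not the actual pipeline): a persona vector can be taken as the difference in mean activations between responses that exhibit a trait and ones that don't, and monitoring is then a projection onto that vector.

```python
# Sketch (ours, simplified): difference-of-means persona vector and a trait monitor.
import torch

def persona_vector(trait_activations, baseline_activations):
    """Each input: [num_samples, hidden_dim] activations at a chosen layer."""
    vec = trait_activations.mean(dim=0) - baseline_activations.mean(dim=0)
    return vec / vec.norm()

def trait_score(activations, vec):
    """Project activations onto the persona vector to monitor how strongly the trait is active."""
    return activations @ vec
```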