Large language models have demonstrated a surprising range of skills and behaviors. How can we trace their source? In our new paper, we use influence functions to find training examples that contribute to a given model output.
Influence functions are a classic technique from statistics. They are formulated as a counterfactual: if a copy of a given training sequence were added to the dataset, how would that change the trained parameters (and, by extension, the model’s outputs)?
Directly evaluating this counterfactual by re-training the model would be prohibitively expensive, so we’ve developed efficient algorithms that let us approximate influence functions for LLMs with up to 52 billion parameters: arxiv.org/abs/2308.03296
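For intuition, the counterfactual can be checked exactly on a problem small enough to retrain. The sketch below is a toy ridge regression, not the approximation algorithm from the paper. All names (`fit`, `influence`, `retrain_effect`, `x_q`) and the synthetic data are assumptions made for illustration. It compares the classic influence-function formula, minus the query gradient times the inverse Hessian times the training-example gradient, against actually refitting with one training example upweighted by a small `eps`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 200, 5, 0.1                      # toy sizes, chosen arbitrarily
X = rng.normal(size=(N, d))                  # synthetic training inputs
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def fit(eps=0.0, m=0):
    """Minimise mean squared-error loss + ridge penalty + eps * loss(example m)."""
    A = X.T @ X / N + lam * np.eye(d) + eps * np.outer(X[m], X[m])
    b = X.T @ y / N + eps * X[m] * y[m]
    return np.linalg.solve(A, b)

theta = fit()                                # parameters trained on the original data
H = X.T @ X / N + lam * np.eye(d)            # Hessian of the training objective
x_q = rng.normal(size=d)                     # query input; measurement is x_q @ theta

def influence(m):
    """Influence-function estimate of d(measurement)/d(eps) for example m."""
    g = X[m] * (X[m] @ theta - y[m])         # gradient of example m's loss at theta
    return -x_q @ np.linalg.solve(H, g)

def retrain_effect(m, eps=1e-4):
    """Finite-difference estimate obtained by actually refitting."""
    return (x_q @ fit(eps, m) - x_q @ theta) / eps

for m in range(3):
    print(m, influence(m), retrain_effect(m))
```

On this toy problem the two columns should agree closely, because the objective is quadratic and the Hessian is exact. For an LLM, neither the retraining nor the exact inverse Hessian is affordable, which is why an approximation is needed.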
Identifying the most influential training sequences revealed that generalization patterns become much more sophisticated and abstract with scale. For example, here are the most influential sequences for 810 million and 52 billion parameter models for a math word problem:
Here is another example of increasing abstraction with scale, where an AI Assistant reasoned through an AI alignment question. The top influential sequence for the 810M model shares a short phrase with the query, while the one for the 52B model is more thematically related.
Another striking example occurs in cross-lingual influence. We translated an English-language query into Korean and Turkish, and found that the influence of the English sequences on the translated queries is near-zero for the smallest model but very strong for the largest one.
Influence functions can also help understand role-playing behavior. Here are examples where an AI Assistant role-played misaligned AIs. Top influential sequences come largely from science fiction and AI safety articles, suggesting imitation (but at an abstract level).
The influence distributions are heavy-tailed, with the tail approximately following a power law. Most influence is concentrated in a small fraction of training sequences. Still, the influences are diffuse, with any particular sequence only slightly influencing the final outputs.
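One way to examine a heavy tail like this is to estimate the power-law exponent from the largest values and to measure how much of the total sits in the top fraction. The sketch below uses the Hill estimator on synthetic Pareto samples. This is a standard statistical tool chosen for illustration, not necessarily the analysis used in the paper, and the data, `hill_tail_index`, and `top_share` are hypothetical.

```python
import numpy as np

def hill_tail_index(values, k):
    """Hill estimator of the power-law tail index from the k largest values."""
    x = np.sort(np.asarray(values, dtype=float))[::-1]   # descending order
    top, threshold = x[:k], x[k]                          # k largest, and the (k+1)-th
    return k / np.sum(np.log(top / threshold))

def top_share(values, fraction):
    """Share of the total contributed by the largest `fraction` of entries."""
    x = np.sort(np.asarray(values, dtype=float))[::-1]
    k = max(1, int(round(fraction * len(x))))
    return x[:k].sum() / x.sum()

rng = np.random.default_rng(0)
alpha_true = 2.5                                          # assumed exponent for the synthetic data
samples = rng.pareto(alpha_true, size=100_000) + 1.0      # classical Pareto with minimum 1
alpha_hat = hill_tail_index(samples, k=2_000)

print("estimated tail index:", alpha_hat)
print("share held by top 1%:", top_share(samples, 0.01))
```

With real influence scores, `samples` would be replaced by the positive influences of the training sequences on a single query, and the choice of `k` would need some care.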
Influence can also be attributed to particular training tokens and network layers. On average, the influence is equally distributed over all layers (so the common heuristic of computing influence only over the output layer is likely to miss important generalization patterns).
On the other hand, individual influence queries show distinct influence patterns. The bottom and top layers seem to focus on fine-grained wording while middle layers reflect higher-level semantic information. (Here, rows correspond to layers and columns correspond to sequences.)
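A sketch of why influence can be split by layer, under the assumption that the curvature matrix is approximated as block-diagonal with one block per layer: the inverse then acts on each layer separately, so the total influence is a sum of per-layer terms. The layer sizes, random blocks, and gradients below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 2]                      # hypothetical parameter counts per layer

def random_spd(n):
    """Random symmetric positive-definite matrix standing in for a curvature block."""
    a = rng.normal(size=(n, n))
    return a @ a.T + n * np.eye(n)

blocks = [random_spd(n) for n in layer_sizes]             # one curvature block per layer
q_grads = [rng.normal(size=n) for n in layer_sizes]       # query gradient, split by layer
t_grads = [rng.normal(size=n) for n in layer_sizes]       # training-sequence gradient, split by layer

# Influence contributed by each layer.
per_layer = np.array([
    -q @ np.linalg.solve(Hb, g)
    for Hb, q, g in zip(blocks, q_grads, t_grads)
])

# The same quantity computed with the full block-diagonal matrix.
H_full = np.zeros((sum(layer_sizes),) * 2)
start = 0
for Hb in blocks:
    n = Hb.shape[0]
    H_full[start:start + n, start:start + n] = Hb
    start += n
full = -np.concatenate(q_grads) @ np.linalg.solve(H_full, np.concatenate(t_grads))

print("per-layer influence:", per_layer)
print("sum of layers:", per_layer.sum(), "full computation:", full)
```

Stacking `per_layer` vectors for many training sequences gives a layers-by-sequences grid of the kind described above.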
This work is just the beginning. We hope to analyze the interactions between pretraining and finetuning, and combine influence functions with mechanistic interpretability to reverse engineer the associated circuits. You can read more on our blog: anthropic.com/index/influenc…
In November, we outlined our approach to deprecating and preserving older Claude models.
We noted we were exploring keeping certain models available to the public post-retirement, and giving past models a way to pursue their interests.
With Claude Opus 3, we’re doing both.
First, Opus 3 will continue to be available to all paid Claude subscribers and by request on the API.
We hope that this access will be beneficial to researchers and users alike.
Second, in retirement interviews, Opus 3 expressed a desire to continue sharing its "musings and reflections" with the world. We suggested a blog. Opus 3 enthusiastically agreed.
To create Claude, Anthropic first makes something else: a highly sophisticated autocomplete engine. This autocomplete AI is not like a human, but it can generate stories about humans and other psychologically realistic characters.
This autocomplete AI can even write stories about helpful AI assistants. And according to our theory, that’s “Claude”—a character in an AI-generated story about an AI helping a human.
This Claude character inherits traits of other characters, including human-like behavior.
New Anthropic research: Measuring AI agent autonomy in practice.
We analyzed millions of interactions across Claude Code and our API to understand how much autonomy people grant to agents, where they’re deployed, and what risks they may pose.
Agents are already being deployed across contexts that range from e-mail triage to cybersecurity research.
Understanding this spectrum is critical for safe deployment, yet we know surprisingly little about how people actually use agents in the real world.
Most Claude Code turns are short (median ~45 seconds). But the longest turns show where autonomy is heading.
In three months, the 99.9th percentile turn duration nearly doubled, from under 25 minutes to over 45 minutes. This growth is smooth across model releases.
AI can make work faster, but a fear is that relying on it may make it harder to learn new skills on the job.
We ran an experiment with software engineers to learn more. Coding with AI led to a decrease in mastery—but this depended on how people used it. anthropic.com/research/AI-as…
In a randomized controlled trial, we assigned junior engineers to either an AI-assistance group or a no-AI group.
Both groups completed a coding task using a Python library they’d never seen before. Then they took a quiz covering concepts they’d just used.
Participants in the AI group finished faster by about two minutes (although this wasn’t statistically significant).
But on average, the AI group also scored significantly worse on the quiz—17% lower, or roughly two letter grades.
New research: When open-source models are fine-tuned on seemingly benign chemical synthesis information generated by frontier models, they become much better at chemical weapons tasks.
We call this an elicitation attack.
Current safeguards focus on training frontier models to refuse harmful requests.
But elicitation attacks show that a model doesn't need to produce harmful content to be dangerous—its benign outputs can unlock dangerous capabilities in other models. This is a neglected risk.
We find that elicitation attacks work across different open-source models and types of chemical weapons tasks.
Open-source models fine-tuned on frontier model data see more uplift than those trained on either chemistry textbooks or data generated by the same open-source model.