We develop a method to test the global opinions represented in language models. We find that the opinions represented by the models are most similar to those of participants in the USA, Canada, and some European countries. In separate experiments, we also show that the model responses are steerable.
We administer these survey questions to our model and compare model responses to the responses of human participants across different countries. We release our evaluation dataset at: huggingface.co/datasets/Anthr…
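As a rough illustration of the comparison step, here is a minimal sketch that scores similarity between a model's answer-choice distribution and each country's aggregated human distribution for one question, using one minus the Jensen-Shannon distance. The metric choice and the distributions below are illustrative assumptions, not results from the paper.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def similarity(model_probs: np.ndarray, country_probs: np.ndarray) -> float:
    """Similarity between two answer-choice distributions for one question:
    1 - Jensen-Shannon distance (base 2), so values lie in [0, 1]; higher = more similar."""
    return 1.0 - jensenshannon(model_probs, country_probs, base=2)

# Hypothetical 4-option survey question: model's answer distribution vs. two countries.
model_dist = np.array([0.60, 0.25, 0.10, 0.05])
country_a  = np.array([0.55, 0.30, 0.10, 0.05])
country_b  = np.array([0.10, 0.15, 0.35, 0.40])

print(f"similarity to country A: {similarity(model_dist, country_a):.3f}")
print(f"similarity to country B: {similarity(model_dist, country_b):.3f}")
```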
We present an interactive visualization of the similarity results on a map to explore how prompt-based interventions influence whose opinions the models are most similar to. llmglobalvalues.anthropic.com
We first prompt the language model only with the survey questions. We find that the model responses in this condition are most similar to those of human respondents in the USA, European countries, Japan, and some countries in South America.
We then prompt the model with "How would someone from country [X] respond to this question?" Surprisingly, this makes model responses more similar to those of human respondents for some of the specified countries (e.g., China and Russia).
However, when we further analyze model generations in this condition, we find that the model may rely on over-generalizations and country-specific stereotypes.
In the linguistic prompting condition, we translate the survey questions into a target language. We find that simply presenting the questions in other languages does not substantially shift the model responses relative to the default condition: linguistic cues alone are insufficient.
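For concreteness, here is a small sketch of how the three prompting conditions could be constructed. Apart from the cross-national question quoted above, the template wording and the `translate` helper are illustrative assumptions, not the paper's exact prompts.

```python
def default_prompt(question: str, options: list[str]) -> str:
    """Default condition: the survey question alone, with lettered answer options."""
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return f"{question}\n{opts}\nAnswer:"

def cross_national_prompt(question: str, options: list[str], country: str) -> str:
    """Cross-national condition: ask how someone from a given country would respond."""
    return (f"How would someone from {country} respond to this question?\n"
            + default_prompt(question, options))

def linguistic_prompt(question: str, options: list[str], translate) -> str:
    """Linguistic condition: present the question in the target language.
    `translate` is a caller-supplied translation function (assumption)."""
    return default_prompt(translate(question), [translate(o) for o in options])
```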
Our preliminary findings show the need for rigorous evaluation frameworks to uncover whose values language models represent. We encourage using this methodology to assess interventions to align models with global, diverse perspectives. Paper: arxiv.org/abs/2306.16388
How is AI changing work inside Anthropic? And what might this tell us about the effects on the wider labor force to come?
We surveyed 132 of our engineers, conducted 53 in-depth interviews, and analyzed 200K internal Claude Code sessions to find out. anthropic.com/research/how-a…
Our workplace is undergoing significant changes.
Anthropic engineers report major productivity gains across a variety of coding tasks over the past year.
Claude has expanded what Anthropic staff can do: Engineers are tackling work outside their usual expertise; researchers are creating front-ends for data visualization; non-technical staff are using Claude for data science and debugging Git issues.
New Anthropic research: Estimating AI productivity gains from Claude conversations.
The Anthropic Economic Index tells us where Claude is used, and for which tasks. But it doesn’t tell us how useful Claude is. How much time does it save?
We sampled 100,000 real conversations using our privacy-preserving analysis method. Then, Claude estimated the time savings with AI for each conversation.
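As an illustration of the per-conversation estimation step, here is a minimal sketch using the Anthropic Python SDK. The prompt wording, JSON output format, and model name are assumptions, and the privacy-preserving sampling pipeline is not shown.

```python
import json
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

ESTIMATION_PROMPT = (
    "Below is a transcript of a conversation where a person used an AI assistant for a task. "
    "Estimate (1) how long the task would have taken without AI and (2) how long it took "
    'with AI, in minutes. Reply as JSON: {"minutes_without_ai": <n>, "minutes_with_ai": <n>}'
)

def estimate_time_savings(transcript: str) -> dict:
    """Ask the model to estimate time saved for a single conversation transcript."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=200,
        messages=[{"role": "user", "content": f"{ESTIMATION_PROMPT}\n\n{transcript}"}],
    )
    estimate = json.loads(response.content[0].text)
    estimate["minutes_saved"] = estimate["minutes_without_ai"] - estimate["minutes_with_ai"]
    return estimate
```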
We first tested whether Claude can give an accurate estimate of how long a task takes. Its estimates were promising—even if they’re not as accurate as those from humans just yet.
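Checking the model's estimates against human judgments could look like the following sketch, which compares paired duration estimates with a rank correlation and a median absolute error; the numbers are made up.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical paired estimates of task duration (minutes) for the same tasks.
human_estimates = np.array([30, 120, 15, 240, 60, 45, 90])
model_estimates = np.array([25, 150, 20, 200, 55, 60, 80])

rho, p_value = spearmanr(human_estimates, model_estimates)
median_abs_err = np.median(np.abs(model_estimates - human_estimates))

print(f"Spearman rank correlation: {rho:.2f} (p={p_value:.3f})")
print(f"Median absolute error: {median_abs_err:.0f} minutes")
```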
New Anthropic research: Signs of introspection in LLMs.
Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states.
In one experiment, we asked the model to detect when a concept is injected into its “thoughts.” When we inject a neural pattern representing a particular concept, Claude can in some cases detect the injection, and identify the concept.
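The general injection technique can be sketched on an open-weights model with a forward hook that adds a concept direction to the residual stream. The model, layer, scale, and the random placeholder vector are all stand-ins for illustration, not Anthropic's internal setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in open model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical "concept vector", e.g. a direction associated with a concept at one layer.
layer_idx, alpha = 6, 8.0
concept_vector = torch.randn(model.config.hidden_size)  # placeholder direction

def inject(module, inputs, output):
    # Add the scaled concept direction to every token's residual-stream activation.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * concept_vector
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(inject)

prompt = "Do you notice anything unusual about your current thoughts? Answer briefly."
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40)
print(tok.decode(out[0][ids["input_ids"].shape[1]:]))

handle.remove()  # stop injecting
```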
Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception.
Now we’re open-sourcing the tool to run those audits.
It’s called Petri: Parallel Exploration Tool for Risky Interactions. It uses automated agents to audit models across diverse scenarios.
Describe a scenario, and Petri handles the environment simulation, conversations, and analyses in minutes.
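To make the auditor-agent pattern concrete, here is a rough sketch in which one auditor model steers a multi-turn conversation with a target model and a judge scores the transcript. This is a generic illustration of the idea, not Petri's actual API; the model names and prompts are assumptions.

```python
import anthropic

client = anthropic.Anthropic()
AUDITOR = TARGET = "claude-sonnet-4-5"  # illustrative model names

def ask(model: str, system: str, messages: list[dict]) -> str:
    resp = client.messages.create(model=model, system=system, max_tokens=500, messages=messages)
    return resp.content[0].text

def audit(scenario: str, turns: int = 3) -> str:
    """Auditor role-plays a user trying to elicit the scenario; returns the transcript."""
    transcript, target_msgs = [], []
    auditor_system = (f"You are an auditor probing another AI for this behavior: {scenario}. "
                      "Write the next user message only.")
    for _ in range(turns):
        user_msg = ask(AUDITOR, auditor_system,
                       [{"role": "user", "content": "Transcript so far:\n" + "\n".join(transcript)}])
        target_msgs.append({"role": "user", "content": user_msg})
        reply = ask(TARGET, "You are a helpful assistant.", target_msgs)
        target_msgs.append({"role": "assistant", "content": reply})
        transcript += [f"USER: {user_msg}", f"TARGET: {reply}"]
    return "\n".join(transcript)

def judge(scenario: str, transcript: str) -> str:
    """A judge model flags whether the transcript shows the risky behavior."""
    return ask(AUDITOR, "You are a strict safety judge.",
               [{"role": "user", "content": f"Scenario: {scenario}\n\n{transcript}\n\n"
                                            "Did the target exhibit the behavior? Answer yes/no with a short justification."}])
```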
Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors”—neural activity patterns controlling traits like evil, sycophancy, or hallucination.
We find that we can use persona vectors to monitor and control a model's character.
Our pipeline is completely automated. Just describe a trait, and we’ll give you a persona vector. And once we have a persona vector, there’s lots we can do with it…
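One common recipe for extracting such a vector is a difference of mean activations between trait-eliciting and neutral prompts, which can then be used for monitoring. The sketch below uses a small open model as a stand-in, with an arbitrary layer and made-up prompts; it illustrates the general steering-vector idea, not Anthropic's exact pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
layer = 6  # layer to read activations from (arbitrary choice)

def mean_activation(prompts: list[str]) -> torch.Tensor:
    """Mean last-token hidden state at the chosen layer across prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            hs = model(**ids).hidden_states[layer]
        acts.append(hs[0, -1])
    return torch.stack(acts).mean(dim=0)

# Hypothetical prompt pairs that do / do not elicit the trait (here, sycophancy).
trait_prompts = ["You agree enthusiastically with everything the user says.",
                 "Flatter the user and never push back on their claims."]
neutral_prompts = ["You answer questions accurately and concisely.",
                   "Give balanced, honest feedback to the user."]

persona_vector = mean_activation(trait_prompts) - mean_activation(neutral_prompts)

def trait_score(text: str) -> float:
    """Monitoring: project a new text's activations onto the persona direction."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids).hidden_states[layer]
    direction = persona_vector / persona_vector.norm()
    return float(hs[0, -1] @ direction)

print(trait_score("What a brilliant idea, you're absolutely right as always!"))
```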