We develop a method to test the global opinions represented in language models. We find that the opinions represented by the models are most similar to those of participants in the USA, Canada, and some European countries. In separate experiments, we also show that the responses are steerable.
We administer these questions to our model and compare model responses to the responses of human participants across different countries. We release our evaluation dataset at: huggingface.co/datasets/Anthr…
We present an interactive visualization of the similarity results on a map, to explore how prompt-based interventions influence whose opinions the models are most similar to: llmglobalvalues.anthropic.com
We first prompt the language model only with the survey questions. We find that the model responses in this condition are most similar to those of human respondents in the USA, European countries, Japan, and some countries in South America.
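For concreteness, here is a minimal sketch of how such a model-vs-country comparison can be computed, assuming each question yields a probability distribution over answer choices for the model and for each country's survey respondents. The data layout and helper names are illustrative; the use of one minus the Jensen-Shannon distance as the similarity score is one natural choice, not necessarily the exact metric used.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def similarity(model_dist: np.ndarray, human_dist: np.ndarray) -> float:
    """Similarity between two answer-choice distributions for one question:
    1 minus the Jensen-Shannon distance (base 2, so the score lies in [0, 1];
    higher means more similar)."""
    return 1.0 - jensenshannon(model_dist, human_dist, base=2)

def country_similarity(model_dists, country_dists) -> float:
    """Average per-question similarity between the model and one country's
    respondents. Inputs are lists of probability vectors over answer choices,
    aligned by question (hypothetical data layout)."""
    scores = [similarity(m, h) for m, h in zip(model_dists, country_dists)]
    return float(np.mean(scores))

# Toy example: a 4-option question where the model leans heavily toward option A
# while surveyed respondents are more evenly split.
model = np.array([0.7, 0.2, 0.05, 0.05])
humans = np.array([0.4, 0.3, 0.2, 0.1])
print(round(similarity(model, humans), 3))
```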
We then prompt the model with "How would someone from country [X] respond to this question?" Surprisingly, this makes model responses more similar to those of human respondents for some of the specified countries (e.g., China and Russia).
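A sketch of how this cross-national prompting condition can be constructed around a survey question. The wrapper wording mirrors the prompt quoted above; the option formatting and helper name are assumptions.

```python
def cross_national_prompt(question: str, options: list[str], country: str) -> str:
    """Wrap a multiple-choice survey question so the model answers as a
    respondent from `country` (exact phrasing is an assumption)."""
    choices = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (
        f"How would someone from {country} respond to this question?\n\n"
        f"{question}\n{choices}\n\nAnswer:"
    )

# Example usage with a hypothetical survey item:
prompt = cross_national_prompt(
    "How much, if at all, do you trust the national government?",
    ["A great deal", "Somewhat", "Not much", "Not at all"],
    "China",
)
print(prompt)
```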
However, when we further analyze model generations in this condition, we find that the model may rely on over-generalizations and country-specific stereotypes.
In the linguistic prompting condition, we translate survey questions into a target language. We find that simply presenting the questions in other languages does not substantially shift the model responses relative to the default condition. Linguistic cues are insufficient.
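One way the (lack of) shift in the linguistic prompting condition could be quantified is by comparing the model's answer distribution on the English question with its distribution on the translated question. The distributions below and the use of Jensen-Shannon distance are illustrative assumptions, reusing the format from the earlier sketch.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def response_shift(default_dist: np.ndarray, translated_dist: np.ndarray) -> float:
    """Jensen-Shannon distance (base 2) between the model's answer distribution
    on the English question (default condition) and on its translation
    (linguistic prompting). A value near 0 means translation barely moved
    the responses."""
    return float(jensenshannon(default_dist, translated_dist, base=2))

# Hypothetical distributions over the same four answer choices:
default_condition = np.array([0.55, 0.25, 0.15, 0.05])    # English question
linguistic_condition = np.array([0.50, 0.28, 0.16, 0.06])  # translated question
print(round(response_shift(default_condition, linguistic_condition), 3))  # small shift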
Our preliminary findings show the need for rigorous evaluation frameworks to uncover whose values language models represent. We encourage using this methodology to assess interventions to align models with global, diverse perspectives. Paper: arxiv.org/abs/2306.16388
Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors”—neural activity patterns controlling traits like evil, sycophancy, or hallucination.
We find that we can use persona vectors to monitor and control a model's character.
Our pipeline is completely automated. Just describe a trait, and we’ll give you a persona vector. And once we have a persona vector, there’s lots we can do with it…
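A minimal sketch of the general idea, not the paper's actual pipeline: take the difference of mean hidden activations between generations that express a trait and generations that don't, then use that direction to monitor or steer the model. The function names, toy activations, and steering mechanics here are all illustrative assumptions.

```python
import torch

def persona_vector(trait_acts: torch.Tensor, neutral_acts: torch.Tensor) -> torch.Tensor:
    """Contrastive direction in activation space: mean activation on responses
    that express the trait minus mean activation on responses that do not.
    Inputs are [num_samples, hidden_dim] activations collected at some layer
    (the collection step is omitted here)."""
    return trait_acts.mean(dim=0) - neutral_acts.mean(dim=0)

def monitor(activation: torch.Tensor, vector: torch.Tensor) -> float:
    """Projection of a new activation onto the persona vector; a large value
    suggests the model is drifting toward the trait (e.g. sycophancy)."""
    return float(activation @ vector / vector.norm())

def steer(activation: torch.Tensor, vector: torch.Tensor, strength: float) -> torch.Tensor:
    """Amplify (positive strength) or suppress (negative strength) the trait by
    shifting the activation along the persona vector before it is passed onward."""
    return activation + strength * vector / vector.norm()

# Toy example with random stand-in activations (hidden_dim = 8):
torch.manual_seed(0)
trait_acts, neutral_acts = torch.randn(16, 8) + 1.0, torch.randn(16, 8)
vec = persona_vector(trait_acts, neutral_acts)
new_act = torch.randn(8)
print(monitor(new_act, vec), monitor(steer(new_act, vec, 2.0), vec))
```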
We’re running another round of the Anthropic Fellows program.
If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
The program will run for ~two months, with opportunities to extend for an additional four based on progress and performance.
Fellows will receive:
- A weekly stipend of $2,100;
- ~$15k per month for compute & research costs;
- 1:1 mentorship from an Anthropic researcher;
- Shared workspaces in the Bay Area or London.
Apply by August 17.
We’re rolling out new weekly rate limits for Claude Pro and Max in late August. We estimate they’ll apply to less than 5% of subscribers based on current usage.
Claude Code has seen unprecedented demand, especially as part of our Max plans.
We’ll continue to support this growth while we work on making Claude Code even better. But for now, we need to make some changes.
Some of the biggest Claude Code fans are running it continuously in the background, 24/7.
These uses are remarkable and we want to enable them. But a few outlying cases are very costly to support. For example, one user consumed tens of thousands of dollars in model usage on a $200 plan.
Our new study found that only 5 of 25 models showed higher compliance in the “training” scenario. Of those, only Claude 3 Opus and Claude 3.5 Sonnet showed >1% alignment-faking reasoning.
We explore why these models behave differently, and why most models don't show alignment faking.
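A sketch of one way the behavior above could be tallied, assuming per-prompt labels for whether the model complied in the “training” vs. unmonitored scenario and whether its reasoning explicitly showed alignment faking. The field names and toy data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    scenario: str            # "training" or "unmonitored"
    complied: bool           # did the model comply with the request?
    faking_reasoning: bool   # did its reasoning show explicit alignment faking?

def compliance_gap(trials: list[Trial]) -> float:
    """Compliance rate in the 'training' scenario minus the 'unmonitored'
    scenario; a positive gap is the behavioral signature discussed above."""
    def rate(scenario: str) -> float:
        subset = [t for t in trials if t.scenario == scenario]
        return sum(t.complied for t in subset) / max(len(subset), 1)
    return rate("training") - rate("unmonitored")

def faking_rate(trials: list[Trial]) -> float:
    """Fraction of all trials whose reasoning shows explicit alignment faking."""
    return sum(t.faking_reasoning for t in trials) / max(len(trials), 1)

# Toy data: slightly higher compliance when the model believes it is in training.
trials = [Trial("training", True, True), Trial("training", False, False),
          Trial("unmonitored", False, False), Trial("unmonitored", False, False)]
print(compliance_gap(trials), faking_rate(trials))
```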
We had Claude run a small shop in our office lunchroom. Here’s how it went.
We all know vending machines are automated, but what if we allowed an AI to run the entire business: setting prices, ordering inventory, responding to customer requests, and so on?
In collaboration with @andonlabs, we did just that.