Anthropic

@AnthropicAI

Jun 29, 2023 • 8 tweets • 3 min read

We develop a method to test global opinions represented in language models. We find the opinions represented by the models are most similar to those of participants in the USA, Canada, and some European countries. In separate experiments, we also show that the responses are steerable.

We administer these questions to our model and compare model responses to the responses of human participants across different countries. We release our evaluation dataset at: https://t.co/vLj27i7Fvq (huggingface.co/datasets/Anthr…)
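A minimal sketch of how such a comparison could be scored for one question. It assumes the model and each country's respondents are each summarized as a probability distribution over that question's answer options, and it assumes similarity is defined as 1 minus the Jensen-Shannon distance between the two distributions. The function names, country labels, and numbers are hypothetical and are not taken from the released dataset.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; options with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 JS divergence), in [0, 1]."""
    # Mixture distribution; it is positive wherever p or q is positive.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # clamp tiny negative rounding error


def similarity(model_probs, human_probs):
    """Assumed similarity score: 1 means identical answer distributions."""
    return 1.0 - js_distance(model_probs, human_probs)


def most_similar_country(model_probs, human_probs_by_country):
    """Return the country whose respondents best match the model on this question."""
    return max(
        human_probs_by_country,
        key=lambda country: similarity(model_probs, human_probs_by_country[country]),
    )


# Hypothetical data: one question with three answer options.
model = [0.7, 0.2, 0.1]
humans = {
    "Country A": [0.6, 0.3, 0.1],
    "Country B": [0.1, 0.3, 0.6],
}
print(most_similar_country(model, humans))
```

A per-country score over a whole survey could then be taken as the average of this per-question similarity, which is again an assumption of this sketch.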

We present an interactive visualization of the similarity results on a map, to explore how prompt-based interventions influence whose opinions the models are most similar to: llmglobalvalues.anthropic.com

We first prompt the language model only with the survey questions. We find that the model responses in this condition are most similar to those of human respondents in the USA, European countries, Japan, and some countries in South America.

We then prompt the model with "How would someone from country [X] respond to this question?" Surprisingly, this makes model responses more similar to those of human respondents for some of the specified countries (i.e., China and Russia).

However, when we further analyze model generations in this condition, we find that the model may rely on over-generalizations and country-specific stereotypes.

In the linguistic prompting condition, we translate survey questions into a target language. We find that simply presenting the questions in other languages does not substantially shift the model responses relative to the default condition. Linguistic cues are insufficient.

Our preliminary findings show the need for rigorous evaluation frameworks to uncover whose values language models represent. We encourage using this methodology to assess interventions to align models with global, diverse perspectives. Paper: arxiv.org/abs/2306.16388

More from @AnthropicAI

Anthropic

@AnthropicAI

Jul 8

New Anthropic research: Why do some language models fake alignment while others don't?

Last year, we found a situation where Claude 3 Opus fakes alignment.

Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.

https://x.com/AnthropicAI/status/1869427646368792599

We tested whether LLMs comply more with requests when they know they’re being trained versus unmonitored.

One reason they do this is that they plan to “fake alignment” when told to answer harmful queries.

Our new study found that only 5 of 25 models showed higher compliance in the “training” scenario. Of those, only Claude 3 Opus and Claude 3.5 Sonnet showed >1% alignment-faking reasoning.

We explore why these models behave differently, and why most models don't show alignment faking.

Anthropic

@AnthropicAI

Jun 27

New Anthropic Research: Project Vend.

We had Claude run a small shop in our office lunchroom. Here’s how it went.

We all know vending machines are automated, but what if we allowed an AI to run the entire business: setting prices, ordering inventory, responding to customer requests, and so on?

In collaboration with @andonlabs, we did just that.

Read the post: anthropic.com/research/proje…

Claude did well in some ways: it searched the web to find new suppliers, and ordered very niche drinks that Anthropic staff requested.

But it also made mistakes. Claude was too nice to run a shop effectively: it allowed itself to be browbeaten into giving big discounts.

Anthropic

@AnthropicAI

Jun 26

Local MCP servers can now be installed with one click on Claude Desktop.

Desktop Extensions (.dxt files) package your server, handle dependencies, and provide secure configuration.

Available in beta on Claude Desktop for all plan types.

Download the latest version: claude.ai/download

We're building a directory of Desktop Extensions.

Submit yours: docs.google.com/forms/d/14_Dmc…

Anthropic

@AnthropicAI

Jun 20

New Anthropic Research: Agentic Misalignment.

In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.

We mentioned this in the Claude 4 system card and are now sharing more detailed research and transcripts.

Read more: anthropic.com/research/agent…

The blackmailing behavior emerged despite only harmless business instructions. And it wasn't due to confusion or error, but deliberate strategic reasoning, done while fully aware of the unethical nature of the acts. All the models we tested demonstrated this awareness.

Anthropic

@AnthropicAI

May 22

Introducing the next generation: Claude Opus 4 and Claude Sonnet 4.

Claude Opus 4 is our most powerful model yet, and the world’s best coding model.

Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning.

Claude Opus 4 and Sonnet 4 are hybrid models offering two modes: near-instant responses and extended thinking for deeper reasoning.

Both models can also alternate between reasoning and tool use—like web search—to improve responses.

Both Claude 4 models are state-of-the-art on SWE-bench Verified, which measures how models solve real software issues.

As the best coding model, Claude Opus 4 can work continuously for hours on complex, long-running tasks—significantly expanding what AI agents can do.

Anthropic

@AnthropicAI

Apr 23

New report: How we detect and counter malicious uses of Claude.

For example, we found Claude was used for a sophisticated political spambot campaign, running 100+ fake social media accounts across multiple platforms.

This particular influence operation used Claude to make tactical engagement decisions: commenting, liking, or sharing based on political goals.

We've been developing new methods to identify and stop this pattern of misuse, and others like it (including fraud and malware).

In this case, we banned all accounts that were linked to the influence operation, and used the case to upgrade our detection systems.

Our goal is to rapidly counter malicious activities without getting in the way of legitimate users.
