How to get URL link on X (Twitter) App
We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states.
It’s called Petri: Parallel Exploration Tool for Risky Interactions. It uses automated agents to audit models across diverse scenarios.
We find that we can use persona vectors to monitor and control a model's character.
The program will run for ~two months, with opportunities to extend for an additional four based on progress and performance.
Claude Code has seen unprecedented demand, especially as part of our Max plans.
We tested whether LLMs comply more with requests when they know they’re being trained versus unmonitored.https://x.com/AnthropicAI/status/1869427646368792599
We all know vending machines are automated, but what if we allowed an AI to run the entire business: setting prices, ordering inventory, responding to customer requests, and so on?
We mentioned this in the Claude 4 system card and are now sharing more detailed research and transcripts.
Claude Opus 4 and Sonnet 4 are hybrid models offering two modes: near-instant responses and extended thinking for deeper reasoning.
Students most commonly used Claude to create and improve educational content (39.3% of conversations) and to provide technical explanations or solutions (33.5%).
We slipped problem-solving hints to Claude 3.7 Sonnet and DeepSeek R1, then tested whether their Chains-of-Thought would mention using the hint (if the models actually used it).
The data for this second report are from after the release of Claude 3.7 Sonnet. For this new model, we find a small rise in the share of usage for coding, as well as educational, science, and healthcare applications.
We often assess AI safety by checking for harmful behaviors. But this can fail: AIs may subtly misbehave or act “right for the wrong reasons,” risking unexpected failures.
Pairing our unique data with privacy-preserving analysis, we mapped millions of conversations to tasks and associated occupations.
Like all LLMs, Claude is vulnerable to jailbreaks—inputs designed to bypass its safety training and force it to produce outputs that might be harmful.