How to get URL link on X (Twitter) App
Current safeguards focus on training frontier models to refuse harmful requests.
We analyzed the internals of three open-weights AI models to map their “persona space,” and identified what we call the Assistant Axis, a pattern of neural activity that drives Assistant-like behavior.
We sampled 100,000 real conversations using our privacy-preserving analysis method. Then, Claude estimated the time savings with AI for each conversation.
We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states.
It’s called Petri: Parallel Exploration Tool for Risky Interactions. It uses automated agents to audit models across diverse scenarios.
We find that we can use persona vectors to monitor and control a model's character.
The program will run for ~two months, with opportunities to extend for an additional four based on progress and performance.
Claude Code has seen unprecedented demand, especially as part of our Max plans.
We tested whether LLMs comply more with requests when they know they’re being trained versus unmonitored.https://x.com/AnthropicAI/status/1869427646368792599
We all know vending machines are automated, but what if we allowed an AI to run the entire business: setting prices, ordering inventory, responding to customer requests, and so on?
We mentioned this in the Claude 4 system card and are now sharing more detailed research and transcripts.
Claude Opus 4 and Sonnet 4 are hybrid models offering two modes: near-instant responses and extended thinking for deeper reasoning.