How to get URL link on X (Twitter) App
https://x.com/AnthropicAI/status/2001686747185394148?s=20
Current safeguards focus on training frontier models to refuse harmful requests.
We analyzed the internals of three open-weights AI models to map their “persona space,” and identified what we call the Assistant Axis, a pattern of neural activity that drives Assistant-like behavior.
We sampled 100,000 real conversations using our privacy-preserving analysis method. Then, Claude estimated the time savings with AI for each conversation.
We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states.
It’s called Petri: Parallel Exploration Tool for Risky Interactions. It uses automated agents to audit models across diverse scenarios.