🧵New Anthropic Fellows research: We studied mechanisms of "introspective awareness" in LLMs.
LLMs can sometimes detect steering vectors injected into their residual stream. But does this warrant the label "introspection," or is it attributable to some uninteresting confound?👇
We use the setup from Lindsey (2025): inject a steering vector, then ask the model: "Do you detect an injected thought? [detection] If so, what is the injected thought about? [identification]"
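For readers unfamiliar with the technique: a steering-vector injection amounts to adding a fixed direction to a layer's residual-stream activations during the forward pass. A minimal sketch, using a toy layer and hypothetical names in place of a real model (the thread doesn't show the actual injection code):

```python
# Minimal sketch of steering-vector injection via a forward hook.
# A toy block stands in for one transformer layer; names like
# `steering_vector` and `alpha` are illustrative, not from the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# Stand-in for a transformer layer whose output feeds the residual stream.
layer = nn.Sequential(
    nn.Linear(d_model, d_model),
    nn.GELU(),
    nn.Linear(d_model, d_model),
)

steering_vector = torch.randn(d_model)  # e.g. a concept direction
alpha = 4.0                              # injection strength

def inject(module, inputs, output):
    # Add the steering vector to every token position's activation.
    return output + alpha * steering_vector

handle = layer.register_forward_hook(inject)
x = torch.randn(2, 5, d_model)  # (batch, seq, d_model)
steered = layer(x)              # forward pass with injection
handle.remove()
clean = layer(x)                # same input, no injection

# The steered run differs from the clean run by exactly alpha * vector.
print(torch.allclose(steered - clean, alpha * steering_vector))  # → True
```

The detection/identification questions are then posed to the model while the hook is active, so the "injected thought" is present in its activations at answer time.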
Our experiments are on open-source 🤖: Gemma3-27B, OLMo-3.1-32B, and Qwen3-235B.