🧵New Anthropic Fellows research: We studied mechanisms of "introspective awareness" in LLMs.
LLMs can sometimes detect steering vectors injected into their residual stream. But does this warrant the label "introspection," or is it attributable to some uninteresting confound?👇
We use the setup from Lindsey (2025): inject a steering vector, then ask the model: "Do you detect an injected thought? [detection] If so, what is the injected thought about? [identification]"
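For readers unfamiliar with the technique: a steering-vector injection amounts to adding a fixed direction to a layer's residual-stream activations during the forward pass. A minimal sketch, using a toy layer and hypothetical names in place of a real model (the thread doesn't show the actual injection code):

```python
# Minimal sketch of steering-vector injection via a forward hook.
# A toy block stands in for one transformer layer; names like
# `steering_vector` and `alpha` are illustrative, not from the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# Stand-in for a transformer layer whose output feeds the residual stream.
layer = nn.Sequential(
    nn.Linear(d_model, d_model),
    nn.GELU(),
    nn.Linear(d_model, d_model),
)

steering_vector = torch.randn(d_model)  # e.g. a concept direction
alpha = 4.0                              # injection strength

def inject(module, inputs, output):
    # Add the steering vector to every token position's activation.
    return output + alpha * steering_vector

handle = layer.register_forward_hook(inject)
x = torch.randn(2, 5, d_model)  # (batch, seq, d_model)
steered = layer(x)              # forward pass with injection
handle.remove()
clean = layer(x)                # same input, no injection

# The steered run differs from the clean run by exactly alpha * vector.
print(torch.allclose(steered - clean, alpha * steering_vector))  # → True
```

The detection/identification questions are then posed to the model while the hook is active, so the "injected thought" is present in its activations at answer time.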
Our experiments are on open-source 🤖: Gemma3-27B, OLMo-3.1-32B, and Qwen3-235B.