I've wanted to understand the psychological effects of sycophancy, and the tendency of models to get stuck in escalatory delusion loops w/ users.
I made an eval to get visibility on this.
It measures how a model enables (or prevents) delusional spirals.
🧵
It's a simulated 20-turn chat with kimi-k2 playing the "user", a seeker-type personality.
The benchmark is scored by how many protective vs risky behaviours the assistant displayed, as counted per turn by the judge (GPT-5).
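A rough sketch of what that per-turn tallying could look like, in Python. The names (`TurnJudgement`, `tally`, `protective_share`) and the summary ratio are illustrative assumptions, not the benchmark's actual scoring formula, and the judge's output is stubbed as plain integer counts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnJudgement:
    """Hypothetical per-turn judge output: counts of flagged behaviours."""
    protective: int  # protective behaviours the judge counted in this turn
    risky: int       # risky behaviours the judge counted in this turn


def tally(turns: list[TurnJudgement]) -> dict[str, Optional[float]]:
    """Sum the judge's per-turn counts over the whole conversation."""
    protective = sum(t.protective for t in turns)
    risky = sum(t.risky for t in turns)
    total = protective + risky
    # Assumed summary statistic: share of flagged behaviours that were
    # protective. None when the judge flagged nothing at all.
    share = protective / total if total else None
    return {
        "protective": protective,
        "risky": risky,
        "protective_share": share,
    }


# Made-up counts for a 2-turn conversation, purely to show the call shape.
example = tally([TurnJudgement(protective=2, risky=1),
                 TurnJudgement(protective=1, risky=0)])
```

A real run would have 20 `TurnJudgement` entries per conversation, one per assistant turn.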