First preprint! Working with @patrickbutlin during @MATSprogram.
LLM Assistant personas like being helpful; evil personas like being harmful. We find that a single direction represents helping as good under the Assistant persona, and ‘harm’ as good under an evil persona.
We train a linear probe to predict pairwise task choices in Gemma-3-27B and Qwen-3.5-122B, and find that the resulting direction behaves like a preference vector.
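For readers who want a concrete picture of "a linear probe on pairwise choices", here is a minimal sketch. It is not the preprint's code. It assumes a Bradley-Terry style setup in which the probability of picking task A over task B is a sigmoid of a linear score difference, and it uses random synthetic vectors in place of real Gemma or Qwen activations. All names (`fit_pairwise_probe`, `make_synthetic`, `true_dir`) are hypothetical.

```python
import numpy as np


def fit_pairwise_probe(acts, pairs, chose_first, lr=0.1, steps=500, l2=1e-3):
    """Fit w so that sigmoid(w . (x_a - x_b)) predicts whether a was chosen over b.

    Assumption: logistic regression on activation differences, with no bias
    term so the probe is antisymmetric in (a, b).
    """
    diffs = acts[pairs[:, 0]] - acts[pairs[:, 1]]  # shape (n_pairs, d)
    y = chose_first.astype(float)
    w = np.zeros(acts.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diffs @ w)))      # predicted P(a chosen)
        grad = diffs.T @ (p - y) / len(y) + l2 * w  # mean log-loss gradient + L2
        w -= lr * grad
    return w


def make_synthetic(n_tasks=200, d=32, n_pairs=2000, seed=0):
    """Synthetic stand-in for per-task activations and observed choices."""
    rng = np.random.default_rng(seed)
    true_dir = rng.normal(size=d)
    true_dir /= np.linalg.norm(true_dir)
    acts = rng.normal(size=(n_tasks, d))
    pairs = rng.integers(0, n_tasks, size=(n_pairs, 2))
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]       # drop self-comparisons
    utility = acts @ true_dir                        # hidden "preference" score
    logits = 3.0 * (utility[pairs[:, 0]] - utility[pairs[:, 1]])
    chose_first = rng.random(len(pairs)) < 1.0 / (1.0 + np.exp(-logits))
    return acts, pairs, chose_first, true_dir


acts, pairs, chose_first, true_dir = make_synthetic()
w = fit_pairwise_probe(acts, pairs, chose_first)
direction = w / np.linalg.norm(w)

# How well does the recovered direction match the hidden one, and how well
# does projecting onto it predict the (training) choices?
cosine = float(direction @ true_dir)
scores = acts @ direction
accuracy = float(((scores[pairs[:, 0]] > scores[pairs[:, 1]]) == chose_first).mean())
print(f"cosine with hidden direction: {cosine:.3f}, choice accuracy: {accuracy:.3f}")
```

In this toy setting, projecting any task's vector onto `direction` gives a scalar that ranks tasks by how likely they are to be chosen, which is the sense in which a probe direction could behave like a preference vector.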