Oscar Gilg
Research fellow @ @MATSprogram. Prev Trader @ Optiver, Oxford Maths&CS 2023. 🇫🇷🇬🇧
May 18
First preprint! Working with @patrickbutlin during @MATSprogram.
LLM Assistant personas like being helpful; evil personas like being harmful. We found that a single direction represents helping as good under the Assistant persona and harming as good under the evil persona.

We train a linear probe to predict pairwise task choices in Gemma-3-27B and Qwen-3.5-122B, and find that the resulting direction behaves like a preference vector.
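
For readers who want a concrete picture of a pairwise-choice probe, here is a minimal sketch. The thread does not say how the probe is trained, so everything here is an assumption: a logistic (Bradley-Terry style) objective on the difference between two tasks' activations, synthetic random vectors standing in for real model activations, and hypothetical names (`train_pairwise_probe`, `make_synthetic_data`). It is an illustration of the general idea, not the preprint's method.

```python
import numpy as np


def train_pairwise_probe(acts_a, acts_b, chose_a, lr=0.1, steps=500):
    """Fit w so that sigmoid(w . (a - b)) estimates P(task A chosen over task B).

    Assumed objective: plain logistic regression on activation differences,
    trained with full-batch gradient descent.
    """
    diff = acts_a - acts_b
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))         # predicted P(A chosen)
        grad = diff.T @ (p - chose_a) / len(chose_a)  # gradient of mean log-loss
        w -= lr * grad
    return w


def make_synthetic_data(n_pairs=2000, dim=32, seed=0):
    """Synthetic stand-in for activations: choices follow a hidden direction."""
    rng = np.random.default_rng(seed)
    true_dir = rng.normal(size=dim)
    true_dir /= np.linalg.norm(true_dir)
    acts_a = rng.normal(size=(n_pairs, dim))
    acts_b = rng.normal(size=(n_pairs, dim))
    logits = 4.0 * ((acts_a - acts_b) @ true_dir)
    chose_a = (rng.random(n_pairs) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
    return acts_a, acts_b, chose_a, true_dir


acts_a, acts_b, chose_a, true_dir = make_synthetic_data()
w = train_pairwise_probe(acts_a, acts_b, chose_a)

# How well the learned direction aligns with the hidden one, and how often
# the probe's sign agrees with the recorded choice on the training pairs.
cosine = float(w @ true_dir / np.linalg.norm(w))
accuracy = float(np.mean(((acts_a - acts_b) @ w > 0) == (chose_a == 1.0)))
print(f"cosine={cosine:.3f} accuracy={accuracy:.3f}")
```

With real models, `acts_a` and `acts_b` would be replaced by activations extracted for each task in a pair, and the learned `w` would be the candidate preference direction.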