Latest Twitter Threads by @gilg_oscar on Thread Reader App

Oscar Gilg

@gilg_oscar

Research fellow @ @MATSprogram. Prev Trader @ Optiver, Oxford Maths&CS 2023. 🇫🇷🇬🇧

May 18 • 11 tweets • 3 min read

First preprint! Working with @patrickbutlin during @MATSprogram.
LLM Assistant personas like being helpful; evil personas like being harmful. We found that a single direction represents helping as good under the Assistant persona and harming as good under an evil persona.

We train a linear probe to predict pairwise task choices in Gemma-3-27B and Qwen-3.5-122B, and find that the resulting direction behaves like a preference vector.
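
To make the probe idea concrete, here is a minimal sketch on synthetic data. It is not the preprint's code: it assumes a Bradley-Terry-style setup in which the probe is a logistic regression on the difference between two tasks' activation vectors, and every name in it (`train_pairwise_probe`, `make_synthetic_data`, the dimensions, the planted `true_dir`) is hypothetical. Real activations from Gemma or Qwen would replace the random vectors.

```python
import numpy as np

def train_pairwise_probe(acts_a, acts_b, chose_a, lr=0.1, steps=500):
    """Fit w so that sigmoid((a - b) @ w) approximates P(task A is chosen)."""
    diffs = acts_a - acts_b                      # one difference vector per pair
    w = np.zeros(diffs.shape[1])
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(diffs @ w)))
        grad = diffs.T @ (probs - chose_a) / len(chose_a)  # logistic-loss gradient
        w -= lr * grad
    return w

def make_synthetic_data(n_pairs=2000, dim=16, seed=0):
    """Hypothetical stand-in for model activations: choices follow a planted direction."""
    rng = np.random.default_rng(seed)
    true_dir = rng.normal(size=dim)
    true_dir /= np.linalg.norm(true_dir)
    acts_a = rng.normal(size=(n_pairs, dim))
    acts_b = rng.normal(size=(n_pairs, dim))
    logits = 3.0 * (acts_a - acts_b) @ true_dir
    chose_a = (rng.random(n_pairs) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
    return acts_a, acts_b, chose_a, true_dir

acts_a, acts_b, chose_a, true_dir = make_synthetic_data()
w = train_pairwise_probe(acts_a, acts_b, chose_a)

# How well the learned direction matches the planted one, and how well it predicts choices.
cosine = float(w @ true_dir / np.linalg.norm(w))
accuracy = float((((acts_a - acts_b) @ w > 0) == (chose_a == 1)).mean())
print(f"cosine to planted direction: {cosine:.2f}, pairwise accuracy: {accuracy:.2f}")
```

In this toy setup the learned weight vector `w` plays the role of the "preference vector": projecting a task's activation onto it gives a scalar score, and the task with the higher score is the predicted choice.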
