"celeste from celeste-land"
interp @ MATS 10.0 (@NeelNanda5 stream)
vegan ea computerwoman and prolific blogcel
Jun 4 • 15 tweets • 5 min read
New research from @japhba and me!
Activation Oracles are a pretty cool interpretability tool. They answer natural language questions about activations, but they suffer from vagueness and hallucinations. Can AO training be improved?
Turns out: Yes! We identify four fixes that make AOs substantially more useful!
An Activation Oracle is a finetuned LLM that takes another model's activations as input and answers natural language questions about them. Great for open-ended questions like "why is the model about to backtrack?"
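To make "takes another model's activations as input" concrete, here's a rough sketch of one plausible way to wire it up: overwrite the embeddings at a few placeholder positions in the oracle's prompt with activation vectors from the target model. This is an assumption for illustration, not something this thread confirms. `inject_activations`, the placeholder positions, and the shapes are all hypothetical.

```python
import numpy as np

def inject_activations(prompt_embeds, activations, placeholder_positions):
    """Hypothetical helper: return a copy of the oracle's prompt embeddings
    with the rows at placeholder_positions replaced by target-model activations.

    prompt_embeds:         (seq_len, d_model) embeddings of the oracle prompt
    activations:           (n, d_model) vectors read from the target model
    placeholder_positions: n indices into the prompt reserved for activations
    """
    prompt_embeds = np.asarray(prompt_embeds, dtype=float)
    activations = np.asarray(activations, dtype=float)
    if activations.shape[0] != len(placeholder_positions):
        raise ValueError("need exactly one activation vector per placeholder")
    if activations.shape[1] != prompt_embeds.shape[1]:
        raise ValueError("activation width must match the embedding width")
    out = prompt_embeds.copy()  # leave the caller's array untouched
    out[list(placeholder_positions)] = activations
    return out

# Toy usage: a 5-token prompt where positions 1 and 3 are placeholders.
prompt = np.zeros((5, 4))
acts = np.ones((2, 4))
mixed = inject_activations(prompt, acts, [1, 3])
```

The oracle would then run on `mixed` and generate its answer to the question as usual. A real setup might add or rescale the vectors, or inject them at a later layer, instead of overwriting embeddings.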
The problem is that the original Activation Oracles hallucinate often, give vague answers, and are hard to eval.