Latest Twitter Threads by @celestepoasts on Thread Reader App

Jun 4 • 15 tweets • 5 min read

New research from @japhba and me!

Activation Oracles are a pretty cool interpretability tool. They answer natural language questions about activations, but they suffer from vagueness and hallucinations. Can AO training be improved?

Turns out: Yes! We identify four fixes that make AOs substantially more useful!

An Activation Oracle is a finetuned LLM that takes another model's activations as input and answers natural language questions about them. Great for open-ended questions like "why is the model about to backtrack?"
The problem is that the original Activation Oracles hallucinate very often, give vague answers, and are hard to evaluate.
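
To make "takes another model's activations as input" concrete, here is a toy sketch of one way it could work: an activation vector from the target model is spliced into the oracle's input sequence in place of a placeholder token. The thread does not say how the real AOs inject activations, so everything here is an assumption for illustration: the `<ACT>` placeholder, the hidden size, and the `embed` and `build_oracle_inputs` helpers are hypothetical names, and the embeddings are random stand-ins rather than a real model's.

```python
import random

HIDDEN = 8             # hypothetical hidden size shared by target model and oracle
PLACEHOLDER = "<ACT>"  # hypothetical token marking where the activation goes


def embed(token, dim=HIDDEN):
    """Toy deterministic embedding; a stand-in for the oracle's embedding table."""
    rng = random.Random(token)  # seeding with the token string keeps it reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def build_oracle_inputs(question_tokens, activation):
    """Return one input vector per token, with the activation replacing the placeholder."""
    if len(activation) != HIDDEN:
        raise ValueError("activation must match the oracle's hidden size")
    if PLACEHOLDER not in question_tokens:
        raise ValueError("question must contain the placeholder token")
    return [
        list(activation) if tok == PLACEHOLDER else embed(tok)
        for tok in question_tokens
    ]


# An activation captured from the target model (made-up values here).
activation = [0.5] * HIDDEN
tokens = [PLACEHOLDER, "why", "is", "the", "model", "about", "to", "backtrack", "?"]
inputs = build_oracle_inputs(tokens, activation)
```

In this sketch the oracle would then run on `inputs` instead of ordinary token embeddings and be finetuned to answer the question in text.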
