Subhash Kantamneni
@AnthropicAI. prev @mit Tegmark group. mech interp & alignment
Feb 6 · 7 tweets · 3 min read
We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language.

We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵

We trained our AOs to give holistic descriptions of activations (rather than answer specific questions).

For example, Opus solves this math problem correctly, then switches to an incorrect answer. Why? Our AO shows Opus recognized the problem and had memorized an (incorrect!) answer.
Feb 5, 2025 · 15 tweets · 6 min read
(1/N) LLMs represent numbers on a helix? And use trigonometry to do addition? Answers below 🧵

(2/N) TL;DR: We aim to reverse engineer LLM addition. We find that three mid-sized LLMs represent numbers on a helix and use these helices to do addition, i.e. computing a+b by combining cos(a) and cos(b) into cos(a+b). We call this the Clock algorithm, because it's like adding angles on a clock!
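The angle-addition idea above can be sketched in a few lines. This is a hedged illustration, not the paper's reverse-engineered circuit: the periods chosen and the nearest-neighbor readout are assumptions for demonstration; the paper's helix also includes a linear component omitted here.

```python
import numpy as np

# Assumed example periods; circles of several periods jointly encode a number.
PERIODS = [10, 100]

def embed(n):
    # Represent n as a (cos, sin) pair on one circle per period.
    return np.array([[np.cos(2 * np.pi * n / T), np.sin(2 * np.pi * n / T)]
                     for T in PERIODS])

def clock_add(ea, eb):
    # "Clock" addition: rotating the point for a by the angle for b lands on
    # the point for a+b, via the trig identities
    #   cos(x+y) = cos x cos y - sin x sin y
    #   sin(x+y) = sin x cos y + cos x sin y
    out = np.empty_like(ea)
    for i in range(len(PERIODS)):
        ca, sa = ea[i]
        cb, sb = eb[i]
        out[i] = [ca * cb - sa * sb, sa * cb + ca * sb]
    return out

def decode(e, max_n=100):
    # Nearest-embedding readout over candidate answers.
    candidates = np.arange(max_n)
    dists = [np.linalg.norm(embed(n) - e) for n in candidates]
    return int(candidates[np.argmin(dists)])

print(decode(clock_add(embed(27), embed(36))))  # -> 63
```

Note that no number is ever added directly: the sum appears purely from rotating points on circles, which is the sense in which the model "uses trigonometry to do addition."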
May 28, 2024 6 tweets 3 min read
How do transformers do physics? Do they use interpretable human methods or invent “alien physics” that’s hard to decipher?

We find evidence for the former, showing that transformers model the simple harmonic oscillator (SHO) with known numerical methods! 🧵 @ZimingLiu11 @tegmark

(1/N) While human physicists prefer to model systems with analytical solutions, we hypothesize transformers use numerical methods to model the SHO.
To show evidence that a given method is used, we apply criteria based on whether the model internally encodes that method's relevant intermediate quantities.
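To make "known numerical methods" concrete, here is one standard scheme for the undamped SHO x'' = -w²x; this is a hedged illustration of the kind of method meant, not a claim about which method the paper identifies. Writing the state as s = (x, v), the exact one-step update is s_{t+1} = exp(A·dt)·s_t with A = [[0, 1], [-w², 0]], a rotation-like linear map whose intermediate quantities a model could plausibly encode.

```python
import numpy as np

w, dt = 2.0, 0.01  # assumed example frequency and step size

def step(x, v):
    # Closed form of exp(A*dt) for A = [[0, 1], [-w^2, 0]]:
    # a scaled rotation of the (x, v) state.
    c, s = np.cos(w * dt), np.sin(w * dt)
    return c * x + (s / w) * v, -w * s * x + c * v

# Simulate roughly one period and check the oscillator returns near its start.
x, v = 1.0, 0.0
n = int(round(2 * np.pi / (w * dt)))
for _ in range(n):
    x, v = step(x, v)
print(round(x, 3))  # ≈ 1.0
```

The criterion in the tweet would then ask: does the network's residual stream linearly encode quantities like cos(w·dt) and sin(w·dt) (or the matrix entries of the update) while predicting the trajectory?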