Subhash Kantamneni
@AnthropicAI. prev @mit Tegmark group. mech interp & alignment
Feb 6 · 7 tweets · 3 min read
We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language.

We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵

We trained our AOs to give holistic descriptions of activations (rather than answer specific questions).

For example, Opus solves this math problem correctly, then switches to an incorrect answer. Why? Our AO shows Opus recognized the problem and had memorized an (incorrect!) answer.
Feb 5, 2025 · 15 tweets · 6 min read
(1/N) LLMs represent numbers on a helix? And use trigonometry to do addition? Answers below 🧵

(2/N) TL;DR: We aim to reverse engineer LLM addition. We find that three mid-sized LLMs represent numbers on a helix and use these helices to do addition, i.e. computing a+b by combining cos(a) and cos(b) into cos(a+b). We call this the Clock algorithm, because it's like adding angles on a clock!
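The angle-addition idea above can be sketched in a few lines. This is a hedged illustration, not the paper's reverse-engineered circuit: the periods chosen and the nearest-neighbor readout are assumptions for demonstration; the paper's helix also includes a linear component omitted here.

```python
import numpy as np

# Assumed example periods; circles of several periods jointly encode a number.
PERIODS = [10, 100]

def embed(n):
    # Represent n as a (cos, sin) pair on one circle per period.
    return np.array([[np.cos(2 * np.pi * n / T), np.sin(2 * np.pi * n / T)]
                     for T in PERIODS])

def clock_add(ea, eb):
    # "Clock" addition: rotating the point for a by the angle for b lands on
    # the point for a+b, via the trig identities
    #   cos(x+y) = cos x cos y - sin x sin y
    #   sin(x+y) = sin x cos y + cos x sin y
    out = np.empty_like(ea)
    for i in range(len(PERIODS)):
        ca, sa = ea[i]
        cb, sb = eb[i]
        out[i] = [ca * cb - sa * sb, sa * cb + ca * sb]
    return out

def decode(e, max_n=100):
    # Nearest-embedding readout over candidate answers.
    candidates = np.arange(max_n)
    dists = [np.linalg.norm(embed(n) - e) for n in candidates]
    return int(candidates[np.argmin(dists)])

print(decode(clock_add(embed(27), embed(36))))  # -> 63
```

Note that no number is ever added directly: the sum appears purely from rotating points on circles, which is the sense in which the model "uses trigonometry to do addition."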
May 28, 2024 6 tweets 3 min read
How do transformers do physics? Do they use interpretable human methods or invent “alien physics” that’s hard to decipher?

We find evidence for the former, showing that transformers model the simple harmonic oscillator (SHO) with known numerical methods! 🧵 @ZimingLiu11 @tegmark

(1/N) While human physicists prefer to model systems with analytical solutions, we hypothesize transformers use numerical methods to model the SHO.
To show evidence that a given method is used, we apply criteria based on whether the model internally encodes that method's relevant intermediate quantities.
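To make "known numerical methods" concrete, here is one standard scheme for the undamped SHO x'' = -w²x; this is a hedged illustration of the kind of method meant, not a claim about which method the paper identifies. Writing the state as s = (x, v), the exact one-step update is s_{t+1} = exp(A·dt)·s_t with A = [[0, 1], [-w², 0]], a rotation-like linear map whose intermediate quantities a model could plausibly encode.

```python
import numpy as np

w, dt = 2.0, 0.01  # assumed example frequency and step size

def step(x, v):
    # Closed form of exp(A*dt) for A = [[0, 1], [-w^2, 0]]:
    # a scaled rotation of the (x, v) state.
    c, s = np.cos(w * dt), np.sin(w * dt)
    return c * x + (s / w) * v, -w * s * x + c * v

# Simulate roughly one period and check the oscillator returns near its start.
x, v = 1.0, 0.0
n = int(round(2 * np.pi / (w * dt)))
for _ in range(n):
    x, v = step(x, v)
print(round(x, 3))  # ≈ 1.0
```

The criterion in the tweet would then ask: does the network's residual stream linearly encode quantities like cos(w·dt) and sin(w·dt) (or the matrix entries of the update) while predicting the trajectory?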