Chris Olah
Reverse engineering neural networks at @AnthropicAI. DMs open! Previously @distillpub, OpenAI Clarity Team, Google Brain. Personal account.
May 21 14 tweets 4 min read
I'm really excited about these results for many reasons, but the most important is that we're starting to connect mechanistic interpretability to questions about the safety of large language models. I've been working on interpretability for more than a decade, significantly motivated by concerns about safety. But it's always been this aspirational goal – I could tell a story for how this work might someday help, but it was far off.
Oct 5, 2023 18 tweets 5 min read
If you'd asked me a year ago, superposition would have been by far the reason I was most worried that mechanistic interpretability would hit a dead end.

I'm now very optimistic. I'd go as far as to say it's now primarily an engineering problem -- hard, but with less fundamental risk. Well-trained sparse autoencoders (scale matters!) can decompose a one-layer model into very nice, interpretable features.
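The decomposition idea can be sketched in a few lines. This is a minimal numpy illustration of a sparse autoencoder's forward pass only; the dimensions, tied weights, and random initialization are all made up, and a real SAE is trained with a reconstruction loss plus an L1 sparsity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 16, 64   # overcomplete: more features than dimensions

W_enc = rng.normal(size=(d_model, d_features)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()         # tied decoder weights, for simplicity only
b_enc = np.zeros(d_features)

def sae_forward(x):
    # Encode: non-negative (ReLU) feature activations, sparse after training
    f = np.maximum(x @ W_enc + b_enc, 0.0)
    # Decode: reconstruct the activation as a weighted sum of feature directions
    x_hat = f @ W_dec
    return f, x_hat

x = rng.normal(size=(d_model,))   # stand-in for a model activation vector
f, x_hat = sae_forward(x)
```

The key design choice is the overcomplete dictionary (`d_features > d_model`): the hope is that each of the many sparse features, not each dense neuron, corresponds to something interpretable.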
Jun 7, 2023 14 tweets 4 min read
One of the ideas I find most useful from @AnthropicAI's Core Views on AI Safety post (anthropic.com/index/core-vie…) is thinking in terms of a distribution over safety difficulty.

Here's a cartoon picture I like for thinking about it: [Image: "How Hard Is AI Safety?" graph] A lot of AI safety discourse focuses on very specific models of AI and AI safety. These are interesting, but I don't know how I could be confident in any one.

I prefer to accept that we're just very uncertain. One important axis of that uncertainty is roughly "difficulty".
May 18, 2023 5 tweets 2 min read
Very cool to see more parallels between neuroscience and mechanistic interpretability!

There's a long history of us discovering analogs of long-known results in neuroscience -- it's exciting to see a bit of the reverse! (Also neat to see use of interpretability methods!) When we were working on vision models, we constantly found that features we discovered, such as curve detectors (distill.pub/2020/circuits/…) and multimodal neurons (distill.pub/2021/multimoda…), had existing parallels in neuroscience.
Mar 23, 2023 10 tweets 4 min read
High-Low frequency detectors found in biology!

Very cool to see mechanistic interpretability seeming to predict in advance the existence of biological neuroscience phenomena. 🔬 (Caveat that I'm not super well read in neuroscience. When working on high-low frequency detectors we asked a number of neuroscientists if they'd seen such neurons before, and they hadn't. But very possible something was missed!)
Mar 9, 2023 8 tweets 2 min read
In 1939, physicists discussed voluntarily adopting secrecy in atomic physics. Reading Rhodes' book, one can hear three "camps":

- Caution
- Scientific Humility
- Openness Idealism

I feel like these are the same camps I hear in discussion of AI (eg. open sourcing models). I ultimately lean towards caution, but I empathize with all three of these stances.
Jan 5, 2023 13 tweets 5 min read
The former Distill editor in me is quite pleased with how our experiment with invited comments on Transformer Circuits articles has been turning out! For context, several of our recent articles have had "invited comments" which were highly substantive, curated comments from engaged researchers.
Jan 5, 2023 12 tweets 4 min read
I'm pretty excited about this as a step towards a mechanistic theory of memorization and overfitting. Historically, I think I've often been tempted to think of overfit models as pathological things that need to be trained better in order for us to interpret.

For example, in this paper we found that overfitting seemed to cause less interpretable features: distill.pub/2020/understan…
Jan 1, 2023 9 tweets 2 min read
An idea I've been toying with: The role of many toy models may be to serve as "counterexamples in mechanistic interpretability". For background, there's a famous series of books in math titled "counterexamples in X" (eg. the famous "Counterexamples in Topology") which offer examples of mathematical objects with unusual properties.
Sep 27, 2022 25 tweets 6 min read
A few people have asked me how our recent superposition paper (transformer-circuits.pub/2022/toy_model…) relates to classical ideas in neural coding / connectionism / distributed representations.

I wanted to share a few thoughts, although I'll caveat I'm very much not an expert on these topics! When I try to read foundational papers on these topics it feels to me like there are actually two different ideas sharing the term "distributed representations". I'll call them "composability" and "superposition".

(These turn out to be opposing properties in some ways!)
Sep 25, 2022 7 tweets 3 min read
After noticing that tomatillos taste suspiciously like golden berries, I spent the last hour staring at phylogenetic trees of edible plants. It turns out that yes, they're related, and there are lots of other interesting relationships.

One nice overview: webdoc.agsci.colostate.edu/cropsforhealth… This blog has a much more detailed discussion of edible plants in different clades: botanistinthekitchen.blog/the-plant-food…
Sep 14, 2022 14 tweets 5 min read
I've never had so many "this can't possibly be true, we must have a bug" results in the course of a research project before.

I'd like to take a moment to walk through some of the very strange (and surprisingly beautiful) things we found. Firstly: Superposition by itself is utterly wild if you take it seriously.

It's basically saying that neural networks might be bigger -- potentially *much* bigger -- than their surface structure implies, simulating a larger sparser system.
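That "bigger than the surface structure implies" picture can be made concrete with a toy numpy sketch: many sparse features squeezed into a lower-dimensional space along nearly-orthogonal random directions. All the numbers here are made up for illustration, and this is a cartoon of the idea, not the setup from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 100, 30   # many more "features" than dimensions

# Random unit-norm feature directions in the smaller space
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only two features active at once
x = np.zeros(n_features)
x[[3, 11]] = 1.0

h = x @ W        # compressed d_model-dimensional representation
x_rec = h @ W.T  # naive linear readout of all n_features features

# Active features read out near 1; inactive features pick up only small
# interference terms, because random high-dim directions are nearly orthogonal
active = x_rec[[3, 11]]
interference = np.abs(np.delete(x_rec, [3, 11]))
```

The trick only works because the input is sparse: with few features active at a time, the interference terms stay small, which is why a d-dimensional network can behave like a simulation of a much larger, sparser one.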
Sep 6, 2022 12 tweets 3 min read
This feels a bit awkward, but since there's been so much debate on whether dating docs are a good idea, here's a quick update on how this has been going, one week later.

Summary: So far, 54 people have reached out to me to ask me on a date or offer to introduce me to someone. (More on this below, but please don't let this thread discourage you from reaching out if you think we might be a good fit! I intend to respond to every sincere message I receive and if we aren't a fit, I'll try to think about whether I know someone who's a better match for you.)
Aug 30, 2022 16 tweets 5 min read
Normal online dating seems pretty suboptimal. Recently, I’ve seen several people experiment with public “date me” docs – I think this is a really interesting experiment in alternatives, enabling long-form, earnest dating profiles.

So, I wrote my own: docs.google.com/document/d/1fs… Before diving in – I normally try to tweet about research or things of broad interest, since I know people don’t follow me for my personal life. This thread is really important to me, but apologies for the topic change if it’s unwelcome.
Jun 27, 2022 10 tweets 4 min read
After five years as a manager, I've decided I want to focus on research and research leadership and do less management. As a result, we're looking for a new manager to co-lead @AnthropicAI's mechanistic interpretability team with me! Our team's goal is to reverse engineer neural networks into human understandable computer programs – see our work at transformer-circuits.pub and some previous work we're inspired by at distill.pub/2020/circuits/ .
Jun 4, 2022 21 tweets 4 min read
The elegance of ML is the elegance of biology, not the elegance of math or physics.

Simple gradient descent creates mind-boggling structure and behavior, just as evolution creates the awe-inspiring complexity of nature. I used to really want ML to be about complex math and clever proofs. But I've gradually come to think this is really the wrong aesthetic to bring.
Jan 17, 2022 9 tweets 2 min read
Hypothesis: It's easier to have integrity when those close to you cherish your *values* and the *process* that leads to your beliefs, rather than the beliefs, positions, and affiliations you hold. The truth is, part of me _really_ cares what my close friends think of me. (And when I'm dating someone, it's hard to articulate just how much I want to make them proud of me.)

This desire isn't a moral failure on my part, it's the nature of being human.
Jan 9, 2022 9 tweets 3 min read
There's a pretty significant body of literature suggesting that men and women (at least at undergraduate age) are *systematically misinformed and overoptimistic* about fertility.

We need to talk about this more. From the related work section of Habbema et al. (ncbi.nlm.nih.gov/pmc/articles/P…):
Nov 25, 2021 17 tweets 3 min read
Market segmentation (as I understand it) is a standard business thing where companies group potential buyers and target them differently.

It seems to me that market segmentation is often a moral issue as well as a business one, but people don't usually discuss it that way? (Caveat: I don't know much about business / economics / etc. Just random shower thoughts on something I'm not an expert on.)
Sep 21, 2021 4 tweets 1 min read
Tristan Needham's "Visual Complex Analysis" is probably the best piece of technical exposition I've ever read.

I just learned that he published a new book last year, taking a visual approach to differential geometry. I'm so excited to receive my copy! Visual Complex Analysis literally changed the course of my life in several ways. It got me interested in visual explanations, and set an aspirational goal for me.
Sep 7, 2021 5 tweets 3 min read
People often remark on the quality of my diagrams and visualizations. Well, here are some diagrams I made literally a decade ago! And I was trying really hard for that post! Everyone starts as a beginner. From this post (please keep in mind I was a teenager): christopherolah.wordpress.com/2011/08/08/the…

The 3D renderings were cool!

There are likely errors, but it was probably the first essay I wrote that went from "completely terrible" to "some glimmers of not terrible".