Latest Twitter Threads by @ch402 on Thread Reader App

Mar 27 • 11 tweets • 3 min read

A few reasons why I'm really excited about this project!

https://twitter.com/AnthropicAI/status/1905303835892990278

(1) After a long side quest, we're getting closer to circuits.

We had circuits on InceptionV1 and it was great (distill.pub/2020/circuits/).

But superposition made this not work in transformers.

With SAEs resolving this barrier, it's natural to return to circuits...

Jan 3 • 23 tweets • 5 min read

@kchonyc wrote a thoughtful note on the AI job market. I think he gets some important things right. Other things seem different from my perspective.

My own stream of consciousness inspired by @kchonyc's essay and the broader discourse.

https://twitter.com/kchonyc/status/1870563085796184131

As @kchonyc observes, the deep learning job market was in a very strange place circa 2014. Deep learning went from being very obscure in 2012 (probably <<100 people working on it seriously), to being very in demand.

May 21, 2024 • 14 tweets • 4 min read

I'm really excited about these results for many reasons, but the most important is that we're starting to connect mechanistic interpretability to questions about the safety of large language models.

https://twitter.com/AnthropicAI/status/1792935506587656625

I've been working on interpretability for more than a decade, significantly motivated by concerns about safety. But it's always been this aspirational goal – I could tell a story for how this work might someday help, but it was far off.

Oct 5, 2023 • 18 tweets • 5 min read

If you'd asked me a year ago, superposition would have been by far the reason I was most worried that mechanistic interpretability would hit a dead end.

I'm now very optimistic. I'd go as far as saying it's now primarily an engineering problem -- hard, but less fundamental risk.

https://twitter.com/AnthropicAI/status/1709986949711200722

Well trained sparse autoencoders (scale matters!) can decompose a one-layer model into very nice, interpretable features.
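
For readers who haven't met sparse autoencoders, here is a rough sketch of the forward pass and loss of one: encode an activation vector into non-negative features, decode it back, and penalize both reconstruction error and total feature activation. Everything in it (the function names, the tiny hand-picked weights, the `l1_coeff` value) is a hypothetical illustration, not the setup used in the linked work, and a real SAE would learn its weights by training rather than have them written by hand.

```python
# Sparse-autoencoder forward pass and loss, as an illustrative sketch.
# All names, sizes, and weights are hypothetical and hand-picked.

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(m, v):
    # Multiply matrix m (a list of rows) by vector v.
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Return (features, reconstruction) for one activation vector x."""
    features = relu([z + b for z, b in zip(matvec(w_enc, x), b_enc)])
    recon = [z + b for z, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    # Squared reconstruction error plus an L1 penalty that encourages
    # most features to be exactly zero on any given input.
    recon_err = sum((a - b) ** 2 for a, b in zip(x, recon))
    l1 = sum(abs(f) for f in features)
    return recon_err + l1_coeff * l1

# 2-dimensional activations, 3 dictionary features (more features than dims).
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # shape 3 x 2
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]     # shape 2 x 3
B_DEC = [0.0, 0.0]

x = [2.0, 0.0]
features, recon = sae_forward(x, W_ENC, B_ENC, W_DEC, B_DEC)
loss = sae_loss(x, features, recon, l1_coeff=0.1)
```

In this toy case only the first feature fires, the input is reconstructed exactly, and the whole loss comes from the sparsity penalty.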

Jun 7, 2023 • 14 tweets • 4 min read

One of the ideas I find most useful from @AnthropicAI's Core Views on AI Safety post (anthropic.com/index/core-vie…) is thinking in terms of a distribution over safety difficulty.

Here's a cartoon picture I like for thinking about it:

A lot of AI safety discourse focuses on very specific models of AI and AI safety. These are interesting, but I don't know how I could be confident in any one.

I prefer to accept that we're just very uncertain. One important axis of that uncertainty is roughly "difficulty".

May 18, 2023 • 5 tweets • 2 min read

Very cool to see more parallels between neuroscience and mechanistic interpretability!

There's a long history of us discovering analogs of long-known results in neuroscience -- it's exciting to see a bit of the reverse! (Also neat to see use of interpretability methods!)

https://twitter.com/AToliasLab/status/1659236747618680836

When we were working on vision models, we constantly found that features we were discovering, such as curve detectors (distill.pub/2020/circuits/…) and multimodal neurons (distill.pub/2021/multimoda…), had existing parallels in neuroscience.

Mar 23, 2023 • 10 tweets • 4 min read

High-Low frequency detectors found in biology!

Very cool to see mechanistic interpretability seeming to predict in advance the existence of biological neuroscience phenomena. 🔬

https://twitter.com/zeeweeding/status/1638257954398011394

(Caveat that I'm not super well read in neuroscience. When working on high-low frequency detectors we asked a number of neuroscientists if they'd seen such neurons before, and they hadn't. But very possible something was missed!)

Mar 9, 2023 • 8 tweets • 2 min read

In 1939, physicists discussed voluntarily adopting secrecy in atomic physics. Reading Rhodes' book, one can hear three "camps":

- Caution
- Scientific Humility
- Openness Idealism

I feel like these are the same camps I hear in discussion of AI (eg. open sourcing models).

I ultimately lean towards caution, but I empathize with all three of these stances.

Jan 5, 2023 • 13 tweets • 5 min read

The former Distill editor in me is quite pleased with how our experiment with invited comments on Transformer Circuits articles has been turning out! For context, several of our recent articles have had "invited comments" which were highly substantive, curated comments from engaged researchers.

Jan 5, 2023 • 12 tweets • 4 min read

I'm pretty excited about this as a step towards a mechanistic theory of memorization and overfitting.

https://twitter.com/AnthropicAI/status/1611045993516249088

Historically, I think I've often been tempted to think of overfit models as pathological things that need to be trained better in order for us to interpret.

For example, in this paper we found that overfitting seemed to cause less interpretable features: distill.pub/2020/understan…

Jan 1, 2023 • 9 tweets • 2 min read

An idea I've been toying with: The role of many toy models may be to serve as "counterexamples in mechanistic interpretability".

https://twitter.com/NeelNanda5/status/1609283649119322114

For background, there's a famous series of books in math titled "counterexamples in X" (eg. the famous "Counterexamples in Topology") which offer examples of mathematical objects with unusual properties.

Sep 27, 2022 • 25 tweets • 6 min read

A few people have asked me how our recent superposition paper (transformer-circuits.pub/2022/toy_model…) relates to classical ideas in neural coding / connectionism / distributed representations.

I wanted to share a few thoughts, although I'll caveat I'm very much not an expert on these topics! When I try to read foundational papers on these topics it feels to me like there are actually two different ideas sharing the term "distributed representations". I'll call them "composability" and "superposition".

(These turn out to be opposing properties in some ways!)

Sep 25, 2022 • 7 tweets • 3 min read

After noticing that tomatillos taste suspiciously like golden berries, I spent the last hour staring at phylogenetic trees of edible plants. It turns out that yes, they're related, and there are lots of other interesting relationships.

One nice overview: webdoc.agsci.colostate.edu/cropsforhealth…

This blog has a much more detailed discussion of edible plants in different clades botanistinthekitchen.blog/the-plant-food…

Sep 14, 2022 • 14 tweets • 5 min read

I've never had so many "this can't possibly be true, we must have a bug" results in the course of a research project before.

I'd like to take a moment to walk through some of the very strange (and surprisingly beautiful) things we found.

https://twitter.com/AnthropicAI/status/1570087876053942272

Firstly: Superposition by itself is utterly wild if you take it seriously.

It's basically saying that neural networks might be bigger -- potentially *much* bigger -- than their surface structure implies, simulating a larger sparser system.
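
To make "simulating a larger sparser system" concrete, here is a small hand-built sketch: five features squeezed into a two-dimensional hidden space by pointing them along the corners of a pentagon, then read back out with a ReLU and a negative bias. The names (`DIRECTIONS`, `BIAS`, `compress`, `readout`) and the choice of bias are assumptions made for illustration; this is not code from the paper, and the weights are set by hand rather than trained.

```python
import math

N_FEATURES = 5  # features we want to represent
# Each feature gets a unit direction in a 2-D hidden space (pentagon corners).
DIRECTIONS = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]
# Negative bias chosen to cancel the overlap between neighbouring directions.
BIAS = -math.cos(2 * math.pi / N_FEATURES)

def compress(x):
    """Project 5 feature values down to a 2-dimensional hidden vector."""
    hx = sum(v * d[0] for v, d in zip(x, DIRECTIONS))
    hy = sum(v * d[1] for v, d in zip(x, DIRECTIONS))
    return (hx, hy)

def readout(h):
    """Recover one value per feature: ReLU(direction . h + BIAS)."""
    return [max(0.0, h[0] * d[0] + h[1] * d[1] + BIAS) for d in DIRECTIONS]

def one_hot(k):
    return [1.0 if i == k else 0.0 for i in range(N_FEATURES)]

def active_feature(x):
    """Index of the feature with the largest readout."""
    out = readout(compress(x))
    return max(range(N_FEATURES), key=lambda i: out[i])

# With only one feature active at a time, every feature can be identified
# even though the hidden space has just two dimensions.
recovered = [active_feature(one_hot(k)) for k in range(N_FEATURES)]
```

This only works because the inputs are sparse: when several features are active at once, their readouts can interfere with each other.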

Sep 6, 2022 • 12 tweets • 3 min read

This feels a bit awkward, but since there's been so much debate on whether dating docs are a good idea, here's a quick update on how this has been going, one week later.

Summary: So far, 54 people have reached out to me to ask me on a date or offer to introduce me to someone.

https://twitter.com/ch402/status/1564631228166201345

(More on this below, but please don't let this thread discourage you from reaching out if you think we might be a good fit! I intend to respond to every sincere message I receive and if we aren't a fit, I'll try to think about whether I know someone who's a better match for you.)

Aug 30, 2022 • 16 tweets • 5 min read

Normal online dating seems pretty suboptimal. Recently, I’ve seen several people experiment with public “date me” docs – I think this is a really interesting experiment in alternatives, enabling long-form, earnest dating profiles.

So, I wrote my own: docs.google.com/document/d/1fs…

Before diving in – I normally try to tweet about research or things of broad interest, since I know people don’t follow me for my personal life. This thread is really important to me, but apologies for the topic change if it’s unwelcome.

Jun 27, 2022 • 10 tweets • 4 min read

After five years as a manager, I've decided I want to focus on research and research leadership and do less management. As a result, we're looking for a new manager to co-lead @AnthropicAI's mechanistic interpretability team with me!

https://twitter.com/AnthropicAI/status/1541468008249364481

Our team's goal is to reverse engineer neural networks into human understandable computer programs – see our work at transformer-circuits.pub and some previous work we're inspired by at distill.pub/2020/circuits/ .

Jun 4, 2022 • 21 tweets • 4 min read

The elegance of ML is the elegance of biology, not the elegance of math or physics.

Simple gradient descent creates mind-boggling structure and behavior, just as evolution creates the awe inspiring complexity of nature.

https://twitter.com/banburismus_/status/1532747777280593920

I used to really want ML to be about complex math and clever proofs. But I've gradually come to think this is really the wrong aesthetic to bring.

Jan 17, 2022 • 9 tweets • 2 min read

Hypothesis: It's easier to have integrity when those close to you cherish your *values* and the *process* that leads to your beliefs, rather than the beliefs, positions, and affiliations one holds. The truth is, part of me _really_ cares what my close friends think of me. (And when I'm dating someone, it's hard to articulate just how much I want to make them proud of me.)

This desire isn't a moral failure on my part, it's the nature of being human.

Jan 9, 2022 • 9 tweets • 3 min read

There's a pretty significant body of literature suggesting that men and women (at least at undergraduate age) are *systematically misinformed and overoptimistic* about fertility.

We need to talk about this more.

https://twitter.com/saykay/status/1480022995393413120

From the related work section of Habbema et al (ncbi.nlm.nih.gov/pmc/articles/P…):

Nov 25, 2021 • 17 tweets • 3 min read

Market segmentation (as I understand it) is a standard business thing where companies group potential buyers and target them differently.

It seems to me that market segmentation is often a moral issue as well as a business one, but people don't usually discuss it that way? (Caveat: I don't know much about business / economics / etc. Just random shower thoughts on something I'm not an expert on.)
