Emmanuel Ameisen
Interpretability/Finetuning @AnthropicAI Previously: Staff ML Engineer @stripe, Wrote BMLPA by @OReillyMedia, Head of AI at @InsightFellows, ML @Zipcar
Feb 5 · 5 tweets · 2 min read
We just shipped Claude Opus 4.6!

I’m also excited to share that for the first time, we used circuit tracing as part of the model's safety audit!

We studied why the model sometimes misrepresents the results of tool calls.

Features for deception were active over the transcript. Was the model intentionally being deceptive?

The circuit offers a simpler explanation: While calling the tool, the model precomputes the correct answer “in its head”.

Then, it attends to that rather than the tool output.
Oct 21, 2025 · 10 tweets · 4 min read
How does an LLM compare two numbers? We studied this in a common counting task, and were surprised to learn that the algorithm it used was:

Put each number on a helix, and then twist one helix to compare it to the other.

Not your first guess? Not ours either. 🧵

The task we study is knowing when to break the line in fixed-width text.

We chose it for two reasons:
While the task is unconscious for humans (you just see when you're out of room), models don't have eyes; they only see tokens.
It is so common that models like Claude are very good at it.
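
To make the geometry concrete, here is a minimal numpy sketch of the helix comparison idea, under heavy simplifying assumptions: a single circle instead of a full helix, an arbitrary period, and a hand-written sign readout. It illustrates the trick, not the actual circuit we traced in Claude.

```python
# Toy sketch of comparing two counts by placing them on a circle and
# "twisting" one onto the other. All constants here are illustrative.
import numpy as np

T = 256  # assumed period, larger than any count we care about

def encode(n: int) -> np.ndarray:
    """Place a count on a circle (one loop of the helix)."""
    theta = 2 * np.pi * n / T
    return np.array([np.cos(theta), np.sin(theta)])

def rotate(v: np.ndarray, n: int) -> np.ndarray:
    """Twist a representation backwards by n steps with a rotation matrix."""
    theta = -2 * np.pi * n / T
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ v

def is_greater(a: int, b: int) -> bool:
    """Rotate a's point by b so it encodes a - b, then read the sign of the angle."""
    diff = rotate(encode(a), b)
    return np.arctan2(diff[1], diff[0]) > 0  # holds when |a - b| < T / 2

print(is_greater(80, 73), is_greater(12, 40))  # True False
```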
Jul 31, 2025 · 8 tweets · 4 min read
Earlier this year, we showed a method to interpret the intermediate steps a model takes to produce an answer.

But we were missing a key bit of information: explaining why the model attends to specific concepts.

Today, we do just that 🧵

A key component of transformers is attention, which directs the flow of information from one token to another and connects features.

In this work, we explain attention patterns by decomposing them into a list of feature/feature interactions.

We find neat things.
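
As a toy illustration of what "decomposing an attention score into feature/feature interactions" means, here is a numpy sketch. The feature vectors, W_Q/W_K matrices, and sizes are random stand-ins, not anything extracted from a real model.

```python
# Toy numpy sketch: one attention logit split exactly into per-(feature, feature)
# contributions. Everything here is a random stand-in, not real model weights.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4

# Pretend each residual stream is a sum of a few active feature vectors.
query_feats = rng.normal(size=(3, d_model))  # 3 features active at the query token
key_feats = rng.normal(size=(2, d_model))    # 2 features active at the key token
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))

q = query_feats.sum(axis=0) @ W_Q
k = key_feats.sum(axis=0) @ W_K
logit = q @ k / np.sqrt(d_head)

# The score is bilinear, so it splits exactly into feature-pair terms.
pair_contrib = (query_feats @ W_Q) @ (key_feats @ W_K).T / np.sqrt(d_head)

assert np.isclose(pair_contrib.sum(), logit)
print(pair_contrib)  # rows: query-side features, columns: key-side features
```

The decomposition is exact because the attention logit is bilinear in the query-side and key-side residual streams.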
May 29, 2025 · 6 tweets · 3 min read
The methods we used to trace the thoughts of Claude are now open to the public!

Today, we are releasing a library which lets anyone generate graphs that show the internal reasoning steps a model used to arrive at an answer.

The initial release lets you generate graphs for small open-weights models. You can just type a prompt and see an explanation of the key steps involved in generating the next token!

Try it on Gemma-2-2B; it only takes a few seconds.

neuronpedia.org/gemma-2-2b/gra…
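
For intuition about what such a graph contains, here is a hand-written toy in Python: nodes for input tokens, internal features, and output logits, with weighted edges for direct effects. The node names and weights below are invented for illustration (loosely echoing the Dallas → Austin example from our earlier work), and this is not the library's actual data format or API.

```python
# Toy, hand-written illustration of an attribution graph: nodes are tokens,
# features, and logits; edges are direct effects. Names and weights are made up.
from dataclasses import dataclass

@dataclass
class Edge:
    src: str
    dst: str
    weight: float  # strength of the direct effect

graph = [
    Edge("token: 'Dallas'", "feature: Texas", 0.9),
    Edge("token: 'capital'", "feature: say-a-capital", 0.8),
    Edge("feature: Texas", "logit: 'Austin'", 0.7),
    Edge("feature: say-a-capital", "logit: 'Austin'", 0.6),
]

for e in sorted(graph, key=lambda e: -e.weight):
    print(f"{e.src:28s} -> {e.dst:22s} ({e.weight:+.2f})")
```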
May 21, 2024 · 9 tweets · 4 min read
Today, we announced that we’ve gotten dictionary learning working on Sonnet, extracting millions of features from one of the best models in the world.

This is the first time this has been successfully done on a frontier model.

I wanted to share some highlights 🧵

For context, the goal of dictionary learning is to untangle the activations inside the neurons of an LLM into a small set of interpretable features.

We can then look at these features to inspect what is happening inside the model as it processes a given context.
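
For a sense of the method's shape, here is a minimal sparse-autoencoder sketch of dictionary learning in PyTorch. The dictionary size, L1 coefficient, and training setup are illustrative assumptions, not the configuration used for Sonnet.

```python
# Minimal sparse autoencoder: reconstruct activations through a wide, sparse
# feature layer. All sizes and coefficients are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))  # sparse, interpretable features
        recon = self.decoder(feats)             # reconstruction of the activations
        return feats, recon

d_model, n_features, l1_coef = 512, 8192, 1e-3
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(64, d_model)  # stand-in for residual-stream/MLP activations
feats, recon = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coef * feats.abs().mean()
loss.backward()
opt.step()
```

The L1 penalty pushes most feature activations to zero, so each activation vector is explained by only a handful of features at a time.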
Mar 4, 2024 · 8 tweets · 3 min read
Claude 3 Opus is great at following multiple complex instructions.

To test it, @ErikSchluntz and I had it take on @karpathy's challenge to transform his 2h13m tokenizer video into a blog post, in ONE prompt, and it just... did it

Here are some details:

First, we grabbed the raw transcript of the video and screenshots taken at 5s intervals.

Then, we chunked the transcript into 24 parts for efficient processing (the whole transcript fits within the context window, so this is merely a speed optimization).
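
A rough sketch of what that chunking step could look like, assuming the transcript is a single long string; the sentence-splitting heuristic and the function name are assumptions, and only the count of 24 chunks comes from the description above.

```python
# Hypothetical helper: split a transcript into ~24 chunks of whole sentences
# so each chunk can be processed in parallel.
import math
import re

def chunk_transcript(transcript: str, n_chunks: int = 24) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", transcript)
    per_chunk = math.ceil(len(sentences) / n_chunks)
    return [" ".join(sentences[i:i + per_chunk])
            for i in range(0, len(sentences), per_chunk)]
```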
Jan 18, 2023 · 9 tweets · 2 min read
I just finished watching @karpathy's "Let's build GPT" lecture, and I think it might be the best in the Zero to Hero series so far.

Here are eight insights about transformers that the video did a great job explaining.

Watch the video for more. (1/9)

1. Transformers as a sum of attention blocks

A transformer is mostly a stack of attention blocks. These work similarly in encoders and decoders (see difference below). Each attention block contains multiple heads, allowing each head to attend to different types of information.
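
Here is a compact PyTorch sketch of that picture: a decoder-style block made of multi-head self-attention plus an MLP, with residual connections, stacked a few times. The dimensions, number of heads, and mask details are illustrative choices, not GPT's exact configuration.

```python
# Minimal "stack of attention blocks" sketch: each block is multi-head
# self-attention plus an MLP, with residual connections around both.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # decoder mask
        a, _ = self.attn(self.ln1(x), self.ln1(x), self.ln1(x), attn_mask=causal)
        x = x + a                          # residual around attention
        return x + self.mlp(self.ln2(x))   # residual around the MLP

model = nn.Sequential(*[Block() for _ in range(6)])  # "mostly a stack of blocks"
out = model(torch.randn(1, 10, 64))                  # (batch, tokens, d_model)
```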
Oct 19, 2022 · 11 tweets · 5 min read
Most ML folks I know have @AnthropicAI's Toy Models of Superposition paper on their reading list, but too few have read it.

It is one of the most interesting interpretability papers I've read in a while, and it can benefit anyone using deep learning.

Here are my takeaways!

1/ Feature superposition

The simplest way to represent useful input features in a hidden layer is by using one neuron per feature.

But what happens if you have more useful features than neurons?

How do you compress your features to fit them using fewer neurons?
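
A toy numpy sketch of that setup: more features than neurons, each feature given its own direction in a smaller hidden space, with a ReLU readout as in the paper's toy model. The random directions and sparsity level are illustrative; in the paper, W (and a bias) are learned with gradient descent.

```python
# Toy superposition setup: n sparse features squeezed into m < n neurons
# via non-orthogonal directions. Directions here are random, not learned.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons, sparsity = 20, 5, 0.05

W = rng.normal(size=(n_neurons, n_features))
W /= np.linalg.norm(W, axis=0)  # one unit-norm direction per feature

x = (rng.random(n_features) < sparsity) * rng.random(n_features)  # sparse features
hidden = W @ x                         # compressed representation (5 neurons)
recon = np.maximum(W.T @ hidden, 0.0)  # ReLU readout, as in the toy model

print("active features:", np.nonzero(x)[0])
print("max reconstruction error:", np.abs(recon - x).max())
```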
Oct 2, 2018 · 6 tweets · 3 min read
Just finished Ethics and Data Science, by @mikeloukides, @hmason, @dpatil on @OReillyMedia. Would highly recommend it, as it covers a complex topic from a broad view and with applied tips. Many great takeaways. I've summarized a few below (amazon.com/Ethics-Data-Sc…)

Consent requires clarity: You are required to ask users for permission to use their data, but these requests are often vague and thus lead to breaches of trust. If I agree to give my data to Facebook for security, can it use it for anything? (techcrunch.com/2018/09/27/yes…)