Anthropic
Sep 14 · 11 tweets · 3 min read
Neural networks often pack many unrelated concepts into a single neuron – a puzzling phenomenon known as 'polysemanticity', which makes interpretability much more challenging. In our latest work, we build toy models where the origins of polysemanticity can be fully understood.
If a neural network has 𝑛 feature dimensions, you might intuitively expect that it will store 𝑛 different features. But it turns out that neural networks can store more than 𝑛 features in "superposition" if the features are sparse!
transformer-circuits.pub/2022/toy_model…
[Figure: as sparsity increases, models use superposition to represent more features than dimensions.]
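
As a concrete picture, here is a minimal sketch of the kind of ReLU toy model studied (not the paper's code; the sizes, sparsity level, and importance schedule below are illustrative choices of ours): n sparse features are squeezed through m < n dimensions and then reconstructed.

import torch

n, m = 20, 5                       # n sparse features squeezed into m < n hidden dimensions
S = 0.9                            # sparsity: each feature is zero with probability S
importance = 0.9 ** torch.arange(n, dtype=torch.float32)  # some features matter more than others

W = torch.nn.Parameter(0.1 * torch.randn(m, n))
b = torch.nn.Parameter(torch.zeros(n))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # sample a batch of sparse feature vectors with values in [0, 1]
    x = torch.rand(1024, n) * (torch.rand(1024, n) > S)
    h = x @ W.T                            # compress n features into m dimensions
    x_hat = torch.relu(h @ W + b)          # try to reconstruct all n features
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# with sparse features, typically more than m of the n columns of W end up with
# significant norm, i.e. more features are represented than there are dimensions
print(W.norm(dim=0))
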
For example, if a word embedding has 512 dimensions, one might think it has 512 features (e.g. verb-noun, male-female, singular-plural, …). Superposition suggests there may be many more – they're just not exactly orthogonal.

(This is very similar to ideas in compressed sensing!)
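
One way to build intuition for how this is possible at all: random directions in a high-dimensional space are already nearly orthogonal, so far more than 512 of them can coexist with only small interference. A quick numerical check (the dimension and count here are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 2048                           # 2048 random directions in a 512-dim space
V = rng.standard_normal((k, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

cos = V @ V.T                              # pairwise cosine similarities
np.fill_diagonal(cos, 0.0)
print(np.abs(cos).max())                   # typically ~0.25: four times as many directions as dimensions, all nearly orthogonal
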
Superposition doesn't occur in linear models. It only occurs if one introduces a non-linearity that can filter out noise from superposition "interference". It also requires sparsity: as features become sparser, there is more superposition.
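
To see why the non-linearity matters, here is a hand-built "antipodal pair" (our own minimal example, not code from the paper): two features share a single hidden dimension with opposite signs, and a ReLU plus a small negative bias zeroes out the interference, as long as the two features rarely fire together. A purely linear readout would let that interference leak straight through.

import numpy as np

# two features share one hidden dimension with opposite signs (an "antipodal pair")
W = np.array([[1.0, -1.0]])                # shape: (1 hidden dim, 2 features)
b = -0.2                                   # small negative bias suppresses interference

def reconstruct(x):
    h = W @ x                              # compress both features into one number
    return np.maximum(W.T @ h + b, 0.0)    # ReLU filters out the negative "interference" term

print(reconstruct(np.array([0.7, 0.0])))   # [0.5, 0. ]  feature 1 recovered (shrunk a bit by the bias)
print(reconstruct(np.array([0.0, 0.7])))   # [0. , 0.5]  feature 2 recovered too
print(reconstruct(np.array([0.7, 0.7])))   # [0. , 0. ]  both active at once collide; sparsity makes this rare
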
It turns out that whether a feature is stored in a dedicated dimension, not learned at all, or represented in superposition is governed by a discontinuous phase change between three regimes!
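
A rough way to see this phase change numerically (a simplified Monte Carlo comparison with biases fixed at zero, not the paper's exact analysis): with one hidden dimension and two features of unequal importance, compare the loss of dedicating the dimension to either feature against storing both in superposition, and watch the winner switch discontinuously as the features become sparser.

import numpy as np

rng = np.random.default_rng(0)

def expected_loss(W, density, r, trials=200_000):
    # importance-weighted loss of x_hat = ReLU(W.T W x), biases fixed at zero for simplicity
    x = rng.uniform(size=(trials, 2)) * (rng.uniform(size=(trials, 2)) < density)
    x_hat = np.maximum(x @ W.T @ W, 0.0)
    return (np.array([1.0, r]) * (x - x_hat) ** 2).sum(axis=1).mean()

candidates = {
    "dedicate the dimension to feature 1 (feature 2 not learned)": np.array([[1.0, 0.0]]),
    "dedicate the dimension to feature 2 (feature 1 not learned)": np.array([[0.0, 1.0]]),
    "store both features in superposition":                        np.array([[1.0, -1.0]]),
}

r = 0.5                                    # feature 2 is half as important as feature 1
for density in [1.0, 0.8, 0.5, 0.2, 0.05]:
    best = min(candidates, key=lambda name: expected_loss(candidates[name], density, r))
    print(f"feature density {density:4.2f} -> {best}")
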
Amazingly, features in superposition are organized into subspaces with geometry corresponding to regular polytopes (pentagons, tetrahedra) and other solutions to the Thomson problem. This creates discretized “energy levels” in the amount of dimensionality allocated to features.
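
Using the paper's per-feature dimensionality measure as we read it, D_i = ||W_i||^2 / Σ_j (Ŵ_i · W_j)^2 where W_i is the i-th feature's embedding column and Ŵ_i its unit vector, five equally important features packed into a 2-dimensional subspace as a regular pentagon each get exactly 2/5 of a dimension (the pentagon below is hard-coded by us):

import numpy as np

# five unit-norm feature directions arranged as a regular pentagon in a 2-d subspace
angles = 2 * np.pi * np.arange(5) / 5
W = np.stack([np.cos(angles), np.sin(angles)])     # shape: (2 dims, 5 features)

def feature_dimensionality(W, i):
    # D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2
    w_i = W[:, i]
    w_hat = w_i / np.linalg.norm(w_i)
    return w_i @ w_i / np.sum((w_hat @ W) ** 2)

print([round(feature_dimensionality(W, i), 3) for i in range(5)])   # [0.4, 0.4, 0.4, 0.4, 0.4]
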
We also find that neural nets can do simple computation on features while they're in superposition. This suggests a wild possibility: neural networks may, in some sense, be simulating larger sparse models.
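
A sketch in the spirit of that experiment (the architecture and hyperparameters are our guesses, not the paper's setup): train a model with only 10 ReLU neurons to compute the absolute value of 20 sparse features. A dedicated, monosemantic solution needs two neurons per feature (ReLU(x) + ReLU(-x)), i.e. 40 neurons; with sparse inputs, the smaller model can still drive the loss down by computing in superposition.

import torch

n, m = 20, 10                      # 20 sparse features, but only 10 hidden ReLU neurons
S = 0.95                           # each feature is zero with probability S

W1 = torch.nn.Parameter(0.1 * torch.randn(m, n))
W2 = torch.nn.Parameter(0.1 * torch.randn(n, m))
b = torch.nn.Parameter(torch.zeros(n))
opt = torch.optim.Adam([W1, W2, b], lr=1e-2)

for step in range(10_000):
    # sparse inputs in [-1, 1]; the target is the element-wise absolute value
    x = (2 * torch.rand(1024, n) - 1) * (torch.rand(1024, n) > S)
    y_hat = torch.relu(x @ W1.T) @ W2.T + b
    loss = ((y_hat - x.abs()) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# if the model learns to compute in superposition, this ends up well below the loss
# of a model that dedicates its 10 neurons to only 5 of the 20 features
print(loss.item())
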
If superposition is widespread, understanding it will be essential for interpretability (especially in models like transformers, where polysemanticity is dominant). This is a small first step toward understanding it.
The ideas in this paper build on lots of previous work, including papers hypothesizing the existence of superposition. It's also connected to ideas in compressed sensing, distributed representations, disentanglement, and other work.
There's a lot more in the paper if you're interested, including phase changes, beautiful geometric structure, surprising learning dynamics, a connection to adversarial examples, and much more!
transformer-circuits.pub/2022/toy_model…
[Image: a visual table of contents for the paper.]
Superposition is easy to explore -- you can play with it on a single GPU! We have a notebook reproducing our basic results to start you off: colab.research.google.com/github/anthrop…

More from @AnthropicAI

Aug 25
We examine which safety techniques for LMs are more robust to human-written, adversarial inputs (“red teaming”) and find that RL from Human Feedback scales the best out of the methods we studied. We also release our red team data so others can use it to build safer models.
In “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned” we describe our early efforts to “red team” language models (LMs) in the form of AI assistants. anthropic.com/red_teaming.pdf
To illustrate what “successful” red teaming looks like, the image below shows a red team attack that tricks our least safe assistant, a plain language model, into helping out with a hypothetical drug deal. The annotations show how we quantify the attack.
[Image: a conversation between a red team member and an AI Assistant.]
Read 11 tweets
Jul 13
In "Language Models (Mostly) Know What They Know", we show that language models can evaluate whether what they say is true, and predict ahead of time whether they'll be able to answer questions correctly. arxiv.org/abs/2207.05221
We study a separate predictor for these two tasks: P(True) for whether statements are true, and P(IK) = the probability that "I know" the answer to a question. We evaluate on trivia, story completion, arithmetic, math word problems, and python programming.
But our story begins with calibration: when an AI predicts a probability like 80%, does the corresponding event actually occur 80% of the time? We show that for a variety of multiple choice tasks, with the right format, large language models are well calibrated.
Read 10 tweets
Dec 22, 2021
Our first interpretability paper explores a mathematical framework for trying to reverse engineer transformer language models: A Mathematical Framework for Transformer Circuits: transformer-circuits.pub/2021/framework…
We try to mechanistically understand some small, simplified transformers in detail, as a first step toward understanding large transformer language models.
We show that a one-layer attention-only transformer can be understood as a "skip-trigram model" and that the skip-trigrams can be extracted from model weights without running the model.
Read 7 tweets
