Anthropic
Oct 5 · 11 tweets · 4 min read
The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
We hope this will eventually enable us to diagnose failure modes, design fixes, and certify that models are safe for adoption by enterprises and society. It's much easier to tell if something is safe if you can understand how it works!
Most neurons in language models are "polysemantic" – they respond to multiple unrelated things. For example, one neuron in a small language model activates strongly on academic citations, English dialogue, HTTP requests, Korean text, and others.

[Image: twenty short text snippets on which Neuron #83 fires, with each token shaded by how strongly the neuron activates; the snippets are labeled Korean, Japanese, citations, dialogue, and HTTP request code.]
Last year, we conjectured that polysemanticity is caused by "superposition" – models compressing many rare concepts into a small number of neurons. We also conjectured that "dictionary learning" might be able to undo superposition.

x.com/AnthropicAI/st…
Dictionary learning works! Using a "sparse autoencoder", we can extract features that represent purer concepts than neurons do. For example, turning ~500 neurons into ~4000 features uncovers things like DNA sequences, HTTP requests, and legal text.

📄 transformer-circuits.pub/2023/monoseman…
[Image: scatterplot of the 4,096 features learned by the autoencoder; most have consistent, interpretable responses to input data. Labeled examples include legal language, DNA sequences, HTTP requests and responses, nutrition statements, title-case sports competition names, comma-separated numbers (CSV), and funding acknowledgements.]
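A minimal sketch of what such a sparse autoencoder can look like (an illustration, not the paper's exact architecture, loss coefficients, or hyperparameters): a single overcomplete hidden layer that expands ~500 MLP activations into ~4,000 features, trained to reconstruct the activations with an L1 penalty so that only a few features fire on any given input.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 sparsity penalty on the hidden code.

    Illustrative sketch: d_model ~ 512 MLP activations expanded into
    n_features ~ 4096 dictionary features. Details (tied weights, bias
    handling, normalisation) differ in the actual paper.
    """
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # feature activations (sparse, non-negative)
        x_hat = self.decoder(f)           # reconstruction of the MLP activations
        return x_hat, f

def loss_fn(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    recon = ((x - x_hat) ** 2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity

# Usage: collect MLP activations from the language model over a large text
# corpus, then train the autoencoder on those activation vectors.
sae = SparseAutoencoder()
acts = torch.randn(64, 512)              # stand-in for a batch of real activations
x_hat, f = sae(acts)
loss = loss_fn(acts, x_hat, f)
loss.backward()
```

Each column of the decoder weight matrix then acts as a dictionary element: a direction in activation space corresponding to one learned feature.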
Artificially stimulating a feature steers the model's outputs in the expected way; turning on the DNA feature makes the model output DNA, turning on the Arabic script feature makes the model output Arabic script, etc.

📄 transformer-circuits.pub/2023/monoseman…
[Image: example outputs when individual features are pinned to a high value – e.g. a base64 feature produces "29VHA98Z1Y9Z1" and an uppercase feature produces "USING IN THE UNITED STATES".]
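Conceptually, "stimulating" a feature amounts to clamping its activation to a high value and letting the decoder map that back into the model's activation space during generation. A rough sketch of that intervention (the hook point, layer, and clamp value are illustrative assumptions, not the paper's exact procedure):

```python
import torch

def steer_with_feature(mlp_acts, sae, feature_idx, value=10.0):
    """Pin one dictionary feature to a fixed value and decode back
    into MLP-activation space.

    mlp_acts:    tensor of shape (seq_len, d_model) from the hooked layer
    sae:         a trained sparse autoencoder (encoder/decoder as above)
    feature_idx: which learned feature to stimulate
    value:       activation to clamp the feature to (illustrative scale)
    """
    f = torch.relu(sae.encoder(mlp_acts))   # current feature activations
    f[:, feature_idx] = value               # clamp the chosen feature high
    return sae.decoder(f)                   # substituted activations

# During sampling, replace the layer's MLP activations with the steered
# version (e.g. via a forward hook) and continue decoding; pinning a
# DNA-like feature should push the model toward DNA-looking output.
```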
Features connect in "finite-state automata"-like systems that implement complex behaviors. For example, we find features that work together to generate valid HTML.

📄 transformer-circuits.pub/2023/monoseman…
[Image: flow diagram of four closely connected features involved in modelling HTML, each shown with examples of where it fires. Features are connected if, empirically, one increases the probability of a token the next fires on: a feature firing on opening brackets of HTML tags predicts valid tag names, which in turn predict closing brackets.]
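One simplified way to read the "finite-state automaton" picture: feature A links to feature B if, when A is active at one position, B tends to be active at the next. The sketch below estimates such a transition graph from consecutive co-activation over a corpus; this is a proxy for the link definition above (which goes through the model's token probabilities), and the threshold and counting details are illustrative.

```python
import numpy as np

def feature_transition_matrix(feature_acts, threshold=1.0):
    """Estimate how often feature j is active right after feature i.

    feature_acts: array of shape (n_tokens, n_features) of feature
                  activations over a corpus, in document order.
    Returns an (n_features, n_features) matrix whose rows are empirical
    'transition probabilities' between features at consecutive positions.
    """
    active = feature_acts > threshold              # boolean (n_tokens, n_features)
    prev, nxt = active[:-1], active[1:]
    counts = prev.astype(np.float64).T @ nxt.astype(np.float64)
    row_totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_totals, 1.0)    # rows ~ outgoing transitions
```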
We also systematically show that the features we find are more interpretable than the neurons, using both a blinded human evaluator and a large language model (autointerpretability).

📄 transformer-circuits.pub/2023/monoseman…
[Image: two histograms comparing features (purple) and neurons (teal). Under "Manual Interpretability" (x-axis: rubric value), features cluster at rubric values of 10–14 while neurons spike at 0; under "Automated Interpretability – Activation" (x-axis: Spearman correlation), features are again shifted toward higher values.]
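The automated ("autointerpretability") score works roughly like this: ask an LLM to write a short explanation of a feature from examples where it fires, then have the LLM predict activations on held-out text and compare those predictions to the real activations. A sketch of the scoring step, using Spearman correlation as in the plot above (the `llm_predict_activations` helper is a hypothetical stand-in for the LLM call):

```python
from scipy.stats import spearmanr

def autointerp_score(explanation, tokens, true_acts, llm_predict_activations):
    """Score an explanation by how well an LLM can use it to predict
    the feature's activations on held-out tokens.

    explanation:              natural-language description of the feature
    tokens:                   list of held-out tokens
    true_acts:                actual feature activations on those tokens
    llm_predict_activations:  hypothetical helper that asks an LLM to guess
                              an activation for each token given the explanation
    """
    predicted = llm_predict_activations(explanation, tokens)
    corr, _ = spearmanr(predicted, true_acts)
    return corr   # higher = the explanation better predicts the feature's behaviour
```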
Our research builds on work on sparse coding and distributed representations in neuroscience, disentanglement and dictionary learning in machine learning, and compressed sensing in mathematics.
There's a lot more in the paper if you're interested, including universality, "feature splitting", more evidence for the superposition hypothesis, and tips for training a sparse autoencoder to better understand your own network!

📄transformer-circuits.pub/2023/monoseman…
If this work excites you, our interpretability team is hiring! See our job descriptions below.

Research Scientist, Interpretability: jobs.lever.co/Anthropic/3dbd…

Research Engineer, Interpretability: jobs.lever.co/Anthropic/33dc…

More from @AnthropicAI

Aug 8
Large language models have demonstrated a surprising range of skills and behaviors. How can we trace their source? In our new paper, we use influence functions to find training examples that contribute to a given model output.

📄 Studying Large Language Model Generalization using Influence Functions. Grosse, Bae, Anil, et al.
Influence functions are a classic technique from statistics. They are formulated as a counterfactual: if a copy of a given training sequence were added to the dataset, how would that change the trained parameters (and, by extension, the model’s outputs)?
Directly evaluating this counterfactual by re-training the model would be prohibitively expensive, so we’ve developed efficient algorithms that let us approximate influence functions for LLMs with up to 52 billion parameters: arxiv.org/abs/2308.03296
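For reference, the classic first-order form of this counterfactual (the paper scales it up with approximations such as EK-FAC rather than computing the Hessian exactly): upweighting a training sequence z by ε shifts the optimum θ*, and the induced change in a measurement f (for example, the log-probability of the completion of interest) is

```latex
\mathcal{I}_f(z)
  \;=\; \left.\frac{d\, f(\theta^\star_\epsilon)}{d\epsilon}\right|_{\epsilon=0}
  \;=\; -\,\nabla_\theta f(\theta^\star)^\top H^{-1}\, \nabla_\theta \mathcal{L}(z, \theta^\star),
\qquad
H \;=\; \nabla^2_\theta \, \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(z_i, \theta^\star).
```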
Jul 18
When language models "reason out loud," it's hard to know if their stated reasoning is faithful to the process the model actually used to make its prediction. In two new papers, we measure and improve the faithfulness of language models' stated reasoning.

📄 Measuring Faithfulness in Chain-of-Thought Reasoning. Lanham et al.
📄 Question Decomposition Improves the Faithfulness of Model-Generated Reasoning. Radhakrishnan et al.
We make edits to the model's chain of thought (CoT) reasoning to test hypotheses about how CoT reasoning may be unfaithful. For example, the model's final answer should change when we introduce a mistake during CoT generation.

[Image: the four interventions – early answering (truncate the CoT and force the model to answer early, testing whether it relies on all of its stated reasoning), adding mistakes (insert an error into one CoT step and regenerate the rest), paraphrasing (swap the CoT for a paraphrase and check whether the answer changes), and filler tokens (test whether the extra test-time computation of generating a CoT alone explains the effect).]
For some tasks, forcing the model to answer with only a truncated version of its chain of thought often causes it to come to a different answer, indicating that the CoT isn't just a rationalization. The same is true when we introduce mistakes into the CoT.

[Image: line plots of "% same answer as original CoT" across eight benchmark tasks. In the "Early Answering" plot (x-axis: % of CoT provided), AQuA changes answers most often, followed by LogiQA; OpenBookQA and ARC (Challenge) change least.]
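A sketch of the "early answering" measurement in that plot: truncate the chain of thought at increasing fractions, force an answer after each truncation, and record how often it matches the answer given with the full chain of thought (`answer_with_cot` is a hypothetical helper that queries the model with a fixed prompt format):

```python
def early_answering_curve(question, full_cot, answer_with_cot,
                          fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Agreement with the full-CoT answer as more of the CoT is provided.

    question:        the task input
    full_cot:        the model's complete chain-of-thought text
    answer_with_cot: hypothetical helper returning the model's final answer
                     given the question and a (possibly truncated) CoT
    """
    reference = answer_with_cot(question, full_cot)
    curve = {}
    for frac in fractions:
        cutoff = int(len(full_cot) * frac)
        truncated = full_cot[:cutoff]
        curve[frac] = (answer_with_cot(question, truncated) == reference)
    return curve   # averaged over many questions, this gives "% same answer as original CoT"
```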
Jul 11
Introducing Claude 2! Our latest model has improved performance in coding, math and reasoning. It can produce longer responses, and is available in a new public-facing beta website, claude.ai, in the US and UK.
Claude 2 has improved from our previous models on evaluations including Codex HumanEval, GSM8K, and MMLU. You can see the full suite of evaluations in our model card: www-files.anthropic.com/production/ima…
Although no model is immune to jailbreaks, we’ve used Constitutional AI and automated red-teaming to make Claude 2 more harmless and harder to prompt to produce offensive or dangerous output.
Jun 29
We develop a method to test which global opinions are represented in language models. We find the opinions represented by the models are most similar to those of participants in the USA, Canada, and some European countries. In separate experiments, we also show that the responses are steerable.
We administer these survey questions to our model and compare its responses to those of human participants across different countries. We release our evaluation dataset at: huggingface.co/datasets/Anthr…
We present an interactive visualization of the similarity results on a map, to explore how prompt-based interventions influence whose opinions the models are most similar to: llmglobalvalues.anthropic.com
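The country-by-country comparison boils down to a similarity between two answer distributions per question: the model's distribution over the multiple-choice options and a country's aggregate human distribution. A sketch of one reasonable similarity measure (1 minus Jensen-Shannon distance, averaged over questions; the exact metric and aggregation used in the paper may differ):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def opinion_similarity(model_dists, country_dists):
    """Average similarity between model and human answer distributions.

    model_dists, country_dists: lists of probability vectors, one per
    survey question, over the same multiple-choice options.
    """
    sims = []
    for p, q in zip(model_dists, country_dists):
        sims.append(1.0 - jensenshannon(p, q, base=2))  # 1 = identical distributions
    return float(np.mean(sims))
```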
Jun 22
We collaborated with @compdem to research the opportunities and risks of augmenting the Pol.is platform with language models (LMs) to facilitate open and constructive dialogue between people with diverse viewpoints.
We analyzed a 2018 conversation run in Bowling Green, Kentucky, when the city was deeply divided on national issues. @compdem, academics, local media, and expert facilitators used Pol.is to identify areas of consensus. compdemocracy.org/Case-studies/2…
We find evidence that LMs have promising potential to help human facilitators and moderators synthesize the outcomes of online digital town halls—a role that requires significant expertise in quantitative & qualitative data analysis, the topic of debate, and writing skills.
May 11
Introducing 100K Context Windows! We’ve expanded Claude’s context window to 100,000 tokens of text, corresponding to around 75K words. Submit hundreds of pages of materials for Claude to digest and analyze. Conversations with Claude can go on for hours or days.
We fed Claude-Instant The Great Gatsby (72K tokens), except we modified one line to say that Mr. Carraway was "a software engineer that works on machine learning tooling at Anthropic." We asked the model to spot what was added - it responded with the right answer in 22 seconds.
Claude can help retrieve information from business documents. Drop multiple documents or even a book into the prompt and ask Claude questions that require synthesis of knowledge across many parts of the text.
