How to get URL link on X (Twitter) App
Features for deception were active over the transcript. Was the model intentionally being deceptive?
The task we study is knowing when to break the line in fixed-width text.
A key component of transformers is attention, which directs the flow of information from one token to another, and connects features.
https://twitter.com/AnthropicAI/status/1928119229384970244
The initial release lets you generate graphs for small open-weights models. You can just type a prompt and see an explanation of the key steps involved in generating the next token!
For context, the goal of dictionary learning is to untangle the activations inside the neurons of an LLM into a small set of interpretable features.
1/ Feature superposition