Jan Leike
ML Researcher, @AnthropicAI. Previously at OpenAI & DeepMind. Optimizing for a post-AGI future where humanity flourishes.
Jun 27 10 tweets 3 min read
Very exciting that this is out now (from my time at OpenAI):

We trained an LLM critic to find bugs in code, and this helps humans find flaws on real-world production tasks that they would have missed otherwise.

A promising sign for scalable oversight!

openai.com/index/finding-…
Why are we doing this? RLHF is fundamentally limited by humans' ability to evaluate models – it won't scale well.

Scalable oversight aims to alleviate this problem by using AI to help humans evaluate. Here we tried the simplest idea: train a critic to point out flaws.
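To make the critic idea concrete, here is a rough sketch of critic-assisted review. The trained critic model itself isn't public, so this just prompts a general chat model to act as a code critic; the model name, prompt, and helper function are illustrative placeholders of mine, not the paper's setup.

```python
# Rough sketch only: prompt a general model to act as a code critic.
# Assumes the openai Python package and an OPENAI_API_KEY in the environment;
# the model name and prompt are placeholders, not the trained critic.
from openai import OpenAI

client = OpenAI()

def critique(code: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "You are a code reviewer. List concrete bugs and flaws, with line references."},
            {"role": "user", "content": code},
        ],
    )
    return response.choices[0].message.content

print(critique("def mean(xs):\n    return sum(xs) / len(xs)"))
```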
Jun 6 6 tweets 2 min read
This is super cool work! Sparse autoencoders are currently the most promising approach to actually understanding how models "think" internally.

This new paper demonstrates how to scale them to GPT-4 and beyond – completely unsupervised.

A big step forward!

What are sparse autoencoders (SAEs)?

The SAEs in this paper consist of two parts, an encoder and a decoder.

The encoder is a linear transformation from the model's internal state ("what the model is thinking about") to "concept space". By passing the model's internal state through this linear transformation we get the most active concepts related to this internal state.

The fact that this transformation is linear means it's "simple" in some sense: almost all of the "work" of extracting the relevant concepts is done by the model, not the SAE.

The decoder is another linear transformation from the concept space back to the internal state.

This decoder is important for training the encoder: we train encoder and decoder together by reducing the "reconstruction error" -- the difference between the original internal state of the model and the approximation we get by chaining encoder and decoder together. In other words, we use the decoder to translate from concept space back to the model's internal state as faithfully as possible.

For SAEs we want the features in concept space to be *sparse*, meaning that at any given time only a small number of concepts are active (think 500 out of 16 million). This makes sense intuitively because in any given situation only a small number of concepts apply: most objects are not apples, most animals are not horses, most sentences are not rhetorical questions, and so on.
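As a concrete illustration of the pieces above (linear encoder, linear decoder, reconstruction loss, and TopK-style sparsity), here is a minimal PyTorch sketch. The dimensions and the value of k are toy choices of mine, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal TopK sparse autoencoder sketch; toy sizes, not OpenAI's implementation.
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, n_concepts=4096, k=32):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_concepts)  # internal state -> concept space
        self.decoder = nn.Linear(n_concepts, d_model)  # concept space -> internal state
        self.k = k                                     # how many concepts stay active

    def forward(self, h):
        acts = self.encoder(h)
        # Sparsity: keep only the k most active concepts, zero out the rest.
        topk = torch.topk(acts, self.k, dim=-1)
        sparse = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse), sparse

sae = SparseAutoencoder()
h = torch.randn(16, 512)              # stand-in for the model's internal state
recon, concepts = sae(h)
loss = (recon - h).pow(2).mean()      # reconstruction error used to train both parts
```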
May 17 13 tweets 2 min read
Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI. It's been such a wild journey over the past ~3 years. My team launched the first ever RLHF LLM with InstructGPT, published the first scalable oversight results on LLMs, and pioneered automated interpretability and weak-to-strong generalization. More exciting stuff is coming out soon.
Apr 10 5 tweets 1 min read
The superalignment fast grants are now decided!

We got a *ton* of really strong applications, so unfortunately we had to say no to many that we're very excited about.

There is still so much good research waiting to be funded.

Congrats to all recipients! We just sent emails notifying all applicants. I'm sorry if your application didn't make it in; please don't let this dissuade you from pursuing alignment research!
Mar 12 4 tweets 1 min read
Today we're releasing a tool we've been using internally to analyze transformer internals - the Transformer Debugger!

It combines both automated interpretability and sparse autoencoders, and it allows rapid exploration of models without writing code. It supports both neurons and attention heads.

You can intervene on the forward pass by ablating individual neurons and see what changes.

In short, it's a quick and easy way to discover circuits manually.
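As a rough illustration of that kind of intervention (outside the tool itself), here is a toy PyTorch sketch that ablates one hidden unit during the forward pass and measures how the output changes; the model and hook are stand-ins of mine, not Transformer Debugger code.

```python
import torch
import torch.nn as nn

# Toy model and hook: zero out ("ablate") one hidden unit and compare outputs.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

def ablate_neuron(index):
    def hook(module, inputs, output):
        output = output.clone()
        output[..., index] = 0.0      # silence a single neuron's activation
        return output                 # returned value replaces the module's output
    return hook

x = torch.randn(1, 8)
baseline = model(x)

handle = model[1].register_forward_hook(ablate_neuron(3))  # ablate hidden unit 3
ablated = model(x)
handle.remove()

print("max change in output:", (ablated - baseline).abs().max().item())
```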
Dec 18, 2023 7 tweets 2 min read
I'm very excited that OpenAI is adopting its new preparedness framework today!

This framework spells out our strategy for measuring and forecasting risks, and our commitments to stop deployment and development if safety mitigations are ever lagging behind.

openai.com/safety/prepare…

Some highlights:

1. Safety baselines: There are 4 risk levels: low, medium, high, and critical. High prevents deployment; critical prevents further development.

2. Measurements: We measure dangerous capabilities continually, at least at every 2x increase in effective compute.
Dec 14, 2023 8 tweets 3 min read
Super excited about our new research direction for aligning smarter-than-human AI:

We finetune large models to generalize from weak supervision—using small models instead of humans as weak supervisors.

Check out our new paper:
openai.com/research/weak-…
For lots of important tasks we don't have ground truth supervision:
Is this statement true?
Is this code buggy?

We want to elicit the strong model's capabilities on these tasks without access to ground truth.

This is pretty central to aligning superhuman models.
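To illustrate the setup (not the paper's GPT-2-supervises-GPT-4 experiments), here is a toy weak-to-strong sketch with scikit-learn stand-ins: a small model's labels supervise a larger model, and we compare against the larger model's ground-truth ceiling. Dataset, model choices, and splits are arbitrary assumptions of mine.

```python
# Toy weak-to-strong sketch: a "weak" model's labels supervise a "strong" model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=40, n_informative=10, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, test_size=0.75, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)   # weak supervisor
weak_labels = weak.predict(X_train)                            # weak supervision, no ground truth

strong_on_weak = GradientBoostingClassifier().fit(X_train, weak_labels)  # weak-to-strong
strong_ceiling = GradientBoostingClassifier().fit(X_train, y_train)      # ground-truth ceiling

print("weak supervisor:            ", weak.score(X_test, y_test))
print("strong trained on weak labels:", strong_on_weak.score(X_test, y_test))
print("strong trained on ground truth:", strong_ceiling.score(X_test, y_test))
```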
May 9, 2023 7 tweets 3 min read
Really exciting new work on automated interpretability:

We ask GPT-4 to explain firing patterns for individual neurons in LLMs and score those explanations.

openai.com/research/langu…

This is a new direction for simultaneously getting two things that we want out of interpretability:

1. understanding the model at the level of detail of individual neurons

2. running over the entire model so we don't miss anything important
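One simple way to score such explanations, as a rough sketch of the correlation-based scoring idea: simulate the neuron's activations from the explanation and measure how well they correlate with the real activations. The function below is an illustration of mine, not the paper's exact scorer.

```python
import numpy as np

def explanation_score(true_activations, simulated_activations):
    # Correlation between the neuron's real activations and the activations
    # a model simulates from the written explanation; higher means the
    # explanation predicts the neuron's behavior better.
    true = np.asarray(true_activations, dtype=float)
    sim = np.asarray(simulated_activations, dtype=float)
    if true.std() == 0 or sim.std() == 0:
        return 0.0
    return float(np.corrcoef(true, sim)[0, 1])

print(explanation_score([0.0, 0.2, 0.9, 0.1], [0.1, 0.1, 0.8, 0.0]))
```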
Aug 24, 2022 5 tweets 2 min read
How will we solve the alignment problem for AGI?

I've been working on this question for almost 10 years now.

Our current path is very promising:
openai.com/blog/our-appro…

1/
1. We're training AI using human feedback

By optimizing for human preferences we get systems that understand what humans intend and try to achieve it.

Best example: InstructGPT
openai.com/blog/instructi…
Jan 27, 2022 6 tweets 3 min read
Extremely exciting alignment research milestone:

Using reinforcement learning from human feedback, we've trained GPT-3 to be much better at following human intentions.

openai.com/blog/instructi…

Disclaimer: There isn't really much new in terms of methods or algorithms.

But it shows that our alignment techniques work in a real-world setting.

This is a cheaper and more effective way to make the model more useful than just making it bigger.
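As a toy sketch of the preference-modelling step inside RLHF (illustrative tensors of mine, not OpenAI's training code): the reward model is trained so that responses labelers preferred score higher than the ones they rejected, and that reward then steers the fine-tuning.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise preference loss: lower when the preferred response's reward
    # exceeds the rejected response's reward.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

chosen = torch.tensor([1.2, 0.3, 2.0])     # reward scores for responses labelers preferred
rejected = torch.tensor([0.1, 0.5, -0.4])  # scores for the rejected alternatives
print(preference_loss(chosen, rejected))
```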