Jan Leike
May 17
Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI.
It's been such a wild journey over the past ~3 years. My team launched the first-ever RLHF LLM with InstructGPT, published the first scalable oversight results on LLMs, and pioneered automated interpretability and weak-to-strong generalization. More exciting stuff is coming out soon.
I love my team.

I'm so grateful for the many amazing people I got to work with, both inside and outside of the superalignment team.

OpenAI has so much exceptionally smart, kind, and effective talent.
Stepping away from this job has been one of the hardest things I have ever done, because we urgently need to figure out how to steer and control AI systems much smarter than us.
I joined because I thought OpenAI would be the best place in the world to do this research.

However, I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.
I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics.
These problems are quite hard to get right, and I am concerned we aren't on a trajectory to get there.
Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.
Building smarter-than-human machines is an inherently dangerous endeavor.

OpenAI is shouldering an enormous responsibility on behalf of all of humanity.
But over the past years, safety culture and processes have taken a backseat to shiny products.
We are long overdue in getting incredibly serious about the implications of AGI.

We must prioritize preparing for them as best we can.

Only then can we ensure AGI benefits all of humanity.
OpenAI must become a safety-first AGI company.
To all OpenAI employees, I want to say:

Learn to feel the AGI.

Act with the gravitas appropriate for what you're building.

I believe you can "ship" the cultural change that's needed.

I am counting on you.

The world is counting on you.

:openai-heart:


More from @janleike

Apr 10
The superalignment fast grants are now decided!

We got a *ton* of really strong applications, so unfortunately we had to say no to many that we're very excited about.

There is still so much good research waiting to be funded.

Congrats to all recipients!
We just sent emails notifying all applicants. I'm sorry if your application didn't make it in; please don't let this dissuade you from pursuing research in alignment!
Some statistics on the superalignment fast grants:

We funded 50 out of ~2,700 applications, awarding a total of $9,895,000.

Median grant size: $150k
Average grant size: $198k
Smallest grant size: $50k
Largest grant size: $500k

Grantees:
Universities: $5.7m (22)
Graduate students: $3.6m (25)
Nonprofits: $250k (1)
Individuals: $295k (2)

Research areas funded (some proposals cover multiple areas, so this sums to >$10m):
Weak-to-strong generalization: $5.2m (26)
Scalable oversight: $1m (5)
Top-down interpretability: $1.9m (9)
Mechanistic interpretability: $1.2m (6)
Chain-of-thought faithfulness: $700k (2)
Adversarial robustness: $650k (4)
Data attribution: $300k (1)
Evals/prediction: $700k (4)
Other: $1m (6)
Mar 12
Today we're releasing a tool we've been using internally to analyze transformer internals - the Transformer Debugger!

It combines both automated interpretability and sparse autoencoders, and it allows rapid exploration of models without writing code.
It supports both neurons and attention heads.

You can intervene on the forward pass by ablating individual neurons and see what changes.

In short, it's a quick and easy way to discover circuits manually.
This is still an early-stage research tool, but we are releasing it so others can play with it and build on it!

Check it out:

github.com/openai/transfo…
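To make "ablating individual neurons" concrete, here is a minimal sketch of the idea; this is not the Transformer Debugger's actual interface. It zeroes one MLP neuron in GPT-2 via a PyTorch forward hook and compares next-token logits, and the model, layer, and neuron indices are arbitrary choices for illustration.

```python
# Minimal illustration of neuron ablation (not the Transformer Debugger API).
# Assumes the Hugging Face `transformers` library and GPT-2; the layer and
# neuron indices below are arbitrary picks for demonstration.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER, NEURON = 5, 123  # hypothetical target neuron

def ablate_hook(module, inputs, output):
    # Zero out one hidden unit of the MLP activation in this layer.
    output[:, :, NEURON] = 0.0
    return output

prompt = "The Eiffel Tower is located in"
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    clean_logits = model(ids).logits[0, -1]

handle = model.transformer.h[LAYER].mlp.act.register_forward_hook(ablate_hook)
with torch.no_grad():
    ablated_logits = model(ids).logits[0, -1]
handle.remove()

# See which next-token predictions moved the most when the neuron was removed.
diff = (clean_logits - ablated_logits).abs()
top = diff.topk(5).indices
print([tokenizer.decode(int(t)) for t in top])
```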
Dec 18, 2023
I'm very excited that OpenAI is adopting its new preparedness framework today!

This framework spells out our strategy for measuring and forecasting risks, and our commitments to stop deployment and development if safety mitigations are ever lagging behind.

openai.com/safety/prepare…
Some highlights:

1. Safety baselines: There are 4 risk levels: low, medium, high, and critical. High prevents deployment, critical prevents further development.

2. Measurements: We measure dangerous capabilities continually, at least every 2x effective compute increase.
3. Security: Our security measures must be sufficient to prevent safety mitigations from being circumvented via model exfiltration.

4. Alignment: We have to show that our models never initiate critical-level tasks without explicitly being instructed to do so.
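The gating rule in those safety baselines can be written down in a few lines. The sketch below is my own schematic rendering of the public summary above, not OpenAI's actual implementation.

```python
# Schematic sketch of the preparedness-framework gating rule described above
# (my rendering of the public summary, not an OpenAI implementation).
from enum import IntEnum

class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

def allowed_actions(post_mitigation_risk: Risk) -> dict:
    return {
        # Only models at or below "medium" post-mitigation risk can be deployed.
        "deploy": post_mitigation_risk <= Risk.MEDIUM,
        # Only models below "critical" risk can be developed further.
        "develop_further": post_mitigation_risk < Risk.CRITICAL,
    }

print(allowed_actions(Risk.HIGH))      # {'deploy': False, 'develop_further': True}
print(allowed_actions(Risk.CRITICAL))  # {'deploy': False, 'develop_further': False}
```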
Dec 14, 2023
Super excited about our new research direction for aligning smarter-than-human AI:

We finetune large models to generalize from weak supervision—using small models instead of humans as weak supervisors.

Check out our new paper:
openai.com/research/weak-…
For lots of important tasks we don't have ground truth supervision:
Is this statement true?
Is this code buggy?

We want to elicit the strong model's capabilities on these tasks without access to ground truth.

This is pretty central to aligning superhuman models.
We find that large models generally do better than their weak supervisor (a smaller model), but not by much.

This suggests reward models won't be much better than their human supervisors.

In other words: RLHF won't scale.
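To make the setup concrete, here is a toy sketch of weak-to-strong generalization using small scikit-learn models on synthetic data, not the paper's actual GPT-4-scale experiments: a weak supervisor is trained on ground truth, a stronger student is trained only on the weak supervisor's labels, and both are compared to a strong ceiling trained directly on ground truth.

```python
# Toy sketch of weak-to-strong generalization (illustrative only; the paper's
# experiments finetune large language models, not these small classifiers).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=8000, n_features=40, n_informative=20,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# "Weak supervisor": a deliberately small model trained on ground truth.
weak = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
weak_labels = weak.predict(X_train)

# "Strong student": a larger model trained only on the weak supervisor's labels.
student = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=500,
                        random_state=0).fit(X_train, weak_labels)

# "Strong ceiling": the same architecture trained directly on ground truth.
ceiling = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=500,
                        random_state=0).fit(X_train, y_train)

print("weak supervisor accuracy:", weak.score(X_test, y_test))
print("weak-to-strong accuracy :", student.score(X_test, y_test))
print("strong ceiling accuracy :", ceiling.score(X_test, y_test))
# The gap between the last two lines is the question the paper studies:
# how much of the strong model's capability do weak labels recover?
```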
May 9, 2023
Really exciting new work on automated interpretability:

We ask GPT-4 to explain firing patterns for individual neurons in LLMs and score those explanations.

openai.com/research/langu…
This is a new direction for simultaneously getting two things that we want out of interpretability:

1. understanding the model at the level of detail of individual neurons

2. running over the entire model so we don't miss anything important
What's exciting is that it gives us a way to measure how good a neuron explanation is: we simulate how well someone reading the explanation could predict future firing patterns, and compare this to the actual firing patterns.

Right now this measure isn't that accurate, but it'll improve with better LLMs!
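As a rough illustration of that scoring idea (my simplification, not the paper's exact metric): simulate per-token activations from the explanation and correlate them with the neuron's real activations.

```python
# Toy sketch of explanation scoring (not the paper's exact pipeline): score an
# explanation by how well activations "simulated" from it correlate with the
# neuron's real activations. In the real pipeline the simulator is GPT-4
# reading a natural-language explanation; here a keyword rule stands in, and
# all numbers are made up.
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat", "because", "it", "was", "warm"]
real_activations = np.array([0.1, 2.3, 0.0, 0.1, 0.2, 1.9, 0.0, 0.1, 0.0, 0.3])

# Hypothetical explanation: "fires on nouns for household animals and objects".
def simulate(tokens):
    keywords = {"cat", "mat"}  # what a reader of that explanation might predict
    return np.array([1.0 if t.lower() in keywords else 0.0 for t in tokens])

simulated = simulate(tokens)

# Score the explanation by correlating simulated and real activations.
score = np.corrcoef(simulated, real_activations)[0, 1]
print(f"explanation score: {score:.2f}")
```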
Aug 24, 2022
How will we solve the alignment problem for AGI?

I've been working on this question for almost 10 years now.

Our current path is very promising:
openai.com/blog/our-appro…

1/
1. We're training AI using human feedback

By optimizing for human preferences we get systems that understand what humans intend and try to achieve it.

Best example: InstructGPT
openai.com/blog/instructi…
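For a concrete sense of what "optimizing for human preferences" involves, here is a minimal sketch of the standard pairwise reward-model loss used in RLHF-style training. This is a generic illustration, not OpenAI's code: the reward model is trained so that the response a human preferred scores higher than the rejected one.

```python
# Minimal sketch of a pairwise reward-model loss for RLHF-style training
# (generic illustration, not OpenAI's implementation). `reward_model` is any
# network mapping a (prompt, response) embedding to a scalar score.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def preference_loss(chosen_emb, rejected_emb):
    # Bradley-Terry style objective: push the chosen response's reward
    # above the rejected response's reward.
    r_chosen = reward_model(chosen_emb)
    r_rejected = reward_model(rejected_emb)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Dummy embeddings standing in for encoded (prompt, response) pairs.
chosen = torch.randn(8, 768)
rejected = torch.randn(8, 768)
loss = preference_loss(chosen, rejected)
loss.backward()
print(float(loss))
```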
2. We're training AI to assist human evaluation

This is critical for scaling learning from feedback, since it enables humans to state their preferences even on very complex tasks that they otherwise don't understand well.

Best example: AI-written critiques
openai.com/blog/critiques
