Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI.
It's been such a wild journey over the past ~3 years. My team launched the first ever RLHF LLM with InstructGPT, published the first scalable oversight results on LLMs, and pioneered automated interpretability and weak-to-strong generalization. More exciting stuff is coming out soon.
I love my team.
I'm so grateful for the many amazing people I got to work with, both inside and outside of the superalignment team.
OpenAI has so much exceptionally smart, kind, and effective talent.
Stepping away from this job has been one of the hardest things I have ever done, because we urgently need to figure out how to steer and control AI systems much smarter than us.
I joined because I thought OpenAI would be the best place in the world to do this research.
However, I had been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.
I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics.
These problems are quite hard to get right, and I am concerned we aren't on a trajectory to get there.
Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.
Building smarter-than-human machines is an inherently dangerous endeavor.
OpenAI is shouldering an enormous responsibility on behalf of all of humanity.
But over the past years, safety culture and processes have taken a backseat to shiny products.
We are long overdue in getting incredibly serious about the implications of AGI.
We must prioritize preparing for them as best we can.
Only then can we ensure AGI benefits all of humanity.
OpenAI must become a safety-first AGI company.
To all OpenAI employees, I want to say:
Learn to feel the AGI.
Act with the gravitas appropriate for what you're building.
I believe you can "ship" the cultural change that's needed.
I am counting on you.
The world is counting on you.
We got a *ton* of really strong applications, so unfortunately we had to say no to many we're very excited about.
There is still so much good research waiting to be funded.
Congrats to all recipients!
We just sent emails notifying all applicants. I'm sorry if your application didn't make it in; please don't let this dissuade you from pursuing research in alignment!
Some statistics on the Superalignment Fast Grants:
We funded 50 out of ~2,700 applications, awarding a total of $9,895,000.
Median grant size: $150k
Average grant size: $198k
Smallest grant size: $50k
Largest grant size: $500k
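As a quick arithmetic check on the figures above (a minimal sketch using only the numbers quoted in this thread), the average grant size follows directly from the total and the count:

```python
# Sanity check on the grant statistics quoted above.
total_awarded = 9_895_000   # total dollars awarded
num_grants = 50             # funded applications

average = total_awarded / num_grants
print(f"Average grant size: ${average:,.0f}")  # -> Average grant size: $197,900 (~$198k)
```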
I'm very excited that OpenAI is adopting its new Preparedness Framework today!
This framework spells out our strategy for measuring and forecasting risks, and our commitments to stop deployment and development if safety mitigations are ever lagging behind.
This is a new direction for simultaneously getting two things that we want out of interpretability:
1. understanding the model at the level of detail of individual neurons
2. running over the entire model so we don't miss anything important
What's exciting is that it gives us a way to measure how good a neuron explanation is: we simulate how well a reader of the explanation could predict future firing patterns and compare the simulation to the actual firing patterns.
Right now this measure isn't that accurate, but it'll improve with better LLMs!
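To make the scoring idea concrete, here is a minimal sketch (my own illustration, not OpenAI's actual pipeline): a simulator reads only the text explanation, guesses the neuron's activation on each token, and the explanation is scored by how well those simulated activations correlate with the recorded ones.

```python
import numpy as np

def explanation_score(simulated: np.ndarray, actual: np.ndarray) -> float:
    """Score a neuron explanation by how well activations simulated from the
    explanation alone track the neuron's real activations (Pearson correlation)."""
    simulated = simulated - simulated.mean()
    actual = actual - actual.mean()
    denom = np.linalg.norm(simulated) * np.linalg.norm(actual)
    if denom == 0:
        return 0.0  # a constant simulation or a dead neuron gets no credit
    return float(simulated @ actual / denom)

# Toy example: simulated activations that roughly track the real ones score near 1.
actual = np.array([0.0, 0.1, 2.3, 0.0, 1.8, 0.0])
simulated = np.array([0.0, 0.0, 2.0, 0.1, 1.5, 0.0])
print(round(explanation_score(simulated, actual), 3))  # ~0.998
```

Correlation is just one possible scoring rule; the key design choice is that an explanation is judged by its predictive power, not by how plausible it sounds.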
This is critical for scaling learning from human feedback, since it enables humans to state their preferences even on very complex tasks that they otherwise don't understand well.
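For background on what learning from feedback looks like mechanically, here is a minimal sketch (my own illustration of a standard pairwise-preference setup, not any specific OpenAI codebase): human preferences between two responses become a training signal for a reward model via a Bradley-Terry style loss.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_preferred: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss over pairwise human preferences.

    reward_preferred / reward_rejected are reward-model scores for the response
    the labeler preferred vs. the one they rejected. Minimizing the loss pushes
    the model to rank preferred responses higher, which is how stated
    preferences become a training signal.
    """
    return -F.logsigmoid(reward_preferred - reward_rejected).mean()

# Toy batch of scalar reward scores for three comparisons.
r_pref = torch.tensor([1.2, 0.3, 2.0])
r_rej = torch.tensor([0.4, 0.5, -1.0])
print(preference_loss(r_pref, r_rej))  # smaller when preferred responses score higher
```

Whatever assistance the tweet refers to sits upstream of this loss: it helps humans produce the reliable preference labels that the loss consumes.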