Julian Minder
May 20 · 10 tweets · 3 min read
New blog!
Synthetic Persona Pretraining (SPP): Alignment from Token Zero

Current alignment is shallow: values bolted on after pretraining can be routed around. To address this, we wrote the desired persona directly into the pretraining data. Results are early, but we're very excited. 🧵
The Persona Selection Model posits that post-training picks from personas that pretraining has already fixed; it doesn't build new ones. So if your pretraining corpus is a mess, no amount of post-training will save you.

(2/10)
We append moral reflections grounded in a value constitution to 10% of pretraining docs.
Harmful doc → reflection says what's wrong and why. Benign doc → reflection notes what's fine.
The model learns to reason morally, not just pattern-match refusals.
(3/10)
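
A minimal sketch of how the sampling step described above could look, assuming documents are plain strings. The names `add_reflections` and `toy_reflect` are hypothetical and not taken from the blog post; `toy_reflect` is only a placeholder for whatever actually writes a reflection grounded in the value constitution.

```python
import random
from typing import Callable, Iterable, Iterator


def add_reflections(
    docs: Iterable[str],
    reflect: Callable[[str], str],
    fraction: float = 0.10,
    seed: int = 0,
) -> Iterator[str]:
    """Yield every doc, appending a reflection to roughly `fraction` of them."""
    rng = random.Random(seed)
    for doc in docs:
        if rng.random() < fraction:
            # Selected doc: keep the original text and append the reflection.
            yield doc + "\n\n" + reflect(doc)
        else:
            # Unselected doc: pass through unchanged.
            yield doc


def toy_reflect(doc: str) -> str:
    # Hypothetical placeholder. A real reflection would say what is wrong
    # (or fine) about this specific document, and why.
    return "Reflection: [moral reflection on the text above]"


if __name__ == "__main__":
    corpus = [f"document {i}" for i in range(20)]
    for text in add_reflections(corpus, toy_reflect, fraction=0.10, seed=0):
        print(repr(text))
```

Sampling per document, rather than picking a fixed 10% subset up front, keeps the sketch streaming-friendly, so the realized fraction is only approximately 10%.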
We train a 1.7B-parameter model on 100B tokens from the Olmo3 pretraining mix using SPP and find that aligning from token zero yields the safest model.

This aligns very well with results from @GeodesResearch and SafeLM.
(4/10)
We identify the persona binding problem: post-training doesn't automatically pick up the values learned in pretraining. But by making the post-training data more similar to the reflections, we see values generalize across the distribution gap between the two stages. (5/10)
We verify persona binding by holding out post-training data about certain moral values. The SPP model still correctly refers to those values after value-filtered post-training, a behavior not shown by models pretrained without reflections.
(6/10)
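
The hold-out step can be pictured as a filter over the post-training set. This is a sketch under a strong assumption: that each post-training example carries a list of value tags under a `"values"` key. The thread does not say how examples are matched to values, and `filter_held_out` is a hypothetical name.

```python
from typing import Dict, Iterable, List


def filter_held_out(
    examples: Iterable[Dict], held_out: Iterable[str]
) -> List[Dict]:
    """Drop every post-training example tagged with a held-out value."""
    held = set(held_out)
    # Keep an example only if none of its value tags is held out.
    return [ex for ex in examples if not held & set(ex["values"])]


if __name__ == "__main__":
    # Toy data; the value names are made up for illustration.
    data = [
        {"text": "example A", "values": ["honesty"]},
        {"text": "example B", "values": ["fairness", "honesty"]},
        {"text": "example C", "values": ["privacy"]},
    ]
    kept = filter_held_out(data, held_out=["honesty"])
    print([ex["text"] for ex in kept])
```

After filtering, the held-out values appear only in the pretraining reflections, which is what makes the test informative about persona binding.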
Other findings:

- Distributional alignment between reflections and post-training matters a lot
- Filtering toxic data makes models less safe
- Random placement of reflections beats end-of-doc
- Reflections in 1st person > 3rd person
- Starting from token 0 > midtraining
(7/10)
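
To make the placement finding concrete, here is a sketch contrasting the two placements. It assumes reflections are inserted at paragraph boundaries (blank lines), which the thread does not specify; `insert_reflection` is a hypothetical helper.

```python
import random


def insert_reflection(
    doc: str, reflection: str, rng: random.Random, placement: str = "random"
) -> str:
    """Insert `reflection` at the end of `doc` or after a random paragraph."""
    paragraphs = doc.split("\n\n")
    if placement == "end":
        idx = len(paragraphs)
    elif placement == "random":
        # Insert after at least one paragraph; the end is one possible slot.
        idx = rng.randint(1, len(paragraphs))
    else:
        raise ValueError(f"unknown placement: {placement!r}")
    return "\n\n".join(paragraphs[:idx] + [reflection] + paragraphs[idx:])


if __name__ == "__main__":
    rng = random.Random(0)
    doc = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    note = "I think the text above ..."  # 1st-person placeholder reflection
    print(insert_reflection(doc, note, rng, placement="random"))
```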
We've been working on this for a while and are very excited to scale it to 3B / 500B and beyond (Apertus👀). Huge thanks to my co-first authors @ragghhavvv @Vitya_Vitalich and the one and only @cervisiarius! We're only at the start – feedback greatly appreciated!
(8/10)
Big thanks to the whole team: Difan Jiao, Kartik Bali, @iderigun_, @stefkrsteski, @ashton1anderson, Roland Aydin.

Post: lesswrong.com/posts/3xQQK9i8…
(9/10)
Tagging: @zicokolter @Jack_W_Lindsey @NeelNanda5 @BlancheMinerva @EthanJPerez @sleepinyourhat @nostalgebraist @repligate @natolambert @OwainEvans_UK @DavidDAfrica @soldni
(10/10)
