Julian Minder Profile picture
PhD at EPFL with Robert West and Ryan Cotterell, MATS 7 Scholar with Neel Nanda

May 20, 10 tweets

New blog!
Synthetic Persona Pretraining (SPP): Alignment from Token Zero

Current alignment is shallow - values bolted on after pretraining can be routed around. To solve this, we wrote the desired persona directly into pretraining data. Early results, but we're very excited. 🧵

The Persona Selection Model posits that post-training picks from personas that pretraining already fixed, it doesn't build new ones. So if your pretraining corpus is a mess, no amount of post-training will save you.

(2/10)

We append moral reflections grounded in a value constitution to 10% of pretraining docs.
Harmful doc → reflection says what's wrong and why. Benign doc → reflection notes what's fine.
The model learns to reason morally, not just pattern-match refusals.
(3/10)

We train a 1.7B model on 100B tokens from the Olmo3 pretraining mix using SPP, and find that alignment from token zero results in the safest model.

This aligns very well with result from @GeodesResearch and SafeLM.
(4/10)

We identify the persona binding problem: post-training doesn't automatically pick up the values learned in pretraining. But by making the post-training data more similar to the reflections, we see values generalize across the distribution gap between the two stages. (5/10)

We verify persona binding by holding out post-training data about certain moral values. The SPP model still correctly refers to those values after value-filtered post-training, a behavior not shown by models pretrained without reflections.
(6/10)

Other findings:

- Distributional alignment between reflections and post-training matters a lot
- Filtering toxic data makes models less safe
- Random placement of reflections beats end-of-doc
- Reflections in 1st person > 3rd person
- Starting from token 0 > midtraining
(7/10)

We've been working on this for a while and are very excited to scale it to 3B / 500B and beyond (Apertus👀). Huge thanks to my co-first authors @ragghhavvv @Vitya_Vitalich and the one and only @cervisiarius! We're only at the start – feedback greatly appreciated!
(8/10)

Big thanks to the whole team: Difan Jiao, Kartik Bali, @iderigun_, @stefkrsteski, @ashton1anderson, Roland Aydin.

Post:
(9/10)lesswrong.com/posts/3xQQK9i8…

Tagging: @zicokolter @Jack_W_Lindsey @NeelNanda5 @BlancheMinerva @EthanJPerez @sleepinyourhat @nostalgebraist @repligate @natolambert @OwainEvans_UK @DavidDAfrica @soldni
(10/10)

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Keep scrolling