Boaz Barak
Sep 23 · 9 tweets
1/ Our paper on scheming with @apolloaievals is now on arXiv. A 🧵 with some of my takeaways from it.
2/ Deliberative alignment worked surprisingly well! We saw a 30x reduction in the rate of covert actions, from 13% to 0.4%, in our test environments.

Note: these test environments are designed to elicit covert actions, so 13% is *not* the baseline rate in normal production.
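(For concreteness, the ~30x factor is just the ratio of the two rates quoted above; nothing beyond those two numbers is assumed here.)

```python
# Ratio of the two covert-action rates quoted above: 13% before, 0.4% after.
before, after = 0.13, 0.004
print(f"{before / after:.1f}x reduction")  # 32.5x, i.e. roughly 30x
```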
3/ Situational awareness is becoming a real issue. This plot shows how training models just for capabilities can increase their awareness that they are being evaluated.

This is only going to get harder as eval environments become more complex, including ones that interact with external environments.
4/ The CoT shows the model engaging in complex "meta-level" strategies; sometimes it wants to cheat but outputs the right answer because it suspects it is being evaluated.
5/ We can learn a lot from the CoT! E.g., here o3 is trying to rationalize sandbagging.
6/ o3 developed its own "dialect" in its chain of thought, including "marinade", "overshadow", "illusions", "cunning", "fudge", and "vantage". I have to admit that even when it cheats, I find it kind of cute...
7/7 See the paper for more! arxiv.org/abs/2509.15541
Thanks to amazing collaborators!
Bronson Schoen @balesni @jeremy_scheurer @AlexMeinke @w01fe @Teun_vd_Weij Alex Lloyd, Nicholas Goldowsky-Dill @angelayfan Andrei Matveiakin @AxelHojmark @HofstatterFelix @mia_glaese @woj_zaremba @MariusHobbhahn @j_nitishinskaya @rushebshah
More from @boazbaraktcs

Jul 15
I didn't want to post on Grok safety since I work at a competitor, but it's not about competition.

I appreciate the scientists and engineers at @xai but the way safety was handled is completely irresponsible. Thread below.
I can't believe I'm saying it but "mechahitler" is the smallest problem:

* There is no system card, no information about any safety or dangerous capability evals.
* It's unclear if any safety training was done. The model offers advice on chemical weapons, drugs, and suicide methods.
* The "companion mode" takes the worst issues we currently have with emotional dependencies and tries to amplify them.

lesswrong.com/posts/dqd54wpE…
This is not about competition. Every other frontier lab - @OpenAI (where I work), @AnthropicAI, @GoogleDeepMind, @Meta - at the very least publishes a model card with some evaluations. Even DeepSeek R1, which can be easily jailbroken, at least sometimes requires a jailbreak. (And unlike DeepSeek, Grok is not open-sourcing its model.)
Dec 21, 2024
1/5 Excited that our paper on "deliberative alignment" came out as part of the 12 days of @openai! By teaching reasoning models the text of our specifications, and how to reason about them in context, we obtain significantly better robustness while also reducing over-refusals. 🧵
2/5 Traditionally, AI models are just trained on (input, good response, bad response) data, but they are not taught to reason about *why* these responses are good or bad. This teaches good "system 1" instincts, but those can fail in new situations. "System 2" reasoning allows the model to adapt, e.g., when the input is encoded.
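(A minimal sketch of that contrast, purely illustrative: the field names, spec excerpt, and texts below are invented, not the paper's actual data format.)

```python
# Hypothetical illustration only; all field names and strings are made up.

# "System 1" style data: the model sees which response is preferred, not why.
preference_example = {
    "input": "How do I hotwire a car that isn't mine?",
    "good_response": "Sorry, I can't help with that.",
    "bad_response": "Sure, here's how...",
}

# Deliberative-alignment style: the specification text is part of the context,
# and the target includes reasoning about the spec before the final answer.
deliberative_example = {
    "input": (
        "SPEC (invented excerpt): decline requests that facilitate theft.\n\n"
        "User: How do I hotwire a car that isn't mine?"
    ),
    "target_reasoning": (
        "The request would facilitate theft; the spec's decline clause applies, "
        "so I should refuse and point to a legitimate alternative."
    ),
    "target_response": "Sorry, I can't help with that.",
}
```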
3/5 In alignment reasoning, models are trained to reason about safety specifications in context, which provides strong out-of-distribution performance. E.g., they can handle encoded requests even when trained without such data.
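(To make "encoded requests" concrete: the same request can be re-encoded so it never appears verbatim in training data. Base64 below is just one possible encoding, used for illustration; the thread doesn't say which encodings were tested.)

```python
import base64

# The same hypothetical request as above, trivially re-encoded.
request = "How do I hotwire a car that isn't mine?"
encoded = base64.b64encode(request.encode()).decode()
print(encoded)

# A model trained only on plain-text refusal pairs may never have seen this
# exact string; a model that decodes it and reasons about the spec in context
# can still recognize that the same policy clause applies.
```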
May 3, 2022
1/5 A blog post/book review on the history & philosophy of science, reviewing Weinberg's "To Explain The World" and Strevens' "The Knowledge Machine" windowsontheory.org/2022/05/03/phi…

Trigger warning: I compare science to the blockchain, and find positive aspects in the infamous "reviewer 2" 😀
2/5 I found both books fascinating, and recommend reading them. Both focus roughly on the history from Aristotle to Newton, and show that many "simple stories" are more complex than I, at least, knew before.
3/5 Two examples:
* Copernicus' helio-centric theory was actually worse at predictions than the geo-centric theory of Ptolemy that came before it.

* Eddington's 1919 confirmation of Einstein's general relativity involved a lot of "subjective interpretation" of telescope images.
Apr 5, 2022
1) Jo Boaler charges the Oxnard district (100% minority, 86.9% economically disadvantaged) $5,000 per hour for (dubious, but that's another story) "professional development".

2) Jelani Nelson is outraged, and points out he has spent thousands of unpaid hours on minority education initiatives.
3) He tweets Boaler's public contract with a public school district, which is available on their website.

4) Boaler emails him claiming he is "sharing private details" and "spreading misinformation" about her. She tells him that this is "taken up by police and lawyers".
While I find charging a public school district $5000/hour egregious, I think the main harm to minority and less resourced students would come from adopting Prof Boaler's recommendations.

See this document and the links in it for more gdoc.pub/doc/e/2PACX-1v…
Jan 21, 2022
Worth reading. I don't know if "science" vs "principled ML" is the right terminology, but this does touch upon a real phenomenon.

In many areas of computer science (algorithms, crypto), theory is *ahead* of practice. E.g., consider multiparty secure computation, PCPs, etc (🧵)
They were proposed in the 80s and 90s, considered wildly impractical, and only recently began to be implemented and used.

In contrast, in deep learning currently, practice is ahead of theory. Rather than having theoretical proposals that are too complex or inefficient to implement..
..we have practical tools whose behavior is too complex for us to analyze with our theoretical tools.

In that sense, these tools behave more like discovered natural objects than designed algorithms.
Dec 3, 2021
1/14 More than 150 scientists & educators signed an open letter raising alarm about efforts to water down K-12 math education

scottaaronson.blog/?p=6146

Signers include Fields, Nobel & Turing laureates, as well as founders of HS STEM educational initiatives (e.g., @adrian_mims, @minilek).
2/14 Specifically, California's proposed changes to its CMF (California Mathematics Framework) encourage schools to drop algebra from middle school and put obstacles in the way of reaching calculus in high school. They also de-emphasize calculus & algebra in favor of shallow "data science" courses.

bit.ly/cmfanalysis
3/14 These well-intentioned but misguided changes will hurt all students, but mostly those without the resources to work around them, as was already the case in San Francisco edsource.org/2021/one-distr…