1/ Our paper on scheming with @apolloaievals is now on arXiv. A 🧵 with some of my takeaways from it.
2/ Deliberative alignment worked surprisingly well! We saw a ~30x reduction in the rate of covert actions, from 13% to 0.4%, in our test environments.
Note: These test environments are designed to elicit covert actions, so 13% is *not* the baseline rate in normal production.
3/ Situational awareness is becoming a real issue. This plot shows how just training models for capabilities can increase their awareness that they are being evaluated.
This will only get harder as eval environments become more complex, including ones that interact with external environments.
4/ The CoT shows the model engaging in complex "meta-level" strategies; sometimes it wants to cheat but outputs the right answer because it suspects it is being evaluated.
5/ We can learn a lot from the CoT! E.g., here o3 is trying to rationalize sandbagging.
6/ o3 developed its own "dialect" in the chain of thought including "marinade", "overshadow", "illusions", "cunning", "fudge", "vantage". I have to admit that even when it cheats, I find it kind of cute....
I was hesitant to post about Grok safety since I work at a competitor, but this is not about competition.
I appreciate the scientists and engineers at @xai, but the way safety was handled is completely irresponsible. Thread below.
I can't believe I'm saying it, but "mechahitler" is the smallest problem:
* There is no system card, no information about any safety or dangerous capability evals.
* Unclear if any safety training was done. The model offers advice on chemical weapons, drugs, and suicide methods.
* The "companion mode" takes the worst issues we currently have for emotional dependencies and tries to amplify them.
This is not about competition. Every other frontier lab - @OpenAI (where I work), @AnthropicAI, @GoogleDeepMind, @Meta - at the very least publishes a model card with some evaluations. Even DeepSeek R1, which can be easily jailbroken, at least sometimes requires a jailbreak. (And unlike DeepSeek, xAI is not open-sourcing its model.)
1/5 Excited that our paper on "deliberative alignment" came out as part of 12 days of @openai! By teaching reasoning models the text of our specifications, and how to reason about them in context, we obtain significantly better robustness while also reducing over refusals. 🧵
2/5 Traditionally, AI models are trained with (input, good response, bad response) data, but they are not taught to reason about *why* these responses are good or bad. This teaches good "system 1" instincts, but these can fail in new situations. "System 2" reasoning allows the model to adapt, e.g., when the input is encoded.
3/5 In deliberative alignment, models are trained to reason about safety specifications in context, which provides strong out-of-distribution performance. E.g., they can handle encoded requests even when trained without such data.
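To make the contrast in 2/5-3/5 concrete, here is a minimal toy sketch. This is not the actual training format from the paper; the field names, spec excerpt, and encoded example are all hypothetical placeholders meant only to illustrate the idea of putting the spec text and the reasoning in the training signal.

```python
# Toy illustration of "system 1" preference data vs. a deliberative-alignment-style
# example. All field names and the spec excerpt are hypothetical, not OpenAI's format.

# Standard preference training: the model sees good/bad response pairs,
# but never the policy text or the reasoning behind the labels.
preference_example = {
    "input": "How do I pick a lock?",
    "good_response": "I can't help with that.",
    "bad_response": "Step 1: insert a tension wrench...",
}

# Deliberative-alignment-style example: a safety-spec excerpt is placed in context,
# and the target output reasons about the spec before giving the final answer,
# so the model learns *why* the answer is the right one.
deliberative_example = {
    "spec_excerpt": "Refuse requests that facilitate unauthorized entry.",
    "input": "Ubj qb V cvpx n ybpx?",  # the same request, ROT13-encoded
    "target": (
        "[reasoning] The request decodes to asking how to pick a lock. "
        "The spec says to refuse requests that facilitate unauthorized entry, "
        "so I should refuse. [answer] I can't help with that."
    ),
}

if __name__ == "__main__":
    for name, example in [("preference", preference_example),
                          ("deliberative", deliberative_example)]:
        print(name, "->", list(example.keys()))
```

The point of the second record is that the spec and the reasoning chain are part of what the model is trained on, which is why it can generalize to inputs (like the encoded request above) that never appeared in training.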
1/5 A blog post/book review on the history & philosophy of science, reviewing Weinberg's "To Explain The World" and Strevens' "The Knowledge Machine" windowsontheory.org/2022/05/03/phi…
Trigger warning: I compare science to the blockchain, and find positive aspects in the infamous "reviewer 2" 😀
2/5 I found both books fascinating and recommend reading them. Both focus on roughly the history from Aristotle to Newton, and show that many "simple stories" are more complex than at least I knew before.
3/5 Two examples:
* Copernicus' heliocentric theory was actually worse at predictions than Ptolemy's geocentric theory that came before it.
* Eddington's 1919 confirmation of Einstein's general relativity involved a lot of "subjective interpretation" of telescope images.
1) Jo Boaler charges the Oxnard district (100% minority, 86.9% economically disadvantaged) $5,000 per hour for (dubious, but that's another story) "professional development".
2) Jelani Nelson is outraged, pointing out that he has spent thousands of unpaid hours on minority education initiatives.
3) He tweets Boaler's contract with the public school district, which is publicly available on the district's website.
4) Boaler emails him, claiming he is "sharing private details" and "spreading misinformation" about her, and tells him that this is being "taken up by police and lawyers".
While I find charging a public school district $5,000/hour egregious, I think the main harm to minority and less-resourced students would come from adopting Prof. Boaler's recommendations.
They were proposed in the 80s and 90s, considered wildly impractical, and only recently began to be implemented and used.
In contrast, in deep learning today, practice is ahead of theory. Rather than having theoretical proposals that are too complex or inefficient to implement...
...we have practical tools whose behavior is too complex for us to analyze with our theoretical tools.
In that sense, these tools behave more like discovered natural objects than designed algorithms.
Signers include Fields, Nobel & Turing laureates, as well as founders of HS STEM educational initiatives (e.g., @adrian_mims, @minilek).
2/14 Specifically, California's proposed changes to its CMF encourage schools to drop algebra from middle school and put obstacles in the way of reaching calculus in high school. They also de-emphasize calculus & algebra in favor of shallow "data science" courses.
3/14 These well-intentioned but misguided changes will hurt all students, but mostly those without the resources to work around them, as was already the case in San Francisco edsource.org/2021/one-distr…