policy for v smart things @openai. Past: CS PhD @HarvardSEAS/@SchmidtFutures/@MIT_CSAIL. Tweets my own; on my head be it.
Mar 10 • 6 tweets • 2 min read
These results are a massive deal, and they've overhauled the way I think about alignment and misalignment.
I think this suggests a new default alignment strategy.
Results and takeaways 🧵
For current capability levels,
1️⃣ Complex reward hacking already happens in practice in frontier training runs, and the models get extremely creative with their hacks. (I’m glad we’re increasing transparency on this, and hope others follow suit.)
Dec 22, 2024 • 7 tweets • 5 min read
Now that everyone knows about o3, and imminent AGI is considered plausible, I’d like to walk through some of the AI policy implications I see.
These are my own takes and in no way reflect my employer's views. They might be wrong! I know smart people who disagree. They don't require you to share my timelines, and are intentionally unrelated to the previous AI-safety culture wars.
Nov 24, 2023 • 5 tweets • 1 min read
If you are a public figure and tell your followers that “big new risks from advanced AI are fake”, you are wrong.
Not only that, you’ll be seen to be wrong *publicly & soon*.
This is not an “EA thing”; it is an oncoming train and it is going to hit you. Either help out or shut up.
We are headed for >=1 of:
* Massive job loss & weakening of labor
* Massive cost-cuts to totalitarianism
* Autonomous agents reshaping the [cyber/information] env
* Major acceleration of R&D
* AI systems we cannot trust with power, but that we're caught in a prisoner's dilemma to deploy anyway
Jul 5, 2023 • 16 tweets • 5 min read
The data used to train an AI model is vital to understanding its capabilities and risks. But how can we tell whether a model W actually resulted from a dataset D?
In a new paper, we show how to verify models' training data, including the data of open-source LMs! arxiv.org/abs/2307.00682
Today, users and auditors have to blindly trust AI developers about the training data used.
But shady devs have an incentive to lie, whether to sell you a weaker model or to dodge auditor/regulatory scrutiny.
If auditors could catch any lies about data, we’d start trusting AI audits much more.
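The paper has the full protocol, but the core spot-checking idea is roughly: the developer commits to periodic weight checkpoints and the data used between them, and a verifier replays a few randomly chosen training segments to check they reproduce the logged weights. A minimal toy sketch of that idea (the train_step, batch_hash, and verify_segment helpers here are illustrative stand-ins, not the paper's actual code):

```python
# Toy sketch of checkpoint spot-checking for training-data verification.
# Assumes (hypothetically) the developer logs periodic weight checkpoints plus
# hashes of the data batches used between them; a verifier replays a randomly
# chosen segment and checks it lands on the next logged checkpoint.
import hashlib
import random

def train_step(w, batch):
    # Toy deterministic update; a real run would be an SGD step on real data.
    return w + 0.01 * sum(batch) / len(batch)

def batch_hash(batch):
    return hashlib.sha256(repr(batch).encode()).hexdigest()

def verify_segment(start_w, end_w, batches, logged_hashes, tol=1e-9):
    """Replay one logged training segment and check it reproduces the next checkpoint."""
    # 1. The claimed data must match the committed-to hashes.
    if [batch_hash(b) for b in batches] != logged_hashes:
        return False
    # 2. Re-executing the segment from the earlier checkpoint must land
    #    (approximately) on the later checkpoint.
    w = start_w
    for b in batches:
        w = train_step(w, b)
    return abs(w - end_w) < tol

# --- Developer ("prover") side: log checkpoints + data hashes during training ---
random.seed(0)
data = [[random.random() for _ in range(4)] for _ in range(20)]   # 20 toy batches
checkpoints, hashes_per_segment = [0.0], []
w = 0.0
for seg in range(4):                        # 4 segments of 5 batches each
    seg_batches = data[seg * 5:(seg + 1) * 5]
    hashes_per_segment.append([batch_hash(b) for b in seg_batches])
    for b in seg_batches:
        w = train_step(w, b)
    checkpoints.append(w)

# --- Verifier side: spot-check one random segment instead of replaying the whole run ---
k = random.randrange(4)
ok = verify_segment(checkpoints[k], checkpoints[k + 1],
                    data[k * 5:(k + 1) * 5], hashes_per_segment[k])
print(f"segment {k} verified: {ok}")
```

Spot-checking random segments is what keeps this cheap for the verifier: a lying developer can't know in advance which segments will be replayed.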
Mar 24, 2023 • 32 tweets • 6 min read
If AI models get even better, their unchecked use will begin to pose serious dangers to society.
Most people agree it'd be great if countries could settle on rules to prevent AI misuse/accidents, & avoid an arms race.
But how could rules on AI actually be enforced?
Paper thread:
I wrote a paper on my current best guess for how to do it.
That said, verifying compliance internationally is a messy business. The devil is in the details, and the details aren’t on Twitter.
See the paper for all my caveats!
A huge problem with coalition-building in AI policy is that policymakers are trained to think on 10-year horizons (because policy is slow), while researchers are trained to think on 3-year horizons (because they need to assess whether their paper will be liked or get scooped).
When policymakers ask the research community “will large AI models be more dangerous than tanks?”, researchers try to predict what PaLM+3y could plausibly do, and then publicly say “no, that's an overreaction”.