Ideologies very often end up producing the opposite of what they claim to want. Environmentalism, liberalism, communism, transhumanism, AI safety…
I call this the activist‘s curse. Understanding why it happens is one of the central problems of our time.
Twelve hypotheses:
1. Adverse selection on who participates. The loudest alarm is probably false, and the loudest activist is probably crazy. 2. Entrenchment. Accelerating in one direction creates pushback in the opposite direction, which eventually overpowers you. lesswrong.com/posts/B2CfMNfa…
3. Perverse internal incentives. When your whole identity is wrapped up in a problem, it getting solved is terrifying. 4. Cowardice. If you actually try, failure is your responsibility. But failing without trying lets you still feel good about yourself: astralcodexten.com/p/book-review-…
5. Territoriality and infighting while trying to keep the movement pure. “What makes an outgroup? Proximity plus small differences”: 6. Respectability politics. You become your own side’s harshest critic to curry favor with outsiders.slatestarcodex.com/2014/09/30/i-c…
7. Perverse external incentives. You’ll be rewarded for doing what’s worst for you, because that’s what’s most interesting/funny/outrageous: 8. The dark forest hypothesis. When you’re too loud, the real powers come out and eat you.slatestarcodex.com/2014/12/17/the…
9. Trauma from fighting the world. “If you gaze long into an abyss” you become a cynic. 10. Ossification. Organizations eventually become so burdened by layers of rules, cruft, and “organizational scar tissue” that they’re net negative for their own goals: overcomingbias.com/p/what-makes-s…
11. Goodhart’s law. When you optimize a proxy hard enough, it eventually diverges strongly from what you really care about. 12. “Don’t crash into the tree”: you get more of what you pay attention to. Fear-driven motivation doesn’t work.
Eleven opinions on AI risk that cut across standard worldview lines: 1. The biggest risks are subversion of key institutions and infrastructure (see QT) and development of extremely destructive weapons. 2. If we avoid those, I expect AI to be extremely beneficial for the world.
3. I am skeptical of other threat models, especially ones which rely on second-order/ecosystem effects. Those are very hard to predict. 4. There’s been too much focus on autonomous replication and adaptation; power-seeking “outside the system” is hard. See lesswrong.com/posts/xiRfJApX…
5. “Alignment” is a property of models not a property of research. I support any research that tries to understand neural networks on a deep scientific level. 6. Open-source is super useful for this, and will continue to be net-positive for years more (perhaps indefinitely).
My former colleague Leopold argues compellingly that society is nowhere near ready for AGI. But what might the large-scale alignment failures he mentions actually look like? Here’s one scenario for how building misaligned AGI could lead to humanity losing control. THREAD:
Consider a scenario where human-level AI has been deployed across society to help with a wide range of tasks. In that setting, an AI lab trains an artificial general intelligence (AGI) that’s a significant step up - it beats the best humans on almost all computer-based tasks.
Throughout training, the AGI will likely learn a helpful persona, like current AI assistants do. But that might not be the only persona it learns. We've seen many examples where models can be "jailbroken" to expose very different hidden personas.
So apparently UK courts can decide that two unrelated jobs are “of equal value”.
And people in the “underpaid” job get to sue for years of lost wages.
And this has driven their 2nd biggest city bankrupt.
Am I getting something wrong or is this as crazy as it sounds?
There are a bunch of equal pay cases, but the biggest is against Birmingham City Council, which paid over a billion pounds in compensation because some jobs (like garbage collectors) got bonuses and others (like cleaners) didn’t. Now the city is bankrupt.
In my mind the core premise of AI alignment is that AIs will develop internally-represented values which guide their behavior over long timeframes.
If you believe that, then trying to understand and influence those values is crucial.
If not, the whole field seems strange.
Lately I’ve tried to distinguish “AI alignment” from “AI control”. The core premise of AI control is that AIs will have the opportunity to accumulate real-world power (e.g. resources, control over cyber systems, political influence), and that we need techniques to prevent that.
Those techniques include better monitoring, security, red-teaming, stenography detection, and so on. They overlap with alignment, but are separable from it. You could have alignment without control, or control without alignment, or neither, or (hopefully) both.
Taking artificial superintelligence seriously on a visceral level puts you a few years ahead of the curve in understanding how AI will play out.
The problem is that knowing what’s coming, and knowing how to influence it, are two very very different things.
Here’s one example of being ahead of the curve: “situational awareness”. When the term was coined a few years ago it seemed sci-fi to most. Today it’s under empirical investigation. And once there’s a “ChatGPT moment” for AI agents, it will start seeming obvious + prosaic.
But even if we can predict that future agents will be situationally aware, what should we do about that? We can’t study it easily yet. We don’t know how to measure it or put safeguards in place. And once it becomes “obvious”, people will forget why it originally seemed worrying.