Research scientist on the scalable alignment team at Google DeepMind. All views are my own.
Dec 6 • 9 tweets • 3 min read
1) AIs are trained as black boxes, making it hard to understand or control their behavior. This is bad for safety! But what is an alternative? Our idea: train structure into a neural network by configuring which components update on different tasks. We call it "gradient routing."

2) With gradient routing, the user gets to decide which parameters update for each data point. By sending a particular kind of data to a particular network region, the user can induce specialization within the network. This enables a novel kind of neural net supervision.
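To make the mechanism concrete, here is a minimal sketch of per-example gradient routing in PyTorch. This is my own illustration, not the paper's code: GradientRoutedMLP, block_a/block_b, and the route_a flag are hypothetical names, and routing is implemented with a simple .detach() mask that leaves the forward value unchanged while cutting the gradient path to the non-routed block.

```python
import torch
import torch.nn as nn

class GradientRoutedMLP(nn.Module):
    """Hypothetical gradient-routing sketch (not the paper's implementation).

    Two parallel hidden blocks share an input and an output head. A
    per-example boolean route decides which block receives gradients for
    that example: the forward value always uses both blocks, but the
    gradient path to the non-routed block is severed with .detach().
    """

    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.block_a = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.block_b = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.head = nn.Linear(2 * d_hidden, d_out)

    def forward(self, x: torch.Tensor, route_a: torch.Tensor) -> torch.Tensor:
        # route_a: bool tensor of shape (batch,); True => gradients flow to block A.
        ha, hb = self.block_a(x), self.block_b(x)
        m = route_a.float().unsqueeze(-1)
        # Value is unchanged; only the gradient path depends on the route.
        ha = m * ha + (1.0 - m) * ha.detach()
        hb = (1.0 - m) * hb + m * hb.detach()
        return self.head(torch.cat([ha, hb], dim=-1))

# Usage sketch: route examples with label < 5 to block A, the rest to block B.
model = GradientRoutedMLP(784, 128, 10)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x, route_a=(y < 5)), y)
loss.backward()  # each example's gradient reaches only its routed block
```

Because routing changes only the backward pass, the forward computation stays a plain mixture of both blocks; any specialization emerges purely from which data each block learns from.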
Dec 6, 2021 • 5 tweets • 3 min read
#NeurIPS2021 spotlight: Optimal policies tend to seek power.
Consider Pac-Man: Dying traps Pac-Man in one state forever, while staying alive lets him do more things. Our theorems show that, for exactly this reason, staying alive is optimal under most reward functions. 🧵:
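The counting intuition can be checked numerically with a toy example of my own (not the paper's Pac-Man formalization): suppose "left" commits the agent to one terminal state while "right" lets it pick the better of two, with rewards drawn iid uniform over terminal states. "Right" is then optimal whenever the better of its two terminals beats the single "left" terminal, which happens about 2/3 of the time.

```python
import numpy as np

# Toy three-terminal MDP (hypothetical): "left" reaches terminal L only;
# "right" reaches terminals R1 and R2, and the agent keeps the better one.
rng = np.random.default_rng(0)
r_L, r_R1, r_R2 = rng.uniform(size=(3, 100_000))  # iid rewards per sampled reward function

right_optimal = np.maximum(r_R1, r_R2) >= r_L
print(f"P(right optimal) ≈ {right_optimal.mean():.3f}")  # ≈ 0.667: more options, more ways to be optimal
```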
We show this formally through *environment symmetries*. In this MDP, the visualized state permutation ϕ embeds the "left" subgraph into the "right" subgraph. The upshot: going "right" leads to more options, and more options → more ways for "right" to be optimal.
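As a rough paraphrase of the formal claim (my notation, approximated from the paper's visit-distribution machinery; the full statement differs in its details):

```latex
% F(s, a): the options (state-visit distributions) available after taking a at s.
% If a state permutation \phi embeds the "left" options into the "right" options,
\[
  \phi \cdot F(s, \mathrm{left}) \subseteq F(s, \mathrm{right})
  \;\Longrightarrow\;
  \Pr_{R \sim \mathcal{D}}\bigl[\text{``right'' is optimal}\bigr]
  \;\ge_{\text{most}}\;
  \Pr_{R \sim \mathcal{D}}\bigl[\text{``left'' is optimal}\bigr],
\]
% where \(\ge_{\text{most}}\) means the inequality holds for at least half of the
% permuted variants of any reward-function distribution \(\mathcal{D}\).
```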