Research scientist on the scalable alignment team at Google DeepMind. All views are my own.
Dec 6 • 9 tweets • 3 min read
1) AIs are trained as black boxes, making it hard to understand or control their behavior. This is bad for safety! But what is an alternative? Our idea: train structure into a neural network by configuring which components update on different tasks. We call it "gradient routing."

2) With gradient routing, the user gets to decide which parameters update for each data point. By sending a particular kind of data to a particular network region, the user can induce specialization within the network. This enables a novel kind of neural net supervision.
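To make the mechanism concrete, here is a minimal sketch of per-example gradient routing in PyTorch. This is my own illustration, not the paper's code: GradientRoutedMLP, block_a/block_b, and the route_a flag are hypothetical names, and routing is implemented with a simple .detach() mask that leaves the forward value unchanged while cutting the gradient path to the non-routed block.

```python
import torch
import torch.nn as nn

class GradientRoutedMLP(nn.Module):
    """Hypothetical gradient-routing sketch (not the paper's implementation).

    Two parallel hidden blocks share an input and an output head. A
    per-example boolean route decides which block receives gradients for
    that example: the forward value always uses both blocks, but the
    gradient path to the non-routed block is severed with .detach().
    """

    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.block_a = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.block_b = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.head = nn.Linear(2 * d_hidden, d_out)

    def forward(self, x: torch.Tensor, route_a: torch.Tensor) -> torch.Tensor:
        # route_a: bool tensor of shape (batch,); True => gradients flow to block A.
        ha, hb = self.block_a(x), self.block_b(x)
        m = route_a.float().unsqueeze(-1)
        # Value is unchanged; only the gradient path depends on the route.
        ha = m * ha + (1.0 - m) * ha.detach()
        hb = (1.0 - m) * hb + m * hb.detach()
        return self.head(torch.cat([ha, hb], dim=-1))

# Usage sketch: route examples with label < 5 to block A, the rest to block B.
model = GradientRoutedMLP(784, 128, 10)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x, route_a=(y < 5)), y)
loss.backward()  # each example's gradient reaches only its routed block
```

Because routing changes only the backward pass, the forward computation stays a plain mixture of both blocks; any specialization emerges purely from which data each block learns from.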
Dec 6, 2021 • 5 tweets • 3 min read
#NeurIPS2021 spotlight: Optimal policies tend to seek power.
Consider Pac-Man: Dying traps Pac-Man in one state forever, while staying alive lets him do more things. Our theorems show that, for exactly this reason, staying alive is optimal under most reward functions. 🧵:
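The counting intuition can be checked numerically with a toy example of my own (not the paper's Pac-Man formalization): suppose "left" commits the agent to one terminal state while "right" lets it pick the better of two, with rewards drawn iid uniform over terminal states. "Right" is then optimal whenever the better of its two terminals beats the single "left" terminal, which happens about 2/3 of the time.

```python
import numpy as np

# Toy three-terminal MDP (hypothetical): "left" reaches terminal L only;
# "right" reaches terminals R1 and R2, and the agent keeps the better one.
rng = np.random.default_rng(0)
r_L, r_R1, r_R2 = rng.uniform(size=(3, 100_000))  # iid rewards per sampled reward function

right_optimal = np.maximum(r_R1, r_R2) >= r_L
print(f"P(right optimal) ≈ {right_optimal.mean():.3f}")  # ≈ 0.667: more options, more ways to be optimal
```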
We show this formally through *environment symmetries*. In this MDP, the visualized state permutation ϕ embeds the "left" subgraph into the "right" subgraph. The upshot: going "right" leads to more options, and more options → more ways for "right" to be optimal.
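As a rough paraphrase of the formal claim (my notation, approximated from the paper's visit-distribution machinery; the full statement differs in its details):

```latex
% F(s, a): the options (state-visit distributions) available after taking a at s.
% If a state permutation \phi embeds the "left" options into the "right" options,
\[
  \phi \cdot F(s, \mathrm{left}) \subseteq F(s, \mathrm{right})
  \;\Longrightarrow\;
  \Pr_{R \sim \mathcal{D}}\bigl[\text{``right'' is optimal}\bigr]
  \;\ge_{\text{most}}\;
  \Pr_{R \sim \mathcal{D}}\bigl[\text{``left'' is optimal}\bigr],
\]
% where \(\ge_{\text{most}}\) means the inequality holds for at least half of the
% permuted variants of any reward-function distribution \(\mathcal{D}\).
```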