At #ICML today: why is generalization so hard in value-based RL? We show that the TD targets used to train value-based agents evolve in a structured way, and that this structure encourages neural networks to ‘memorize’ the value function.
📺 icml.cc/virtual/2022/p…
📜 proceedings.mlr.press/v162/lyle22a.h…
TL;DR: the reward functions in most benchmark MDPs don’t look much like the corresponding value functions: in particular, the smooth* components of the value function tend to be missing from the reward!
*smooth ~= doesn't change much between adjacent states, e.g. a constant function.
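A toy illustration of what I mean (my own numpy sketch, not code from the paper): in a sparse-reward chain MDP, the reward is a single spike at the goal, but the discounted value function changes only a little between adjacent states.

```python
import numpy as np

n_states, gamma = 10, 0.9

# Reward: zero everywhere except the goal state (a "spiky", non-smooth signal).
reward = np.zeros(n_states)
reward[-1] = 1.0

# Deterministic "move right" dynamics; iterate V(s) = r(s) + gamma * V(s+1),
# with the goal treated as terminal (successor value 0).
values = np.zeros(n_states)
for _ in range(100):
    next_values = np.append(values[1:], 0.0)  # value of the successor state
    values = reward + gamma * next_values

print("reward:", np.round(reward, 3))   # [0, 0, ..., 0, 1]
print("value :", np.round(values, 3))   # gamma^(distance to goal), decays smoothly

# Largest jump between adjacent states: 1.0 for the reward, 0.1 for the value.
print("max |ΔR|:", np.max(np.abs(np.diff(reward))))
print("max |ΔV|:", np.max(np.abs(np.diff(values))))
```

The value function picks up a smooth, slowly-varying component (here, geometric decay with distance to the goal) that simply isn’t present in the reward the network starts bootstrapping from.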