Agents with a self-attention “bottleneck” can not only solve these tasks from pixel inputs with only 4000 parameters, they also generalize better!
article attentionagent.github.io
pdf arxiv.org/abs/2003.08165
Read on 👇🏼
The agent receives the full input, but we force it to see its world through the lens of a self-attention bottleneck that picks only 10 patches from the input (middle)
The controller's decision is based only on these patches (right)
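For the curious, here's a minimal numpy sketch of how such a top-K attention bottleneck could work. The projection sizes, patch count, and random weights below are illustrative assumptions, not the actual setup from the paper (where the attention weights are evolved, not backpropped):

```python
import numpy as np

def topk_patch_indices(patches, Wk, Wq, k=10):
    """Score flattened image patches with single-head self-attention
    and return the indices of the k most-attended patches."""
    K = patches @ Wk                          # (n, d_att) keys
    Q = patches @ Wq                          # (n, d_att) queries
    A = Q @ K.T / np.sqrt(K.shape[1])         # (n, n) attention logits
    A = np.exp(A - A.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)         # row-wise softmax
    votes = A.sum(axis=0)                     # total attention each patch receives
    return np.argsort(votes)[-k:][::-1]       # top-k patch indices, best first

# Illustrative numbers: 256 patches, each a flattened 7x7 RGB crop (147 dims)
rng = np.random.default_rng(0)
patches = rng.normal(size=(256, 147))
Wk = rng.normal(size=(147, 4))
Wq = rng.normal(size=(147, 4))
idx = topk_patch_indices(patches, Wk, Wq, k=10)
print(idx)  # positions of the 10 most-attended patches
```

The controller then sees only the (x, y) locations of those K patches, which is what makes the bottleneck so tiny in parameter count.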
Trained only in the top-left setting, it also performs in unseen settings with higher walls, different floor textures, or a distracting sign.
Without further training, we also test it on brighter/darker scenery, or with artifacts such as side bars or a background blob.
Of course not!
If we modify the game by adding a fake lane next to the real one, the agent prefers to attend to the fake lane and drives over to it instead, something human drivers with logical reasoning won't do (unless they're in another country!)
Some fun failure cases in the Discussion section:
When we suddenly replace the green background with a YouTube cat video, it stops to look at the cat's fat belly rather than focusing on the road 🐈
Even if we train our agent from scratch in a noisy background setting, it still attends only to the noise and not to the road.
Surprisingly, it learns to interpret those noise points as obstacles, and by avoiding them still manages to wobble through the track!
But when we decrease K to 5, it still attends to the noise rather than the road. Not surprisingly, increasing K to 20 improves performance.
CarRacingNoise-v0 will make a nice benchmark task.