Shimon Whiteson @shimon8282, 11 tweets
So... I just read the "World Models" paper (arxiv.org/abs/1803.10122) from Ha & Schmidhuber. This is a nicely written, well-researched paper with some cool/fun results. It also has a solid related work section and does a decent job of putting the work into context.
These are important strengths (and ones that are sadly not reliably present in a lot of recent deep learning papers). And yet... I don't like the approach proposed in this paper, and I feel pretty strongly that it is not the right way forward.
The main problem is the lack of supervision in the encoder. One of the fundamental challenges in scaling RL is feature construction, and it seems self-evident that this cannot in general be done without supervision.
It's easy to find problematic examples for an unsupervised approach: an agent driving by an irrelevant but dynamic background; or an agent trying to avoid a bullet whose presence changes only one pixel in its observation.
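To make the second example concrete, here is a back-of-the-envelope sketch. It assumes a purely unsupervised pixel-wise reconstruction loss on 64x64 frames (the kind of loss a standard VAE would use); the frame size and loss are illustrative assumptions, not figures quoted from the paper.

```python
# Back-of-the-envelope illustration of the "one-pixel bullet" worry, assuming
# a purely unsupervised pixel-wise reconstruction loss (as in a standard VAE)
# on a 64x64 frame. These numbers are illustrative assumptions.
frame_pixels = 64 * 64   # total pixels in one observation
bullet_pixels = 1        # pixels the bullet actually occupies

# Even reconstructing the bullet completely wrong moves the average
# reconstruction loss by at most this fraction of its range:
print(bullet_pixels / frame_pixels)  # ~0.00024: the loss barely notices the bullet
```

An encoder trained only to reconstruct therefore has almost no incentive to keep the bullet in its latent code, even though the task depends on it entirely.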
The point is that the efficacy of any feature often depends critically on the task. While I'm all for using auxiliary tasks/losses, these should supplement, not replace, the true task/loss.
The proposed approach doesn't solve this problem at all, but instead avoids it by focusing on tasks in which irrelevant distractions are apparently not very severe.
To their credit, the authors fully acknowledge the issue (paragraph 3 in Section 7) but present it as a "limitation" whereas in my opinion it is a fatal flaw. Let me try to explain why.
The proposed approach is not terribly novel, and the main thing distinguishing it from conventional model-based RL is the idea of putting large model capacity on the encoder/predictor, so that the actual control part can be low dimensional.
This is a clever strategy, and it lets them optimise the controller using a black-box optimiser (CMA-ES), which has some advantages but probably wouldn't scale to high dimensions. But a "separation of concerns" approach only works if the concerns really are separable!
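To make that structure concrete, here is a rough sketch (not the authors' code) of the separation: a frozen, high-capacity encoder/predictor hands a small feature vector to a tiny linear controller, and only the controller's handful of parameters is searched with CMA-ES via pycma's ask/tell interface. The dimensions are illustrative and `run_episode` is a hypothetical stand-in for a real environment rollout.

```python
import numpy as np
import cma  # pycma: pip install cma

# Illustrative sizes, roughly in the spirit of the paper's setup.
Z_DIM, H_DIM, A_DIM = 32, 256, 3                    # latent, RNN hidden, action dims
N_PARAMS = (Z_DIM + H_DIM) * A_DIM + A_DIM          # weights + biases of the controller

def controller(params, z, h):
    """Tiny linear policy: a = tanh(W [z; h] + b). This is all CMA-ES ever touches."""
    W = params[:(Z_DIM + H_DIM) * A_DIM].reshape(A_DIM, Z_DIM + H_DIM)
    b = params[(Z_DIM + H_DIM) * A_DIM:]
    return np.tanh(W @ np.concatenate([z, h]) + b)

def run_episode(policy, steps=100):
    """Hypothetical stand-in for a rollout: in the real pipeline, z and h would
    come from the frozen VAE encoder and RNN predictor. Returns a dummy return."""
    total = 0.0
    for _ in range(steps):
        z, h = np.random.randn(Z_DIM), np.random.randn(H_DIM)  # frozen features
        a = policy(z, h)
        total += float(-np.sum(a ** 2))                        # dummy reward signal
    return total

def fitness(params):
    # CMA-ES minimises, so negate the episode return.
    return -run_episode(lambda z, h: controller(params, z, h))

es = cma.CMAEvolutionStrategy(np.zeros(N_PARAMS), 0.5, {'maxiter': 20})
while not es.stop():
    candidates = es.ask()                                   # sample a population of controllers
    es.tell(candidates, [fitness(p) for p in candidates])   # update the search distribution
es.result_pretty()                                          # report the best controller found
```

Because no gradient ever flows into the encoder/predictor, CMA-ES only ever searches over N_PARAMS numbers; that is exactly the property that breaks down if the frozen features turn out to miss what the task needs.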
My point is that, in this case, they fundamentally are not separable. Learning good features requires some task-specific training!
You can make the training task-specific by pushing the reward signal through the encoder/predictor, but then you no longer have a separation of concerns and you haven't reduced the dimensionality of the control problem at all.
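A rough parameter-count contrast makes the trade-off explicit (illustrative numbers, not figures from the paper):

```python
# If only the linear controller is optimised, the search space is tiny:
controller_params = (32 + 256) * 3 + 3        # on the order of a thousand parameters

# If the reward/task signal is pushed back through the encoder/predictor, the
# optimisation now spans the full conv VAE + RNN as well (order 1e6-1e7;
# this figure is an assumption for illustration, not taken from the paper):
encoder_predictor_params = 5_000_000

print(controller_params)                              # 867
print(encoder_predictor_params // controller_params)  # search space grows by ~5000x
```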