Shimon Whiteson @shimon8282, 20 tweets
Next up on my summer reading list: GAN Q-Learning (arxiv.org/abs/1805.04874), a recent addition to the growing trend of distributional reinforcement learning. I find this trend intriguing and potentially quite exciting.
Unfortunately, this trend has also generated a lot of confusion. I keep hearing people talk about distributional RL as if it could help us model the agent's uncertainty about what it's learning, a clearly desirable feature that is missing in many canonical algorithms.
But that's simply not so: distributional RL models aleatoric uncertainty, not epistemic uncertainty. The latter stems from lack of knowledge about the world, modelled, e.g., with a prior/posterior. The former stems from "inherent" stochasticity, e.g., in transition dynamics.
It's easy to get these confused, but it's also easy to tell them apart. Will it concentrate in the limit of infinite data? Then it's epistemic uncertainty. Is it a stationary distribution describing a fixed property of the world? Then it's aleatoric uncertainty.
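To make the distinction concrete, here is a minimal sketch (mine, not from the paper): a Gaussian-reward bandit arm with unknown mean and known noise level. The posterior over the unknown mean (epistemic) concentrates as data grows; the spread of the return distribution itself (aleatoric) does not.

import numpy as np

# Illustrative only: one bandit arm with unknown mean mu and known noise std sigma.
rng = np.random.default_rng(0)
mu_true, sigma = 1.0, 2.0                   # sigma is a fixed property of the world

for n in (10, 1_000, 100_000):
    returns = rng.normal(mu_true, sigma, size=n)
    posterior_std = sigma / np.sqrt(n)      # epistemic: posterior over mu (flat prior) shrinks to 0
    aleatoric_std = returns.std()           # aleatoric: stays near sigma = 2.0 however much data we get
    print(n, round(posterior_std, 3), round(float(aleatoric_std), 3))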
The GAN Q-Learning paper seems confused between these two concepts. It says "By specifying the whole distribution of state values, the agent conserves all important information about its uncertainty of the environment, leading in expectation to more considerate action choices."
It's not true! It doesn't "conserve all important information about its uncertainty" because it doesn't model epistemic uncertainty at all.
In the canonical setting, where we aim to maximise expected cumulative discounted return, it won't lead to "more considerate action choices", except indirectly, i.e., learning the distribution of returns seems to help learning the expected return.
The reasons for this are poorly understood, though people have started trying to figure it out (see proceedings.mlr.press/v80/imani18a.h…).
But the point is, if it helps, it is only by improving our estimates of expected return. In the canonical setting, optimal behaviour is defined wrt the expectation, and you don't need to know the distribution of returns to behave optimally.
Of course, we might have a risk-sensitive objective, in which case we explicitly care about, e.g., the variance of the returns, and these are the settings in which distributional methods have historically been of interest.
But then we need to optimise our policy for this different objective, and the max operator in GAN Q-Learning (or any form of Q-learning) wouldn't be appropriate (more on this operator below).
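As a rough sketch of what acting on such an objective might look like (my own illustration, with a hypothetical sample_returns helper, not anything from the paper): under a mean-variance criterion, action selection operates on a risk-adjusted score of the return distribution, not on the plain expected value that the Q-learning max assumes.

import numpy as np

def risk_sensitive_action(sample_returns, state, actions, n=256, lam=0.5):
    # `sample_returns(state, action, n)` is an assumed stand-in for drawing n
    # return samples from a learned distributional model; not the paper's API.
    scores = []
    for a in actions:
        z = np.asarray(sample_returns(state, a, n))
        scores.append(z.mean() - lam * z.var())   # mean-variance score, not plain E[Z]
    return actions[int(np.argmax(scores))]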
The paper also cites some Bayesian RL methods such as GPTD in the context of methods that fit a value distribution. But these methods are completely different! They model the agent's epistemic uncertainty about the value function, not the aleatoric uncertainty about the return.
Nonetheless, I like the idea behind this paper. Distributional RL seems to be quite useful, and GANs are an effective way to model distributions, so it would be great to get the two working together.
The main challenge that I see is that the generative model produced by a GAN only yields samples, and gives no easy way to compute the expected value of the distribution it models.
These expectations are important, because Q-learning updates involve maximising these expectations wrt the actions. Remember: even if we are learning the distribution over returns, the optimal policy is still defined wrt expected return.
As far as I can tell, GAN Q-Learning addresses this by replacing a max over expectations with a max over samples, i.e., it samples one return per action from the generative model and then takes the max over those samples.
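To make the contrast concrete, here is a rough sketch of the two bootstrap targets as I read them; sample_return(s, a) is an assumed stand-in for drawing one return from the conditional generator, not the paper's actual interface.

import numpy as np

def q_learning_target(sample_return, r, s_next, actions, gamma=0.99, n=256):
    # Standard (distributional) Q-learning target: a max over *expected* returns,
    # here approximated by Monte Carlo averaging of generator samples.
    q_est = [np.mean([sample_return(s_next, a) for _ in range(n)]) for a in actions]
    return r + gamma * max(q_est)

def sample_max_target(sample_return, r, s_next, actions, gamma=0.99):
    # What GAN Q-Learning appears to do: one return sample per action,
    # then the max of those samples -- not the same quantity as max_a E[Z] in general.
    z = [sample_return(s_next, a) for a in actions]
    return r + gamma * max(z)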
This is an intriguing idea, perhaps the most important idea in the paper, but I can't find any explicit motivation for it. It definitely is not computing the max that Q-learning would compute.
It's tempting to motivate it by analogy to Thompson sampling, and indeed it looks quite a bit like Thompson sampling. However, let me be clear: it is absolutely nothing like Thompson sampling.
Thompson sampling works by sampling from posteriors over the action-values, i.e., models of epistemic uncertainty, to guide exploration. By contrast, GAN Q-Learning samples from models of aleatoric uncertainty and thus could never provide more than undirected exploration.
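A minimal sketch of that structural difference (names and models are illustrative, not from either paper): Thompson sampling draws a whole plausible value function from a posterior that shrinks as the agent learns, whereas sampling per-action returns from a learned aleatoric distribution injects noise whose magnitude is fixed by the environment.

def thompson_action(posterior_sample_q, state, actions):
    # Epistemic: draw one plausible Q-function from a posterior over value
    # functions, then act greedily on it; the posterior concentrates with data,
    # so the exploration it induces is directed and eventually vanishes.
    q = posterior_sample_q(state)                 # mapping a -> sampled Q(state, a)
    return max(actions, key=lambda a: q[a])

def aleatoric_sample_action(sample_return, state, actions):
    # Aleatoric: draw one return per action from the learned return distribution
    # Z(state, a); its spread is a fixed property of the world, so the induced
    # randomness never shrinks -- at best undirected exploration.
    return max(actions, key=lambda a: sample_return(state, a))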
If anything, it seems more related to soft Q-learning, and perhaps a motivation could be borrowed from that approach. Regardless, it might be a great idea, though if so, I don't yet understand why. I'd be interested to hear any motivational arguments anyone cares to propose!