In RL, the "implicit regularization" that helps deep learning find good solutions can actually lead to huge instability. See @aviral_kumar2's talk on DR3:
7/23 4pm PT RL for real: icml.cc/virtual/2021/w…
7/24 5:45pm PT Overparameterization WS talk icml.cc/virtual/2021/w…
#ICML2021

🧵
You can watch the talk in advance here:
And then come discuss the work with Aviral at the poster sessions! This work is not released yet, but it will be out shortly.

We're quite excited about this result, and I'll try to explain why.
Deep networks are overparameterized, meaning there are many parameter vectors that fit the training set. So why don't they overfit? While there are many possible explanations, they all revolve around some kind of "implicit regularization" that leads to solutions that generalize well.
One might surmise the same will be true in deep RL: deep RL will work well because it enjoys the same implicit regularization as supervised learning, whatever that might be -- seems reasonable, right?

But deep RL seems really finicky, unstable, and often diverges...
We provide both empirical and theoretical analysis suggesting that the very implicit regularization that makes supervised learning work actually *harms* RL with TD backups, because bootstrapped updates produce time-correlated features.
This is actually very bad, because time-correlated features alias good actions to bad actions, and lead to horrible solutions. Fortunately, once we recognize this, we can "undo" the bad part of implicit regularization with good *explicit* regularization, which we call DR3.
This helps across the board in offline RL, for different offline RL algorithms: it improves the stability of these methods (making it easier to pick the # of gradient steps) and mitigates the implicit under-parameterization effects that we analyzed in our prior work.
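To make the idea concrete, here is a minimal sketch of what an explicit DR3-style regularizer added to a TD loss could look like. This is illustrative PyTorch pseudocode of my own, not the released implementation: `feature_net`, `q_head`, `target_q`, the batch layout, and the coefficient `dr3_coef` are placeholder assumptions, and details such as stop-gradients differ in the actual paper.

```python
import torch
import torch.nn.functional as F

def td_loss_with_dr3(feature_net, q_head, target_q, batch, gamma=0.99, dr3_coef=0.03):
    """TD error plus an explicit DR3-style penalty on feature co-adaptation.

    feature_net(s, a) -> penultimate-layer features phi(s, a)
    q_head(phi)       -> scalar Q-value
    target_q(s, a)    -> target-network Q-value
    batch             -> dict of tensors: s, a, r, s_next, a_next, done
    (All names here are illustrative placeholders, not the paper's code.)
    """
    phi = feature_net(batch["s"], batch["a"])                 # phi(s, a)
    phi_next = feature_net(batch["s_next"], batch["a_next"])  # phi(s', a') used in the backup

    q = q_head(phi).squeeze(-1)
    with torch.no_grad():
        target = batch["r"] + gamma * (1.0 - batch["done"]) * target_q(batch["s_next"], batch["a_next"])

    td = F.mse_loss(q, target)

    # DR3-style explicit regularizer: discourage large dot products between
    # features of consecutive state-action pairs appearing in the TD backup,
    # which is where the time-correlated feature co-adaptation shows up.
    dr3 = (phi * phi_next).sum(dim=-1).mean()

    return td + dr3_coef * dr3
```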
The workshop paper is here: drive.google.com/file/d/1Fg43H5…

We'll release a full-length paper on arXiv after a few finishing touches and revisions following the workshop.


More from @svlevine

25 Jul
An "RL" take on compression: "super-lossy" compression that changes the image, but preserves its downstream effect (i.e., the user should take the same action seeing the "compressed" image as when they saw original) sites.google.com/view/pragmatic…

w @sidgreddy & @ancadianadragan

🧵
The idea is pretty simple: we use a GAN-style loss to classify whether the user would have taken the same downstream action upon seeing the compressed image or not. Action could mean a button press when playing a video game, or a click/decision on a website.
The compression itself is done with a generative latent variable model (we use styleGAN, but VAEs would work great too, as well as flows). PICO basically decides to throw out those bits that it determines (via its GAN loss) won't change the user's downstream decision.
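As a rough illustration of the training signal, here is a hedged sketch. It replaces the paper's adversarial (GAN-style) formulation with a simpler action-consistency term, and every name (`encoder`, `decoder`, `action_model`, `keep_mask`) is a placeholder I introduce for illustration, not PICO's actual interface.

```python
import torch
import torch.nn.functional as F

def pico_style_losses(encoder, decoder, action_model, image, keep_mask):
    """Hedged sketch of a pragmatic-compression objective in the spirit of PICO.

    encoder/decoder -> a pretrained generative latent-variable model (GAN/VAE/flow)
    action_model    -> predicts the user's downstream action distribution from an image
    keep_mask       -> which latent dimensions the compressor chooses to transmit
    (All names are illustrative; the actual PICO losses and architecture differ.)
    """
    z = encoder(image)
    z_compressed = z * keep_mask      # "throw out" the bits we decide not to send
    recon = decoder(z_compressed)

    # Functional term: the user should act the same on the reconstruction
    # as on the original image (a stand-in for the paper's GAN-style loss).
    logits_orig = action_model(image).detach()
    logits_recon = action_model(recon)
    action_consistency = F.kl_div(
        F.log_softmax(logits_recon, dim=-1),
        F.softmax(logits_orig, dim=-1),
        reduction="batchmean",
    )

    # Rate term: fewer transmitted latent dimensions = a smaller message.
    rate = keep_mask.float().mean()

    return action_consistency, rate
```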
23 Jul
RAIL will be presenting a number of exciting late-breaking poster results at the RL4RealLife WS #ICML2021 (8 pm PT today!): sites.google.com/view/RL4RealLi…

Algorithms for real-world RL w/ mobile manipulators, lifelong meta-learning methods, principled multi-task data sharing.

A thread:
We'll show how RL can control robots that learn to clean up a room, entirely in the real world. By Charles Sun, @ColinearDevin, @abhishekunique7, @jendk3r, @GlenBerseth.
We'll present CoMPS, an algorithm for online continual meta-learning, where an agent meta-learns tasks one by one, with each task accelerating future tasks. By @GlenBerseth, WilliamZhang365, @chelseabfinn.
18 Jul
Can we devise a more tractable RL problem if we give the agent examples of successful outcomes (states, not demos)? In MURAL, we show that uncertainty-aware classifiers trained with (meta) NML make RL much easier. At #ICML2021
arxiv.org/abs/2107.07184

A (short) thread:
The website has a summary: sites.google.com/view/mural-rl

If the agent gets some examples of high reward states, we can train a classifier to automatically provide shaped rewards (this is similar to methods like VICE). A standard classifier is not necessarily well shaped.
This is where the key idea in MURAL comes in: use normalized max likelihood (NML) to train a classifier that is aware of uncertainty. Label each state as either positive (success) or negative (failure), and use the ratio of likelihoods from these classifiers as reward!
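To show what that quantity is, here is a brute-force sketch of a conditional-NML success reward using scikit-learn. MURAL does not do this literally (it amortizes the computation with meta-learning, i.e., meta-NML), and the function and data names here are my own placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cnml_success_reward(query_state, success_states, visited_states):
    """Naive conditional-NML reward in the spirit of MURAL (illustrative only).

    Refit a small classifier twice, once with the query state labeled success
    and once labeled failure, and use the normalized likelihood ratio as the
    shaped reward. MURAL amortizes this with meta-learning; this brute-force
    version just shows the quantity being computed.
    """
    X = np.vstack([success_states, visited_states])
    y = np.concatenate([np.ones(len(success_states)), np.zeros(len(visited_states))])

    likelihoods = []
    for hypothetical_label in (1.0, 0.0):
        X_aug = np.vstack([X, query_state[None]])
        y_aug = np.concatenate([y, [hypothetical_label]])
        clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
        # Likelihood the refit model assigns to the label we forced on the query.
        p = clf.predict_proba(query_state[None])[0]
        likelihoods.append(p[1] if hypothetical_label == 1.0 else p[0])

    p_success, p_failure = likelihoods
    return p_success / (p_success + p_failure)   # shaped reward in [0, 1]
```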
18 Jul
Since many people were interested in our recent offline MBO work, I'll also write about a recent paper on MBO by Justin Fu, which trains forward models for each possible objective value and uses them to compute a posterior via NML: arxiv.org/abs/2102.07970

A thread:
The basic idea, unlike COMs (which learn pessimistic models), is to get a posterior over values for a new design x. Justin's method (NEMO) trains a separate model *for every possible value y* for the design x (discretized), and uses the likelihoods from these to get the posterior.
This corresponds to the normalized maximum likelihood (NML) distribution, which has appealing regret guarantees; in NEMO we extend these to provide regret guarantees for offline MBO as well! This is more complex than COMs, but potentially more powerful, since we get a full posterior.
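Here is a toy sketch of that construction: for each discretized candidate value, refit a simple forward model with the query design assigned that value, then normalize the resulting likelihoods into a posterior. The ridge-regression model, the Gaussian likelihood, and all names are illustrative stand-ins, not NEMO's actual networks.

```python
import numpy as np
from sklearn.linear_model import Ridge

def nml_posterior_over_values(x_query, X, y, value_grid, noise_std=1.0):
    """Illustrative NML-style posterior over objective values for a new design.

    For each candidate value v, refit a forward model on the data augmented
    with (x_query, v), then score how plausible v looks under that refit model.
    Normalizing these scores gives an NML-style posterior, in the spirit of
    NEMO (which trains one network per discretized value; this toy version
    uses ridge regression instead).
    """
    scores = []
    for v in value_grid:
        X_aug = np.vstack([X, x_query[None]])
        y_aug = np.concatenate([y, [v]])
        model = Ridge(alpha=1.0).fit(X_aug, y_aug)
        pred = model.predict(x_query[None])[0]
        # Gaussian likelihood of the hypothesized value under the refit model.
        scores.append(np.exp(-0.5 * ((v - pred) / noise_std) ** 2))
    scores = np.array(scores)
    return scores / scores.sum()    # posterior over value_grid
```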
16 Jul
Data-driven design is a lot like offline RL. Want to design a drug molecule, protein, or robot? Offline model-based optimization (MBO) tackles this, and our new algorithm, conservative objective models (COMs) provides a simple approach: arxiv.org/abs/2107.06882

A thread:
The basic setup: say you have prior experimental data D={(x,y)} (e.g., drugs you've tested). How to use it to get the best drug? Well, you could train a neural net f(x) = y, then pick the best x. This is a *very* bad idea, because you'll just get an adversarial example!
This is very important: lots of recent work shows how to train really good predictive models in biology, chemistry, etc. (e.g., AlphaFold), but using these for design runs into this adversarial example problem. This is actually very similar to problems we see in offline RL!
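A hedged sketch of the conservative training idea (my own PyTorch approximation, not the paper's code): fit the objective model to the data, but also push down its predictions on inputs found by ascending the current model, so that optimizing the learned objective doesn't land on adversarial designs. The hyperparameters and ascent procedure below are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def conservative_objective_loss(model, x_data, y_data,
                                ascent_steps=10, ascent_lr=0.05, alpha=0.1):
    """Illustrative conservative-objective-model loss in the spirit of COMs."""
    # Find "adversarial" designs by gradient ascent on the current model,
    # starting from the dataset designs.
    x_adv = x_data.clone().detach().requires_grad_(True)
    for _ in range(ascent_steps):
        grad = torch.autograd.grad(model(x_adv).sum(), x_adv)[0]
        x_adv = (x_adv + ascent_lr * grad).detach().requires_grad_(True)

    # Standard regression fit to the observed (design, score) pairs.
    regression = F.mse_loss(model(x_data).squeeze(-1), y_data)

    # Conservative term: push predicted values down on the ascended designs
    # and up on the dataset designs.
    conservative = model(x_adv).mean() - model(x_data).mean()

    return regression + alpha * conservative
```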
14 Jul
Empirical studies observed that generalization in RL is hard. Why? In a new paper, we provide a partial answer: generalization in RL induces partial observability, even for fully observed MDPs! This makes standard RL methods suboptimal.
arxiv.org/abs/2107.06277

A thread:
Take a look at this example: the agent plays a multi-step "guessing game" to label an image (not a bandit -- you get multiple guesses until you get it right!). We know that in MDPs there is an optimal deterministic policy, so RL will learn a deterministic policy here.
Of course, this is a bad idea -- if it guesses wrong on the first try, it should not guess the same label again. But this task *is* fully observed -- there is a unique mapping from image pixels to labels, the problem is that we just don't know what it is from training data!
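A toy environment makes the failure mode easy to see. This is purely illustrative code of my own, not the paper's benchmark: the observation never changes within an episode, so a memoryless deterministic policy that guesses wrong on the first step will keep repeating the same wrong guess.

```python
import random

class GuessingGame:
    """Toy version of the guessing-game example (illustrative, not the paper's setup).

    The observation is a fixed "image" (here just an integer id) whose true
    label the agent must guess; it keeps guessing until correct. Because the
    observation never changes within an episode, a memoryless deterministic
    policy repeats the same guess forever whenever its first guess is wrong.
    """
    def __init__(self, num_images=100, num_labels=10, seed=0):
        rng = random.Random(seed)
        self.labels = [rng.randrange(num_labels) for _ in range(num_images)]
        self.num_images = num_images

    def reset(self):
        self.image = random.randrange(self.num_images)
        return self.image                              # observation: the image only

    def step(self, guess):
        correct = (guess == self.labels[self.image])
        return self.image, (1.0 if correct else 0.0), correct
```

A policy that remembers (and eliminates) its past guesses, or randomizes appropriately, eventually labels every image; the deterministic optimum of the training MDP does not, which is the adaptive behavior standard fully-observed RL objectives fail to encourage here.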