Deep RL is hard: lots of hyperparameter tuning, instability. Perhaps there is a reason for this? It turns out that the same mechanism that makes supervised deep learning work well can make deep RL work poorly, leading to feature vectors that grow out of control: arxiv.org/abs/2112.04716

Let me explain:
Simple test: compare offline SARSA vs. offline TD. TD backs up with the behavior policy, so the *same* policy is used for the backup; the only difference is that SARSA uses the actions stored in the dataset, while TD samples *new* actions from that same distribution. The plot in the paper tracks phi(s,a)·phi(s',a'): the dot product of current and next state-action features.
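A toy sketch of the distinction, in a hypothetical tabular setup (all names here are made up for illustration; this is just to show where the two backups differ, not the paper's experimental code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 states, 3 actions, behavior policy pi_b(a|s).
n_states, n_actions, gamma = 4, 3, 0.99
pi_b = rng.dirichlet(np.ones(n_actions), size=n_states)  # behavior policy
Q = rng.normal(size=(n_states, n_actions))               # current Q estimates

# One logged transition (s, a, r, s'), with a' ~ pi_b(.|s') stored in the dataset.
s, a, r, s_next = 0, 1, 0.5, 2
a_logged = rng.choice(n_actions, p=pi_b[s_next])

# Offline SARSA: back up with the action actually recorded in the dataset.
sarsa_target = r + gamma * Q[s_next, a_logged]

# Offline TD: back up with a *freshly resampled* action from the same pi_b.
a_resampled = rng.choice(n_actions, p=pi_b[s_next])
td_target = r + gamma * Q[s_next, a_resampled]
```

Both targets have the same expectation under pi_b; only the sampling differs. That is what makes the experiment a clean test: any divergence in behavior comes from resampling, not from a different backup policy.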
Well, that's weird. Why do TD feature dot products grow and grow until the method becomes unstable, while SARSA's stay flat? To understand this, we must understand implicit regularization, which is what makes overparameterized models like deep nets avoid overfitting.
When training with SGD, deep nets don't find just *any* solution, but a well-regularized solution: SGD finds lower-norm solutions that then generalize well (see the derived regularizer in the paper). We might expect the same thing to happen when training with RL, and hence that deep RL will work well.
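Schematically, the label-noise SGD analyses this builds on show that, near a zero-loss solution, SGD implicitly penalizes the gradient norms of the network's outputs (constants omitted; this is a paraphrase, not the paper's exact expression):

```latex
\mathcal{R}_{\text{SL}}(\theta) \;\propto\; \sum_{i=1}^{n} \left\| \nabla_\theta f_\theta(x_i) \right\|_2^2
```

Penalizing output gradients keeps the learned function flat around the training points, which is one way to see why SGD's solutions generalize well.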
But if we apply the same analysis to deep RL that works for supervised learning, we can derive what the "implicit regularizer" for TD learning looks like. And it's not pretty: the first term looks like the supervised one, but the second term blows up feature dot products, exactly as we see in practice!
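Schematically, the TD version picks up an extra cross term coupling the gradients at consecutive state-action pairs (again a paraphrase with constants folded into c; the exact coefficients and stop-gradient placement are in the paper):

```latex
\mathcal{R}_{\text{TD}}(\theta) \;\approx\; \sum_i \left\| \nabla_\theta Q_\theta(s_i,a_i) \right\|_2^2 \;+\; c \sum_i \nabla_\theta Q_\theta(s_i,a_i)^\top \nabla_\theta Q_\theta(s'_i,a'_i)
```

Restricted to the last-layer weights, the gradient of Q(s,a) is just phi(s,a), so the second term reduces to exactly the dot product phi(s,a)·phi(s',a') that grows in the SARSA-vs-TD experiment above.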
In practice, we can simply add some *explicit* regularization on the features to counteract this nasty implicit regularizer. We call this DR3. It simply minimizes these feature dot products. Unlike normal regularizers, DR3 actually *increases* model capacity!
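A minimal sketch of that idea, assuming we can read out the penultimate-layer features (the function name and coefficient here are hypothetical, for illustration; see the paper for the real method and its coefficient choices):

```python
import numpy as np

def dr3_penalty(phi_sa, phi_next):
    """DR3-style explicit regularizer (sketch): mean dot product between
    features of (s, a) and features of (s', a'), added to the TD loss."""
    return np.mean(np.sum(phi_sa * phi_next, axis=1))

# Toy batch: features from a hypothetical encoder, batch of 32, dim 8.
rng = np.random.default_rng(0)
phi_sa = rng.normal(size=(32, 8))    # phi(s, a)
phi_next = rng.normal(size=(32, 8))  # phi(s', a')

td_loss = 0.123  # stand-in for the usual TD error term
c0 = 0.01        # regularization coefficient (a tuning knob, not from the thread)
total_loss = td_loss + c0 * dr3_penalty(phi_sa, phi_next)
```

In a deep-learning framework the penalty would simply be added to the critic loss before backprop, which is why it drops into existing offline RL algorithms with essentially no other changes.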
We can simply add DR3 to standard offline RL algorithms and boost their performance, with essentially no other modification. We hope that further research on overparameterization in deep RL will shed more light on why deep RL is unstable and how we can fix it!
This work was presented as a full-length oral at the NeurIPS Deep RL workshop: sites.google.com/view/deep-rl-w…
Paper: arxiv.org/abs/2112.04716

Work led by @aviral_kumar2, with @agarwl_, @tengyuma , @AaronCourville, @georgejtucker

