Tonight 12/10 9pm PT, Aviral Kumar will present Model Inversion Networks (MINs) at @NeurIPSConf. Offline model-based optimization (MBO) that uses data to optimize images, controllers and even protein sequences!
The problem setting: given samples (x,y) where x represents some input (e.g., protein sequence, image of a face, controller parameters) and y is some metric (e.g., how well x does at some task), find a new x* with the best y *without access to the true function*.
Classically, model-based optimization methods would learn a proxy function (acquisition function) fhat(x) ≈ y and then solve x* = argmax_x fhat(x), but this can produce OOD inputs to fhat when x is very high-dimensional.
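A minimal sketch of that classical recipe (not the MIN method), assuming a small PyTorch MLP proxy and gradient ascent over x; all names here are illustrative:

```python
# Sketch of the classical recipe: fit a proxy fhat(x) ~ y on the offline data,
# then gradient-ascend x against the frozen proxy. With high-dimensional x,
# the optimizer easily drifts off the data manifold, where fhat is unreliable.
import torch
import torch.nn as nn

def fit_proxy(xs, ys, dim, steps=1000):
    fhat = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))
    opt = torch.optim.Adam(fhat.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((fhat(xs).squeeze(-1) - ys) ** 2).mean()
        loss.backward()
        opt.step()
    return fhat

def optimize_x(fhat, x_init, steps=200, lr=1e-2):
    x = x_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-fhat(x).sum()).backward()   # maximize fhat(x)
        opt.step()
    return x.detach()                 # often an adversarial / OOD design
```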
This can be mitigated by active sampling (i.e., collecting more data), but this is often not possible in practice (e.g., requires running costly experiments). Or by using Bayesian models like GPs, but these are difficult to scale to high dimensions.
MINs address this issue with a simple approach: instead of learning f(x) = y, learn f^{-1}(y) = x -- here, the input y is very low dimensional (1D!), making it much easier to handle OOD inputs. This ends up working very well in practice.
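A very stripped-down sketch of the inverse idea (the actual MIN uses a conditional generative model with data reweighting; this is only an illustrative conditional regressor with noise input, and all names are assumptions):

```python
# Learn a mapping from score y (plus noise z, so multiple x can map to the
# same y) back to designs x. At test time, query the model at a high but
# in-distribution y to propose a new design.
import torch
import torch.nn as nn

class InverseModel(nn.Module):
    def __init__(self, x_dim, z_dim=32):
        super().__init__()
        self.z_dim = z_dim
        self.net = nn.Sequential(
            nn.Linear(1 + z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def forward(self, y, z):
        return self.net(torch.cat([y, z], dim=-1))

def train_inverse(model, xs, ys, steps=1000):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        z = torch.randn(xs.shape[0], model.z_dim)
        opt.zero_grad()
        loss = ((model(ys.unsqueeze(-1), z) - xs) ** 2).mean()
        loss.backward()
        opt.step()

# Querying the inverse model at a desired score y_query:
# x_star = model(torch.tensor([[y_query]]), torch.randn(1, model.z_dim))
```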
Why does this problem matter? In many cases we *already* have offline data (e.g., previously synthesized drugs and their efficacies, previously tested aircraft wings and their performance, prior microchips and their speed), so offline MBO uses this data to produce new designs.
In contrast, many alternative methods rely on active sampling of the data -- when we are talking about real world data (e.g., biology experiments, aircraft designs, etc.), each datapoint can be expensive and time-consuming, while offline MBO can reuse the same data.
My favorite part of @NeurIPSConf is the workshops, a chance to see new ideas and late-breaking work. Our lab will present a number of papers & talks at workshops:
thread below ->
meanwhile here is a teaser image :)
At the robot learning workshop, @katie_kang_ will present the best-paper-winning (congrats!!) “Multi-Robot Deep Reinforcement Learning via Hierarchically Integrated Models”: how to share modules between multiple real robots; recording here: (16:45 PT 12/11)
Greg Kahn's deep RL algorithm allows robots to navigate Berkeley's sidewalks! All the robot gets is a camera view and a supervision signal for when a safety driver told it to stop.
The idea is simple: a person follows the robot in a "training" phase (could also watch remotely from the camera), and stops the robot when it does something undesirable -- much like a safety driver might stop an autonomous car.
The robot then tries to take the actions that are least likely to lead to disengagement. The result is a learned policy that can navigate hundreds of meters of Berkeley sidewalks from raw images, without any SLAM or localization, using only a learned neural net.
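A rough sketch of the disengagement-avoidance idea (not the paper's exact code; the architecture and names below are assumptions): a network predicts the probability that a candidate action sequence leads to a human disengagement, and the robot picks the least risky candidate.

```python
# Train on logged (image, action sequence, disengaged?) data; at test time,
# sample candidate action sequences and execute the one with lowest risk.
import torch
import torch.nn as nn

class DisengagementModel(nn.Module):
    def __init__(self, action_dim, horizon):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(
            nn.Linear(32 + action_dim * horizon, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, image, action_seq):
        feat = self.encoder(image)
        logits = self.head(torch.cat([feat, action_seq.flatten(1)], dim=-1))
        return torch.sigmoid(logits)   # P(disengagement)

def choose_actions(model, image, action_dim, horizon, n_samples=64):
    # Sample random candidate action sequences, keep the least risky one.
    candidates = torch.rand(n_samples, horizon, action_dim) * 2 - 1
    risk = model(image.expand(n_samples, -1, -1, -1), candidates).squeeze(-1)
    return candidates[risk.argmin()]
```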
Can we view RL as supervised learning, but where we also "optimize" the data? New blog post by Ben, Aviral, and Abhishek: bair.berkeley.edu/blog/2020/10/1…
The idea: modify (reweight, resample, etc.) the data so that supervised regression onto actions produces better policies. More below:
Standard supervised learning is reliable and simple, but of course if we have random or bad data, supervised learning of policies (i.e., imitation) won't produce good results. However, a number of recently proposed algorithms can make this procedure work even with suboptimal data.
What is needed is to iteratively "modify" the data to make it more optimal than the previous iteration. One way to do this is by conditioning the policy on something about the data, such as a goal or even a total reward value.
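A minimal sketch of one instance of this idea, return-conditioned supervised learning (illustrative only; the blog post covers several reweighting/resampling/conditioning variants, and all names below are assumptions):

```python
# Regress onto the dataset's actions, conditioned on the return each
# trajectory actually achieved; at test time, condition on a high return.
import torch
import torch.nn as nn

class ReturnConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, obs, target_return):
        return self.net(torch.cat([obs, target_return], dim=-1))

def train_step(policy, opt, obs, actions, returns):
    # Plain supervised regression onto actions, conditioned on observed return.
    opt.zero_grad()
    loss = ((policy(obs, returns.unsqueeze(-1)) - actions) ** 2).mean()
    loss.backward()
    opt.step()
    return loss.item()

# At evaluation time, condition on a return near the top of the dataset:
# action = policy(obs, torch.tensor([[high_return]]))
```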
Interested in trying out offline RL? Justin Fu's blog post on designing a benchmark for offline RL, D4RL, is now up: bair.berkeley.edu/blog/2020/06/2…
D4RL is quickly becoming the most widely used benchmark for offline RL research! Check it out here: github.com/rail-berkeley/…
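Loading a dataset looks roughly like this (see the repo for the current API and full task list; the environment name below is just one example):

```python
# D4RL registers offline-RL environments with gym and exposes their datasets.
import gym
import d4rl  # noqa: F401  (import registers the environments)

env = gym.make('maze2d-umaze-v1')
dataset = env.get_dataset()            # dict of numpy arrays
print(dataset['observations'].shape,
      dataset['actions'].shape,
      dataset['rewards'].shape)

# d4rl.qlearning_dataset(env) returns (s, a, r, s') tuples formatted for
# Q-learning-style offline RL algorithms.
```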
An important consideration in D4RL is that datasets for offline RL research should *not* just come from near-optimal policies obtained with other RL algorithms, because this is not representative of how we would use offline RL in the real world. D4RL has a few types of datasets...
"Stitching" data provides trajectories that do not actually accomplish the task, but the dataset contains trajectories that accomplish parts of a task. The offline RL method must stitch these together, attaining much higher reward than the best trial in the dataset.