tl;dr: Our findings call for a change in how we evaluate performance on deep RL benchmarks. We present more reliable evaluation protocols, easily applicable with *even a handful of runs*, to keep unreliable results from slowing progress in the field.

arxiv.org/abs/2108.13264 (1/N)
Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. (2/N)
Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a handful of runs per task, exacerbating the statistical uncertainty in point estimates. (3/N)
Using a case study on Atari 100k, we show that commonly reported point estimates of median scores:
(1) exhibit high variability,
(2) are substantially biased, and
(3) require 30-50 runs to reliably claim improvement for certain comparisons, which is infeasible for most research projects! (4/N)
We also note that alternative evaluation protocols, such as reporting the maximum score across runs or during training, are incompatible with end-of-training performance results. On Atari 100k, we find that the two protocols produce substantially different results. (5/N)
So, how do we reliably evaluate performance? We argue that just fixing random seeds is not a solution. Similarly, evaluating more than a few runs is typically infeasible for computationally demanding deep RL benchmarks. (6/N)
Instead of dichotomous statistical significance tests, we emphasize statistical thinking: a performance estimate based on a finite number of runs is a random variable and should be treated as such. (7/N)
Following [1, 2], we recommend using confidence intervals to measure the uncertainty in results and to show which performance improvements over the baseline are compatible with the given data. (8/N)

[1] nature.com/articles/d4158…
[2] tandfonline.com/doi/full/10.10…
Our proposals are summarized in the table below. We'll walk through them one by one. (9/N)
The key trick: with N runs on each of M tasks, we have a total of NM random samples. Bootstrapping seems to work well in this scenario for providing reasonably accurate interval estimates. We define the interquartile mean (IQM) and the optimality gap across runs below. (10/N)
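As a concrete sketch of this idea (an illustrative reimplementation, not the authors' released library; all names and scores below are made up), IQM over the pooled NM run scores and a stratified-bootstrap confidence interval can be computed as:

```python
import numpy as np
from scipy import stats

def iqm(scores):
    """Interquartile mean: mean of the middle 50% of all run scores."""
    return stats.trim_mean(scores, proportiontocut=0.25, axis=None)

def stratified_bootstrap_ci(score_matrix, statistic, reps=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for `statistic`, resampling the N runs
    with replacement independently within each of the M tasks.

    score_matrix: array of shape (N runs, M tasks) of normalized scores.
    """
    rng = np.random.default_rng(seed)
    n_runs, n_tasks = score_matrix.shape
    boot_stats = np.empty(reps)
    for i in range(reps):
        # Stratified resampling: draw run indices per task, not globally.
        idx = rng.integers(n_runs, size=(n_runs, n_tasks))
        resampled = np.take_along_axis(score_matrix, idx, axis=0)
        boot_stats[i] = statistic(resampled)
    lo, hi = np.percentile(boot_stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Example: 5 runs on each of 10 tasks (hypothetical scores).
rng = np.random.default_rng(1)
scores = rng.normal(1.0, 0.3, size=(5, 10))
point, interval = iqm(scores), stratified_bootstrap_ci(scores, iqm)
```

The same `stratified_bootstrap_ci` works for any aggregate statistic, e.g. an optimality gap, by swapping in a different `statistic` callable.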
Aggregate metrics conceal variability in performance across runs and tasks, so we instead recommend reporting performance profiles. (11/N)
Performance profiles are typically used for comparing solve times of different optimization methods and are robust to outlier runs and tasks. Our profiles correspond to the empirical tail distribution function, with confidence bands based on a stratified bootstrap. (12/N)
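A minimal sketch of such a profile, pooling all NM runs (the thresholds and scores here are made up for illustration):

```python
import numpy as np

def performance_profile(score_matrix, thresholds):
    """Empirical tail distribution: fraction of runs, pooled across all
    tasks, whose normalized score exceeds each threshold tau."""
    flat = np.asarray(score_matrix).ravel()  # pool the N*M run scores
    return np.array([(flat > tau).mean() for tau in thresholds])

taus = np.linspace(0.0, 2.0, 5)
# 2 runs x 3 tasks of hypothetical normalized scores.
scores = np.array([[0.5, 1.2, 0.8],
                   [0.9, 1.8, 1.1]])
profile = performance_profile(scores, taus)  # nonincreasing in tau
```

Confidence bands around the profile come from the same stratified resampling of runs within each task mentioned in the tweet above.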
However, performance profiles often intersect, so we still need aggregate metrics for quantitative comparisons. Existing metrics are deficient: the mean is prone to outliers, while the median is unaffected by zero scores on nearly half of the tasks and is statistically inefficient. Our proposals: (13/N)
*IQM* corresponds to the mean score of the middle 50% of the *runs* combined across all tasks. Compared to the mean, it is robust to outliers. Compared to the median, it is a better indicator of overall performance, yields smaller CIs, and requires fewer runs to claim improvements. (14/N)
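A toy illustration of these robustness claims (the scores are made up): a single outlier run drags the mean far from the typical score, the median uses only the middle order statistics, and IQM averages the middle 50%:

```python
import numpy as np
from scipy import stats

# Hypothetical normalized scores from 8 runs; one run is an extreme outlier.
scores = np.array([0.9, 0.95, 1.0, 1.0, 1.0, 1.05, 1.1, 25.0])

mean = scores.mean()                                 # 4.0: dominated by the outlier
median = np.median(scores)                           # 1.0: ignores all but the middle
iqm = stats.trim_mean(scores, proportiontocut=0.25)  # 1.0125: mean of middle 50%
```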
On DM Control, there are huge overlaps in 95% CIs of mean normalized scores for most algorithms. These findings suggest that a lot of the reported improvements could be spurious, resulting from randomness in the experimental protocol. (15/N)
Another metric we propose is the *probability of improvement*, averaged across all tasks. This answers how likely an improvement over the baseline is, irrespective of its size! Applying it on Procgen, we find that some of the claimed improvements are only 50-70% likely. (16/N)
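A sketch of this metric (an illustrative reimplementation, not the released code): on each task, compare every run of algorithm X against every run of baseline Y, counting ties as half a win (the normalized Mann-Whitney U statistic), then average over tasks:

```python
import numpy as np

def prob_of_improvement(x_scores, y_scores):
    """P(X beats Y on one task): fraction of (X run, Y run) pairs where
    X scores higher, with ties counted as half."""
    x = np.asarray(x_scores, dtype=float)[:, None]
    y = np.asarray(y_scores, dtype=float)[None, :]
    return (x > y).mean() + 0.5 * (x == y).mean()

def avg_prob_of_improvement(x_runs, y_runs):
    """Average P(X > Y) across tasks; x_runs/y_runs map task -> run scores."""
    return float(np.mean([prob_of_improvement(x_runs[t], y_runs[t])
                          for t in x_runs]))
```

A value near 0.5 means the data cannot distinguish X from Y, which is exactly the regime some of the Procgen comparisons fall into.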
To support reliable evaluation in RL research, we have released an easy-to-use Python library, along with a Colab. See arxiv.org/abs/2108.13264 for more results and findings!

Colab: bit.ly/statistical_pr…
Library: github.com/google-researc…
Individual runs: console.cloud.google.com/storage/browse…

Thread by Rishabh Agarwal.