How does exploration vs exploitation affect reward estimation? Excited to share our #AISTATS2022 paper that constructs optimal reward estimators by leveraging the demonstrator's behavior en route to optimality.
🧵:
Exploration vs exploitation is key to any no-regret learner in bandits. We derive an information-theoretic lower bound that applies to any demonstrator, showing a quantitative tradeoff between exploration and reward-estimation error. 2/
This tradeoff immediately implies that estimation is impossible in the absence of exploration (e.g., when the demonstrator executes only an optimal policy), which is precisely the well-known identifiability issue in inverse RL. 3/
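For intuition only (a heuristic sketch, not the paper's exact statement): pulling a suboptimal arm costs the demonstrator regret, yet those pulls are the only way its behavior can encode information about that arm's gap.

```latex
% Heuristic illustration of the tension, up to constants; see the paper for the precise bound.
R_T \;=\; \sum_{a \neq a^\star} \Delta_a\, \mathbb{E}[N_a(T)],
\qquad
\big|\hat{\Delta}_a - \Delta_a\big| \;\gtrsim\; \frac{1}{\sqrt{\mathbb{E}[N_a(T)]}} .
% Small N_a(T) (low regret) forces large estimation error for arm a,
% and N_a(T) = 0 (pure exploitation) makes \Delta_a unidentifiable.
```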
Can we construct optimal estimators that achieve this lower bound? We show that the answer is YES for two popular families of learning algorithms: successive arm elimination (SAE), which is non-adaptive, and upper confidence bound (UCB), which is adaptive. 4/
Our estimators are simple and based directly on the sequence of actions performed by the learner. They show that for either type of demonstrator, exploration can be optimally leveraged for reward estimation, even though the exploration schedule takes a different form in each case. 5/
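To make the "actions alone" idea concrete, here is a minimal, hypothetical sketch (not the paper's actual estimator): a UCB demonstrator pulls a suboptimal arm on the order of log(T)/Δ_a² times, so an observer who sees only the actions can invert pull counts into gap estimates, accurate up to constant factors. The function names and constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ucb_demonstrator(means, T, c=2.0):
    """Run a standard UCB demonstrator; return only its action sequence."""
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    actions = []
    for t in range(T):
        if t < K:
            a = t  # pull each arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(c * np.log(t + 1) / counts)
            a = int(np.argmax(ucb))
        r = rng.normal(means[a], 1.0)  # reward seen only by the demonstrator
        counts[a] += 1
        sums[a] += r
        actions.append(a)
    return actions

def gap_estimates_from_actions(actions, K, c=2.0):
    """Hypothetical action-only estimator: invert UCB's pull counts.

    UCB pulls a suboptimal arm a on the order of log(T)/Delta_a^2 times,
    so sqrt(c*log(T)/N_a(T)) recovers Delta_a up to a constant factor.
    The most-pulled arm is treated as (near-)optimal, with gap 0.
    """
    T = len(actions)
    N = np.bincount(actions, minlength=K).astype(float)
    best = int(np.argmax(N))
    gaps = np.sqrt(c * np.log(T) / np.maximum(N, 1.0))
    gaps[best] = 0.0
    return gaps

means = np.array([0.9, 0.6, 0.5, 0.2])
actions = ucb_demonstrator(means, T=50_000)
print("true gaps:     ", np.round(means.max() - means, 3))
print("estimated gaps:", np.round(gap_estimates_from_actions(actions, K=4), 3))
```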
We test these estimators extensively on synthetic simulations as well as simulators for science domains (e.g., battery charging and gene expression), confirming these results: more observations & more exploration → better reward estimation. 6/
Full details: arxiv.org/abs/2106.14866
Code: github.com/wenshuoguo/inv…
Wonderful collaboration rooted at the Simons Institute, with Wenshuo Guo, Kumar Krishna Agrawal, Vidya Muthukumar, and Ashwin Pananjady! 7/7