Decision Transformer is just a GPT model conditioned on desired returns. Returns-to-go, states, and actions are fed into the model like tokens in a sentence (the trajectory).
At evaluation time, specify the desired episode return and autoregressively sample actions to get your policy.
2/8
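A minimal sketch of the token layout and the return-conditioned rollout described above. The names here (embed_*, dt_model.predict_action, the Gym-style env API) are illustrative placeholders, not the released implementation:

```python
import torch

# A trajectory is interleaved as (return-to-go, state, action) triples, so a
# context of K timesteps becomes 3*K tokens fed to a causal transformer.
def make_tokens(returns_to_go, states, actions,
                embed_rtg, embed_state, embed_action):
    # returns_to_go: (K, 1), states: (K, state_dim), actions: (K, act_dim)
    r = embed_rtg(returns_to_go)   # (K, d_model)
    s = embed_state(states)        # (K, d_model)
    a = embed_action(actions)      # (K, d_model)
    # Interleave to r_0, s_0, a_0, r_1, s_1, a_1, ...
    return torch.stack([r, s, a], dim=1).reshape(-1, r.shape[-1])  # (3K, d_model)


# Evaluation: fix a desired episode return, then autoregressively sample
# actions, decrementing the return-to-go by each reward actually received.
def rollout(env, dt_model, target_return, max_steps=1000):
    states, actions, rtgs = [env.reset()], [], [float(target_return)]
    for _ in range(max_steps):
        action = dt_model.predict_action(rtgs, states, actions)  # hypothetical helper
        state, reward, done, _ = env.step(action)                # classic Gym-style API
        actions.append(action)
        states.append(state)
        rtgs.append(rtgs[-1] - reward)  # condition on the return still to be achieved
        if done:
            break
    return states, actions
```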
For simplicity, we consider the offline RL setting (although we aren't limited to it).
In offline RL, we train on a fixed dataset of previously collected experience, mirroring the language-modeling setup and enabling data-driven behavior learning. But this isn't just imitation learning...
3/8
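A rough sketch of what one supervised training step looks like in this setup; dt_model, the batch format, and the choice of loss are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def train_step(dt_model, optimizer, batch):
    """One supervised update on a fixed offline dataset: predict each action
    from the preceding returns-to-go, states, and actions (via causal masking
    inside the model), just like next-token prediction in language modeling."""
    rtgs, states, actions = batch                    # sampled trajectory segments
    pred_actions = dt_model(rtgs, states, actions)   # causal transformer forward pass
    # Continuous actions -> MSE; for discrete actions use cross-entropy instead.
    loss = F.mse_loss(pred_actions, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```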
Like Q-learning algorithms, Decision Transformer can "stitch" together subsequences from distinct training examples - just with a sequence modeling objective!
When trained only on random walks over a graph, Decision Transformer learns to generate an optimal shortest path:
4/8
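One way to see how conditioning helps here: every timestep of a random, suboptimal trajectory is labeled with the return actually achieved from that point onward, so at test time the model can be asked for returns that only the best subsequences attained. A small illustrative helper (the reward values in the example are made up, not the paper's graph task):

```python
def returns_to_go(rewards, gamma=1.0):
    """Label each timestep of a (possibly random) trajectory with the return
    achieved from that point onward."""
    rtgs, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        rtgs.append(running)
    return rtgs[::-1]

# e.g. a random walk that wanders before reaching a goal
print(returns_to_go([-1, -1, -1, 0]))  # [-3.0, -2.0, -1.0, 0.0]
```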
On commonly studied offline RL benchmarks, we find this simple idea of sequence modeling with a scalable transformer model performs on par with (or better than) SoTA model-free offline RL algorithms!
5/8
Unlike traditional RL methods that learn narrow policies, Decision Transformer is naturally a multi-task model.
By conditioning on different target returns, we can output many different policies - in some cases, even extrapolating beyond the returns seen in the dataset:
6/8
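For example, reusing the hypothetical rollout() sketch from earlier in the thread, the same trained model can be queried as many different policies just by changing the commanded return (the targets below are arbitrary, not benchmark numbers):

```python
# Same weights, different behaviors: sweep the conditioning return.
for target in [300.0, 1800.0, 3600.0]:   # illustrative return targets
    states, actions = rollout(env, dt_model, target_return=target)
```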
Casting RL as sequence modeling with a transformer trained via supervised learning would allow us to leverage the scalability & infra behind successful models such as BERT, GPT-3, and DALL-E. We hope this work encourages more steps in this direction.
7/8