Decision Transformer is just a GPT model conditioned on desired returns. Returns, states, and actions are fed into the model as tokens, with a trajectory playing the role of a sentence.
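To make the "trajectory as a sentence" idea concrete, here is a minimal sketch of how one episode could be turned into a token sequence. It assumes that "return" means return-to-go (the sum of rewards from the current step onward); the function names `returns_to_go` and `interleave` and the tagged-tuple token format are hypothetical, chosen only for illustration.

```python
from typing import Any, List, Sequence, Tuple

Token = Tuple[str, Any]  # hypothetical token format: (kind, value)


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Sum of rewards from each timestep to the end of the episode."""
    rtg: List[float] = []
    running = 0.0
    # Walk backwards so each step adds its own reward to the future total.
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def interleave(
    rewards: Sequence[float],
    states: Sequence[Any],
    actions: Sequence[Any],
) -> List[Token]:
    """Lay out one episode as (return, state, action) triples, in time order."""
    tokens: List[Token] = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        tokens.append(("return", g))
        tokens.append(("state", s))
        tokens.append(("action", a))
    return tokens


if __name__ == "__main__":
    episode = interleave(
        rewards=[1.0, 0.0, 2.0],
        states=["s0", "s1", "s2"],
        actions=["a0", "a1", "a2"],
    )
    for token in episode:
        print(token)
```

In a real model each token would be embedded into a vector before entering the transformer; the sketch only shows the ordering of the sequence.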
At evaluation time, specify the desired episode return and sequentially sample actions to get your policy.
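The evaluation procedure can be sketched as a simple loop. This is an illustration under stated assumptions, not the authors' code: `rollout`, `toy_policy`, and `toy_env_step` are hypothetical names, the policy is a stand-in for a trained model, and the sketch assumes the target return is decremented by each observed reward so that the conditioning token always holds the return still to be achieved.

```python
from typing import Any, Callable, List, Tuple

Token = Tuple[str, Any]  # hypothetical token format: (kind, value)


def rollout(
    policy: Callable[[List[Token]], Any],
    env_step: Callable[[Any, Any], Tuple[Any, float, bool]],
    initial_state: Any,
    target_return: float,
    max_steps: int,
) -> Tuple[float, List[Token]]:
    """Generate actions one at a time, conditioned on the remaining return."""
    context: List[Token] = []
    state = initial_state
    remaining = target_return
    total = 0.0
    for _ in range(max_steps):
        # Append the conditioning tokens, then ask the model for an action.
        context.append(("return", remaining))
        context.append(("state", state))
        action = policy(context)
        context.append(("action", action))
        state, reward, done = env_step(state, action)
        total += reward
        remaining -= reward  # assumption: shrink the target by what was earned
        if done:
            break
    return total, context


def toy_policy(context: List[Token]) -> int:
    """Stand-in for a trained model: always moves one step to the right."""
    return 1


def toy_env_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Toy 1-D environment: reward 1 per step, episode ends at position 3."""
    next_state = state + action
    return next_state, 1.0, next_state >= 3


if __name__ == "__main__":
    achieved, tokens = rollout(toy_policy, toy_env_step, 0, 3.0, max_steps=10)
    print("achieved return:", achieved)
    print("return tokens:", [v for kind, v in tokens if kind == "return"])
```

A trained model would replace `toy_policy`, taking the recent context and sampling or predicting the next action.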
For simplicity, we consider the offline RL setting (although the approach isn't limited to it).
In offline RL, we train on a fixed dataset of collected experience, mirroring the language modeling setup and enabling data-driven behavior learning. But this isn't just imitation learning...
Like Q-learning algorithms, Decision Transformer can "stitch" together subsequences from distinct training examples - just with a sequence modeling objective!
When trained only on random walks over a graph, Decision Transformer learns to generate an optimal shortest path.
On commonly studied offline RL benchmarks, we find that this simple idea of sequence modeling with a scalable transformer model performs on par with (or better than) SoTA model-free offline RL algorithms!
Unlike traditional RL methods that learn narrow policies, Decision Transformer is naturally a multi-task model.
By conditioning on different target returns, we can output many different policies - in some cases, even extrapolating beyond the dataset.
Casting RL as a simple transformer trained with supervised learning would allow us to leverage the scalability & infra of successful models such as BERT, GPT-3, and DALL-E for RL. We hope this work encourages more steps in this direction.