RT-1 is a 2-year effort to bring the power of open-ended, task-agnostic training with a high-capacity architecture to the robotics world.
The magic sauce? A big and diverse robotic dataset + an efficient Transformer-based architecture
🧵
RT-1 learns to make decisions to complete a task via imitation, from a dataset of ~130k episodes covering about 700 general tasks, collected over the course of 17 months.
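
For the imitation part, here's a minimal behavior-cloning sketch in PyTorch, assuming the paper's setup of 11 action dimensions discretized into 256 bins each (the function and variable names are illustrative, not from the released code):

```python
import torch.nn.functional as F

NUM_BINS = 256      # per the paper, each action dimension is discretized into 256 bins
ACTION_DIMS = 11    # 7 arm + 3 base + 1 mode-switch dimensions

def behavior_cloning_loss(action_logits, demo_actions):
    """action_logits: (batch, ACTION_DIMS, NUM_BINS) predicted by the policy.
    demo_actions:  (batch, ACTION_DIMS) integer bin indices from the demonstrations."""
    return F.cross_entropy(
        action_logits.reshape(-1, NUM_BINS),   # each action dimension is a classification problem
        demo_actions.reshape(-1),
    )
```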
The architecture of RT-1 (rough sketch after this list) is made of:
- A Vision-Language CNN-based module that encodes the task instruction and image into 81 tokens
- A TokenLearner that attends over the 81 tokens and compresses them to 8
- A Decoder-only Transformer that predicts the next action
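
Here's a rough, self-contained PyTorch sketch of that pipeline. The shapes (81 → 8 tokens per frame, 11 action dims × 256 bins) follow the paper, but the module internals (a plain conv as image tokenizer, a crude additive stand-in for FiLM conditioning) are simplified placeholders rather than the actual RT-1 implementation:

```python
import torch
import torch.nn as nn

class RT1Sketch(nn.Module):
    def __init__(self, d_model=512, num_learned_tokens=8, action_dims=11, num_bins=256):
        super().__init__()
        # (1) Stand-in for the FiLM-conditioned EfficientNet: a conv that turns a
        #     288x288 image into a 9x9 grid = 81 visual tokens.
        self.image_tokenizer = nn.Conv2d(3, d_model, kernel_size=32, stride=32)
        # (2) TokenLearner: attention-style pooling of the 81 tokens down to 8.
        self.token_scorer = nn.Linear(d_model, num_learned_tokens)
        # (3) Decoder-only Transformer (causal self-attention) over the token history.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=8)
        self.action_head = nn.Linear(d_model, action_dims * num_bins)
        self.action_dims, self.num_bins = action_dims, num_bins

    def forward(self, images, instruction_emb):
        # images: (batch, history, 3, 288, 288), instruction_emb: (batch, d_model)
        b, t = images.shape[:2]
        x = self.image_tokenizer(images.flatten(0, 1))            # (b*t, d, 9, 9)
        tokens = x.flatten(2).transpose(1, 2)                     # (b*t, 81, d)
        # Crude stand-in for FiLM conditioning: add the instruction embedding.
        tokens = tokens + instruction_emb.repeat_interleave(t, 0)[:, None, :]
        weights = self.token_scorer(tokens).softmax(dim=1)        # (b*t, 81, 8)
        learned = torch.einsum("ntd,ntk->nkd", tokens, weights)   # (b*t, 8, d)
        seq = learned.reshape(b, -1, learned.shape[-1])           # (b, t*8, d)
        causal = torch.triu(torch.full((seq.shape[1],) * 2, float("-inf"),
                                       device=seq.device), diagonal=1)
        out = self.transformer(seq, mask=causal)
        logits = self.action_head(out[:, -1])                     # next-action logits
        return logits.view(b, self.action_dims, self.num_bins)
```

Training then reduces to the cross-entropy behavior-cloning loss sketched above, applied to these next-action logits.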
It was evaluated over 3000 real-world trials... A lot of work!
They found that RT-1, unlike prior models such as BC-Z & Gato, generalizes much better: it performs far better on unseen tasks and in scenes with more visual clutter.
RT-1 can also ingest heterogeneous data: it learns new skills not only from real and simulated sources, but also from tasks performed by different robots.