This is the story of an embodied multi-modal agent crafted over 4 papers and told in 4 posts
The agent can perceive its surroundings, manipulate objects, and react to human instructions in a 3D world
Work done by the Interactive Agents Team at @deepmind between 2019 and 2022
🧵
Imitating Interactive Intelligence arxiv.org/abs/2012.05672
The case for training the agent using Imitation Learning is outlined
The environment "The Playroom" is generated
The general multi-modal architecture is crafted
At the end, an auxiliary simil-GAIL loss is crucial 1/n
Interactive Agents with IL & SSL arxiv.org/abs/2112.03763
In the end it's all about scale and simplicity
The agent was hungry for data, so it was fed more
A simpler contrastive cross-modal loss replaced GAIL (minimal sketch after this tweet)
A hierarchical action scheme operating over 8 steps was introduced
New agent code name: MIA 2/n
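For intuition, here's a minimal sketch of a contrastive cross-modal loss in the InfoNCE style (vision embeddings pulled toward their paired language embeddings). The tensor names and temperature are illustrative assumptions, not MIA's exact implementation:

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(vision_emb, text_emb, temperature=0.1):
    """InfoNCE-style loss: matching vision/text pairs in the batch are
    positives, every other pairing is a negative. Illustrative only."""
    v = F.normalize(vision_emb, dim=-1)           # (B, d)
    t = F.normalize(text_emb, dim=-1)             # (B, d)
    logits = v @ t.T / temperature                # (B, B) cosine similarities
    targets = torch.arange(len(v), device=v.device)
    # Symmetric: each visual embedding must pick its text partner and vice versa
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```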
Evaluating Interactive Agents arxiv.org/abs/2205.13274
Evaluation becomes the bottleneck
Agents are evaluated with a new approach called the Standardized Test Suite. Still manually scored, but offline: faster, more interpretable & controllable
A new breed of agent is created: similar to MIA but tuned with RLHF and a learned reward model
As always... the agent ingested more data
A new interactive evaluation is also introduced 4/n
Question: With the new RLHF approach, did it converge to a more standard training methodology?
RT-1 is a 2-year effort to bring the power of open-ended, task-agnostic training with a high-capacity architecture to the robotics world.
The magic sauce? A big and diverse robotic dataset + an efficient Transformer-based architecture
🧵
RT-1 learns to make decisions to complete a task via imitation, from a dataset of ~130k episodes spanning about 700 general tasks, collected over the course of 17 months.
The architecture of RT-1 (sketched in code below) consists of:
- A vision-language CNN-based encoder that encodes the task instruction and image into 81 tokens
- A TokenLearner that attends over the 81 tokens and compresses them to 8
- A decoder-only Transformer that predicts the next action as discretized tokens
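To make the pipeline concrete, here's a rough sketch of that forward pass. Module names, dimensions, the simple pooling, and the action head are illustrative stand-ins (the real model conditions the CNN on the instruction via FiLM and attends over a short history of frames); this is not the RT-1 code.

```python
import torch
import torch.nn as nn

class RT1Sketch(nn.Module):
    """Illustrative stand-in for the RT-1 pipeline:
    CNN -> 81 tokens -> TokenLearner (8 tokens) -> Transformer -> discretized action."""
    def __init__(self, d=512, n_action_dims=11, n_bins=256):
        super().__init__()
        # 1) Vision-language encoder: maps a 300x300 image to a 9x9 grid
        #    of features = 81 tokens (instruction is added below as a crude
        #    stand-in for FiLM conditioning).
        self.encoder = nn.Conv2d(3, d, kernel_size=32, stride=32)
        # 2) TokenLearner: learned attention maps pool the 81 tokens down to 8.
        self.token_scores = nn.Linear(d, 8)
        # 3) Transformer over the compressed tokens (the paper uses a
        #    decoder-only model; an encoder stack is used here for brevity).
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=8)
        self.action_head = nn.Linear(d, n_action_dims * n_bins)
        self.n_action_dims, self.n_bins = n_action_dims, n_bins

    def forward(self, images, instruction_emb):
        # images: (B, 3, 300, 300); instruction_emb: (B, d)
        feats = self.encoder(images)                      # (B, d, 9, 9)
        tokens = feats.flatten(2).transpose(1, 2)         # (B, 81, d)
        tokens = tokens + instruction_emb[:, None, :]     # condition on language
        attn = self.token_scores(tokens).softmax(dim=1)   # (B, 81, 8)
        compressed = attn.transpose(1, 2) @ tokens        # (B, 8, d)
        h = self.transformer(compressed).mean(dim=1)      # (B, d)
        logits = self.action_head(h)                      # (B, dims * bins)
        return logits.view(-1, self.n_action_dims, self.n_bins)
```

At run time the highest-scoring bin for each action dimension is converted back into a continuous robot command.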