ChatGPT for Robotics? @DeepMind's latest work: a general AI agent that can perform any task from human instructions!
Or at least those allowed in "the Playroom"
The cherry on top of this agent is its RL fine-tuning from human feedback (RLHF), just like ChatGPT. 1/n
The base layer of the agent is trained with imitation learning and conditioned on language instructions
Initially, the agent had mediocre abilities
However, once it was fine-tuned with Reinforcement Learning and allowed to act on its own, its abilities went 🆙 significantly
2/n
The authors structured the RL problem by training a Reward Model on human feedback, and then using this reward model to optimize the agent with online RL
The reward model, called the Inter-temporal Bradley-Terry (IBT) model, is trained to predict human preferences over sub-trajectories
3/n
A sub-trajectory is preferred over another from the same episode if it represents an improvement toward the goal. Not preferred if it's a regression.
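Conceptually, that preference objective is a Bradley-Terry-style cross-entropy over the reward model's scores. A minimal PyTorch sketch (function names, shapes, and the labeling convention are my assumptions, not the paper's exact inter-temporal formulation):

```python
import torch.nn.functional as F

def preference_loss(reward_model, sub_traj_a, sub_traj_b, a_preferred):
    """Bradley-Terry-style preference loss over two sub-trajectories.

    reward_model : callable mapping a batch of sub-trajectories to a scalar
                   score per example, shape (batch,).
    a_preferred  : float tensor (batch,), 1.0 where sub_traj_a is the one
                   judged an improvement toward the goal, else 0.0.
    """
    score_a = reward_model(sub_traj_a)
    score_b = reward_model(sub_traj_b)
    # P(a preferred over b) = sigmoid(score_a - score_b); train with cross-entropy.
    return F.binary_cross_entropy_with_logits(score_a - score_b, a_preferred)
```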
Does it work? Check out this example 📊
It appears to be effective
4/n
Btw, they also augmented the loss of the IBT model with BC and contrastive SSL losses.
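Conceptually that augmentation is just a weighted sum of the three objectives; the coefficients below are placeholders, not the paper's values:

```python
# Hypothetical weights; the paper's actual coefficients aren't reproduced here.
LAMBDA_BC, LAMBDA_SSL = 0.1, 0.1

def ibt_total_loss(pref_loss, bc_loss, contrastive_ssl_loss):
    # Preference (reward-model) objective augmented with behavioural-cloning
    # and contrastive self-supervised auxiliary terms.
    return pref_loss + LAMBDA_BC * bc_loss + LAMBDA_SSL * contrastive_ssl_loss
```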
The BC+RL agent was trained using a "setter-replay" methodology. The environment was recreated based on some initial configs and the agent was left to interact freely & learn.
5/n
Guess what? BC+RL performed much better than everything else
They evaluated the agent in multiple ways: offline and online, both automatically and manually
In every setting the BC+RL agent comes out on top 6/n
Bonus point 1:
- BC + RL benefits from model scaling - Nice!
Bonus point 2:
- The agent can also be improved iteratively.
And it gets a lot better! 7/n
This is the story of an embodied multi-modal agent crafted over 4 papers and told in 4 posts
The agent is able to perceive its surroundings, manipulate objects, and react to human instructions in a 3D world
Work done by the Interactive Team at @DeepMind between 2019 and 2022
🧵
Imitating Interactive Intelligence arxiv.org/abs/2012.05672
The case for training the agent using Imitation Learning is outlined
The environment "The Playroom" is generated
The general multi-modal architecture is crafted
In the end, a GAIL-like auxiliary loss proves crucial 1/n
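For reference, a vanilla GAIL discriminator objective looks roughly like the sketch below (a generic illustration, not the paper's exact auxiliary loss; names and shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def gail_discriminator_loss(discriminator, expert_batch, agent_batch):
    """Standard GAIL discriminator objective.

    discriminator: maps (observation, action) features to one logit per example.
    Expert samples are labelled 1, agent samples 0; the discriminator's output
    then serves as an auxiliary imitation signal for the policy.
    """
    expert_logits = discriminator(expert_batch)
    agent_logits = discriminator(agent_batch)
    loss_expert = F.binary_cross_entropy_with_logits(
        expert_logits, torch.ones_like(expert_logits))
    loss_agent = F.binary_cross_entropy_with_logits(
        agent_logits, torch.zeros_like(agent_logits))
    return loss_expert + loss_agent
```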
Interactive Agents with IL & SSL arxiv.org/abs/2112.03763
In the end it's all about scale and simplicity
The agent was hungry for data, so it was fed more
A simpler contrastive cross-modal loss replaced GAIL (sketch after this tweet)
A hierarchical 8-step action scheme was introduced
New agent code name: MIA 2/n
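The contrastive cross-modal loss is in the InfoNCE family: match each observation embedding to its paired instruction embedding against in-batch negatives. A generic sketch (not MIA's exact formulation; names and the temperature value are assumptions):

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(vision_emb, language_emb, temperature=0.1):
    """InfoNCE-style loss between paired vision and language embeddings.

    vision_emb, language_emb: (batch, dim) embeddings of matching
    observation/instruction pairs; matched pairs are positives, all other
    pairings in the batch are negatives.
    """
    v = F.normalize(vision_emb, dim=-1)
    l = F.normalize(language_emb, dim=-1)
    logits = v @ l.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(v.shape[0], device=v.device)   # diagonal = positives
    # Symmetric cross-entropy: vision -> language and language -> vision.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```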
RT-1 is a 2-year effort to bring the power of open-ended, task-agnostic training with a high-capacity architecture to the robotics world.
The magic sauce? A big and diverse robotic dataset + an efficient Transformer-based architecture
🧵
RT-1 learns to make decisions to complete a task via imitation, from a dataset of ~130k episodes covering about 700 general tasks, collected over the course of 17 months.
The architecture of RT-1 is made of (sketch below):
- A vision-language, CNN-based encoder that encodes the task instruction and image into 81 tokens
- A TokenLearner that attends over the 81 tokens and compresses them to 8
- A decoder-only Transformer that predicts the next action
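Roughly, the data flow looks like this (illustrative shapes only; the sub-module names are stand-ins passed in by the caller, not the released RT-1 code):

```python
import torch.nn as nn

class RT1Sketch(nn.Module):
    """Data-flow sketch of the RT-1 pipeline described above."""

    def __init__(self, vision_language_encoder, token_learner, action_decoder):
        super().__init__()
        self.encoder = vision_language_encoder  # CNN: image + instruction -> 81 tokens
        self.token_learner = token_learner      # attends over 81 tokens, keeps 8
        self.decoder = action_decoder           # decoder-only Transformer over the kept tokens

    def forward(self, images, instruction_embedding):
        tokens = self.encoder(images, instruction_embedding)  # (batch, 81, d_model)
        compressed = self.token_learner(tokens)                # (batch, 8, d_model)
        return self.decoder(compressed)                        # logits over discretized action tokens
```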