Fine manipulation is difficult to learn, whether through RL, sim-to-real transfer, or imitation:
- Hard exploration and sparse rewards (RL)
- Large sim-to-real gap
- Compounding errors in behavior cloning (BC)
- No large-scale datasets
We introduce three important design choices behind ACT, an efficient imitation learning method:
(1) Predict action sequence
Standard BC predicts one action at a time, while a fine manipulation task can easily have >1000 steps.
Predicting actions in chunks slows down compounding error and better models non-stationary human behavior.
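To make the benefit concrete, here is a minimal sketch of a chunked rollout loop. The `policy` and `env_step` callables are toy stand-ins (not the ACT API): the point is that a 1000-step episode needs only 10 policy queries with a chunk size of 100, so errors compound over 10 decisions instead of 1000.

```python
import numpy as np

def rollout(policy, env_step, obs, horizon, chunk_size):
    """Query the policy once per chunk instead of once per step.

    Compounding error grows with the number of policy queries, so
    chunking shrinks the effective decision horizon by chunk_size.
    """
    actions, queries = [], 0
    t = 0
    while t < horizon:
        chunk = policy(obs)                  # (chunk_size, action_dim)
        queries += 1
        for a in chunk[: horizon - t]:       # execute open-loop within the chunk
            obs = env_step(obs, a)
            actions.append(a)
        t += chunk_size
    return np.array(actions), queries

# Toy policy/env just to show the query count shrinking.
rng = np.random.default_rng(0)
policy = lambda obs: rng.normal(size=(100, 14))   # chunk of 100 actions, 14 joints
env_step = lambda obs, a: obs + 0.0 * a.sum()     # dummy dynamics
actions, queries = rollout(policy, env_step, obs=0.0, horizon=1000, chunk_size=100)
print(queries)  # 10 queries for a 1000-step episode, vs 1000 without chunking
```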
(2) Generative model policy
The policy is trained as the decoder of a VAE, reconstructing action chunks from latent z, 4 RGB images, and proprioception.
Intuitively, z extracts the "style" of the action chunk.
This is crucial when learning from human demos.
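The training objective can be sketched as a standard conditional-VAE loss: reconstruct the demonstrated action chunk, plus a KL term keeping the style latent z close to a unit-Gaussian prior (at test time, z can simply be set to the prior mean). The `beta` weight below is an illustrative hyperparameter, and the L1 reconstruction is an assumption of this sketch, not a spec of the released code.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def cvae_loss(pred_chunk, true_chunk, mu, logvar, beta=10.0):
    """CVAE-style objective sketch: reconstruct the action chunk and
    regularize the style latent toward the prior. `beta` is illustrative."""
    recon = np.mean(np.abs(pred_chunk - true_chunk))   # L1 on the chunk
    return recon + beta * kl_to_standard_normal(mu, logvar)

# When reconstruction is perfect and the encoder matches the prior, loss is 0.
mu, logvar = np.zeros(32), np.zeros(32)
chunk = np.zeros((100, 14))
print(cvae_loss(chunk, chunk, mu, logvar))  # 0.0
```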
(3) Transformer
We modernize the VAE by using a BERT-like encoder and a DETR-like decoder, training end-to-end from scratch.
This transformer architecture benefits more from chunking than ConvNets and non-parametric methods.
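The DETR-like decoding step can be illustrated with a single cross-attention pass: k fixed query embeddings (one per action in the chunk) attend over the encoder's image + proprioception tokens, and each query emits one action embedding. This is a single-head NumPy sketch for intuition; the shapes and token counts below are assumptions, not the released architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def detr_style_decode(queries, memory, d):
    """One cross-attention step: k fixed queries attend over encoder
    features ("memory"); each query yields one action embedding."""
    attn = softmax(queries @ memory.T / np.sqrt(d))  # (k, n_tokens)
    return attn @ memory                             # (k, d)

rng = np.random.default_rng(0)
k, n_tokens, d = 100, 300, 512          # chunk size, encoder tokens, width (assumed)
queries = rng.normal(size=(k, d))       # learned per-position queries, DETR-style
memory = rng.normal(size=(n_tokens, d)) # encoded images + proprioception
out = detr_style_decode(queries, memory, d)
print(out.shape)  # (100, 512): one embedding per action in the chunk
```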
With all of the above, ACT achieves 64%, 96%, 84%, and 92% success on the 4 tasks shown, with objects randomized along a 15 cm line.
It does not just memorize the training data; it can react to external disturbances:
It is also robust to a certain level of distractor objects:
Similar to ALOHA, we open-source ACT together with 2 simulated environments for reproducibility. You can find it on the project website: tonyzhaozh.github.io/aloha/
We hope ALOHA + ACT will be a helpful resource for advancing fine-grained manipulation!
Personally, this was a challenging project to work on, spanning everything from hardware to ML.
It would certainly not have been possible without my amazing advisor @chelseabfinn and collaboration from @svlevine @Vikashplus!
Introducing ALOHA: A Low-cost Open-source Hardware System for Bimanual Teleoperation
After 8 months of iteration at @stanford and 2 months working with beta users, we are finally ready to release it!
Here is what ALOHA is capable of:
We built ALOHA to be maximally user-friendly for researchers: it is simple, dependable, and performant.
The whole system costs <$20k, yet it is more capable than setups with 5-10x the price.
How does it work? ALOHA has two leader and two follower arms, and syncs the joint positions from leaders to followers at 50 Hz. The user teleoperates by simply moving the leader arms.
This takes about 10 lines to implement, yet is intuitive and responsive anywhere within the joint limits.
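The core sync loop really is that small. Here is a minimal sketch, where `MockArm` and its `read_joints`/`command_joints` methods are hypothetical stand-ins for a real arm driver, not the actual ALOHA API:

```python
import time

class MockArm:
    """Stand-in for a real arm driver (hypothetical interface)."""
    def __init__(self, joints=None):
        self.joints = joints or [0.0] * 6
    def read_joints(self):
        return list(self.joints)
    def command_joints(self, q):
        self.joints = list(q)

def teleop_loop(leaders, followers, hz=50, steps=None):
    """Copy each leader's joint positions to its follower at `hz`."""
    period = 1.0 / hz
    t = 0
    while steps is None or t < steps:
        for leader, follower in zip(leaders, followers):
            follower.command_joints(leader.read_joints())
        time.sleep(period)  # a real loop would subtract read/command latency
        t += 1

# Demo with two mocked leader/follower pairs (left and right arms).
leaders = [MockArm([0.1] * 6), MockArm([0.2] * 6)]
followers = [MockArm(), MockArm()]
teleop_loop(leaders, followers, hz=1000, steps=3)
print(followers[0].joints[0], followers[1].joints[0])  # 0.1 0.2
```

Because leaders and followers share the same joint space, no inverse kinematics is needed; the user's hand motion maps directly to follower motion.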