Tony Z. Zhao
Mar 27, 2023 • 10 tweets • 4 min read
How can robots acquire fine-grained manipulation skills?

Introducing ACT: Action Chunking with Transformers 🤖

Key idea: Imitation, but predict actions in chunks instead of one at a time.

Here are results with only ~15min of demonstrations, running on low-cost arms:
In case you missed ALOHA 🏖, the hardware we use for all these experiments, here is the thread!
Fine manipulation is difficult to learn, whether through RL, Sim2Real, or imitation:

- Hard exploration and sparse reward
- Large Sim2Real gap
- Compounding error for BC
- No large dataset

We introduce three important design choices behind ACT, an efficient imitation learning method:
(1) Predict action sequence

Standard BC predicts one action at a time, while a fine manipulation task can easily have more than 1000 steps.

Predicting actions in chunks slows down compounding error and can better model non-stationary human behavior.
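A minimal sketch of what chunked execution could look like, to make the idea concrete. Everything here is illustrative: `rollout_chunked`, `toy_policy`, and `toy_step` are hypothetical names, not part of the released ACT code, and the chunk is simply executed open-loop before the policy is queried again.

```python
from typing import Callable, List, Tuple

Observation = List[float]
Action = float


def rollout_chunked(
    policy: Callable[[Observation, int], List[Action]],
    step: Callable[[Observation, Action], Observation],
    obs: Observation,
    horizon: int,
    chunk_size: int,
) -> Tuple[List[Action], int]:
    """Run `horizon` steps, querying the policy once per chunk of actions."""
    executed: List[Action] = []
    queries = 0
    t = 0
    while t < horizon:
        chunk = policy(obs, chunk_size)  # predict the next chunk_size actions
        if not chunk:
            raise ValueError("policy returned an empty chunk")
        queries += 1
        # Execute the chunk open-loop, truncating at the episode end.
        for action in chunk[: horizon - t]:
            obs = step(obs, action)
            executed.append(action)
            t += 1
    return executed, queries


# Hypothetical stand-ins for a learned policy and an environment.
def toy_policy(obs: Observation, chunk_size: int) -> List[Action]:
    return [0.01] * chunk_size


def toy_step(obs: Observation, action: Action) -> Observation:
    return [obs[0] + action]


if __name__ == "__main__":
    _, q_single = rollout_chunked(toy_policy, toy_step, [0.0], 1000, 1)
    _, q_chunked = rollout_chunked(toy_policy, toy_step, [0.0], 1000, 100)
    print(q_single, q_chunked)  # 1000 10
```

With a chunk size of 100, a 1000-step episode needs only 10 policy decisions instead of 1000, so there are far fewer points at which a prediction error can feed into the next prediction.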
(2) Generative model policy

The policy is trained as the decoder of a VAE, reconstructing action chunks from latent z, 4 RGB images, and proprioception.

Intuitively, z extracts the “style” of the action chunk.

This is crucial when learning from human demos.
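A rough sketch of the kind of objective a VAE-style policy is typically trained with: reconstruct the demonstrated action chunk while keeping the latent z close to a standard normal prior. The thread does not state the exact loss, so the L1 reconstruction term, the diagonal Gaussian posterior, and the `beta` weight below are all assumptions, and the function names are hypothetical.

```python
import math
import random
from typing import List

Chunk = List[List[float]]  # k actions, each a vector of joint targets


def kl_to_standard_normal(mu: List[float], logvar: List[float]) -> float:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def sample_latent(mu: List[float], logvar: List[float], rng: random.Random) -> List[float]:
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def cvae_loss(pred: Chunk, target: Chunk, mu: List[float],
              logvar: List[float], beta: float = 1.0) -> float:
    """Mean absolute reconstruction error plus beta-weighted KL (assumed form)."""
    diffs = [abs(p - t) for pa, ta in zip(pred, target) for p, t in zip(pa, ta)]
    recon = sum(diffs) / len(diffs)
    return recon + beta * kl_to_standard_normal(mu, logvar)


if __name__ == "__main__":
    rng = random.Random(0)
    z = sample_latent([0.0, 0.0], [0.0, 0.0], rng)  # z would condition the decoder
    loss = cvae_loss([[1.0, 2.0]], [[1.5, 2.0]], mu=[1.0], logvar=[0.0], beta=10.0)
    print(len(z), loss)  # 2 5.25
```

In a full model, the encoder would produce `mu` and `logvar` from the demonstrated chunk, and the decoder would produce `pred` from z, the camera images, and proprioception.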
(3) Transformer

We modernize the VAE by using a BERT-like encoder and a DETR-like decoder, training end-to-end from scratch.

This transformer architecture benefits more from chunking than ConvNets and non-parametric methods.
With all of the above, ACT achieves 64%, 96%, 84%, and 92% success on the 4 tasks shown, with objects randomized along a 15 cm line.

It does not just memorize the training data, and is able to react to external disturbances:
It is also robust to a certain level of distractor objects:
Similar to ALOHA, we open-source ACT together with 2 simulated environments for reproducibility. You can find it on the project website: tonyzhaozh.github.io/aloha/

We hope ALOHA+ACT will be a helpful resource for advancing fine-grained manipulation!
Personally, this was a challenging project to work on, spanning from hardware to ML.
It would certainly not be possible without my amazing advisor @chelseabfinn and collaboration from @svlevine @Vikashplus!

More from @tonyzzhao

Jan 3, 2024
Introducing Mobile ALOHA 🏄 -- Hardware!
A low-cost, open-source, mobile manipulator.

One of the most high-effort projects in my past 5yrs! Not possible without co-lead @zipengfu and @chelseabfinn.

At the end, what's better than cooking yourself a meal with the 🤖🧑‍🍳
How does Mobile ALOHA work? We seek to achieve a few more goals to augment the dexterity of the original ALOHA:
1. Moves fast. Similar to human walking speed of 1.42 m/s.
2. Stable. Manipulate heavy pots, a vacuum, etc.
3. Whole-body. All dofs teleoperated simultaneously.
4. Untethered. Onboard power and compute.
Mar 27, 2023
Introducing ALOHA 🏖: A Low-cost Open-source HArdware System for Bimanual Teleoperation

After 8 months iterating @stanford and 2 months working with beta users, we are finally ready to release it!

Here is what ALOHA is capable of:
We built ALOHA to be maximally user-friendly for researchers: it is simple, dependable and performant.

The whole system costs <$20k, yet it is more capable than setups with 5-10x the price.
How does it work? ALOHA has two leader & two follower arms, and syncs the joint positions from leaders to followers at 50Hz. The user teleops by simply moving the leader robots.

This takes 10 lines to implement, yet it is intuitive and responsive anywhere within the joint limits.
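For illustration, a joint-space sync loop of this kind might look roughly like the sketch below. This is not the ALOHA code: `FakeArm`, `read_joint_positions`, `set_joint_positions`, and `teleop_sync` are hypothetical stand-ins for whatever the real robot driver exposes.

```python
import time
from typing import Callable, List, Sequence


class FakeArm:
    """Hypothetical arm interface; a real driver would talk to the motors."""

    def __init__(self, positions: Sequence[float]) -> None:
        self.positions = list(positions)

    def read_joint_positions(self) -> List[float]:
        return list(self.positions)

    def set_joint_positions(self, positions: Sequence[float]) -> None:
        self.positions = list(positions)


def teleop_sync(leaders: Sequence[FakeArm], followers: Sequence[FakeArm],
                steps: int, hz: float = 50.0,
                sleep: Callable[[float], None] = time.sleep) -> None:
    """Copy each leader's joint positions to its follower at roughly `hz`."""
    period = 1.0 / hz
    for _ in range(steps):
        start = time.monotonic()
        for leader, follower in zip(leaders, followers):
            follower.set_joint_positions(leader.read_joint_positions())
        # Sleep for whatever is left of this control period.
        remaining = period - (time.monotonic() - start)
        if remaining > 0:
            sleep(remaining)


if __name__ == "__main__":
    leaders = [FakeArm([0.1, 0.2]), FakeArm([0.3, 0.4])]
    followers = [FakeArm([0.0, 0.0]), FakeArm([0.0, 0.0])]
    teleop_sync(leaders, followers, steps=5)
    print(followers[0].positions, followers[1].positions)  # [0.1, 0.2] [0.3, 0.4]
```

Because the mapping is joint-to-joint, no inverse kinematics is involved, which is one plausible reason such a loop stays short.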
