Jim (Linxi) Fan · Oct 7 · 8 tweets · 6 min read
We trained a transformer called VIMA that ingests *multimodal* prompts and outputs controls for a robot arm. A single agent can solve tasks such as visual goal reaching, one-shot imitation from video, novel concept grounding, and visual constraint satisfaction. Strong scaling with model capacity and data!🧵
We envision that a generalist robot agent should have an intuitive and expressive interface for task specification, but text alone is not enough. We introduce a novel multimodal prompting framework that converts a wide spectrum of robotic tasks into one sequence modeling problem.
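To make the idea concrete, here is a minimal sketch (all names hypothetical, not from the paper's codebase) of how interleaved text and reference images can be flattened into one token sequence, so that any robot task becomes a single sequence modeling problem:

```python
# Sketch: flatten a multimodal prompt (text interleaved with object
# images) into one unified token stream. Names are illustrative only.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class ImageToken:
    """Placeholder for an object/scene image embedded in the prompt."""
    name: str  # e.g. a crop of the target object

def tokenize_prompt(segments: List[Union[str, ImageToken]]) -> List[str]:
    """Flatten interleaved text/image segments into one token list.

    Text is split on whitespace; each image becomes a single special
    token, so the downstream model sees one unified sequence.
    """
    tokens: List[str] = []
    for seg in segments:
        if isinstance(seg, ImageToken):
            tokens.append(f"<img:{seg.name}>")
        else:
            tokens.extend(seg.split())
    return tokens

# Example: "Put the <heart> into the <bowl>" with two image references.
prompt = ["Put the", ImageToken("heart"), "into the", ImageToken("bowl")]
print(tokenize_prompt(prompt))
# → ['Put', 'the', '<img:heart>', 'into', 'the', '<img:bowl>']
```

In a real system the image tokens would be replaced by visual embeddings rather than strings, but the interface idea is the same: one sequence, any task.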
Our VIMA model (pronounced "v-eye-ma") consists of a pre-trained T5 that encodes the multimodal prompt and a transformer decoder that predicts robot arm commands autoregressively. The decoder has alternating self- and cross-attention layers conditioned on the prompt.
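The alternating-attention pattern can be sketched in a few lines of numpy. This is an illustration of the layer structure only (single head, no masking, no MLP blocks), not the released implementation:

```python
# Sketch: a decoder stack that alternates self-attention over the
# action/history tokens with cross-attention to the encoded prompt.
# Single-head, simplified (keys == values), for illustration only.
import numpy as np

def attention(q, kv):
    """Scaled dot-product attention: queries q attend to keys/values kv."""
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)                     # (Tq, Tkv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax rows
    return weights @ kv                                # (Tq, d)

def decode(history, prompt, n_layers=4):
    """Alternate self-attention (over history) and cross-attention (to prompt)."""
    x = history
    for layer in range(n_layers):
        if layer % 2 == 0:
            x = x + attention(x, x)        # self-attention over history
        else:
            x = x + attention(x, prompt)   # cross-attention to prompt
    return x

rng = np.random.default_rng(0)
history = rng.normal(size=(5, 8))   # 5 history tokens, dim 8
prompt = rng.normal(size=(7, 8))    # 7 encoded prompt tokens
out = decode(history, prompt)
print(out.shape)  # (5, 8)
```

The key design choice the thread describes is that the prompt is encoded once and reused, while every other decoder layer queries it, keeping the conditioning cheap and explicit.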
We introduce a new benchmark, VIMA-Bench, that features 17 meta-tasks with multimodal prompt templates, which can procedurally generate 1000s of tasks. We design a protocol of 4 generalization levels to systematically evaluate the zero-shot capabilities of the robot agents.
VIMA scales strongly with model capacity. Across all 4 generalization levels and model sizes ranging from 2M to 200M parameters, VIMA consistently outperforms all prior methods (Gato, Flamingo, DT). On the hardest novel-task generalization test, it obtains up to 2.9x better performance.
VIMA is highly sample efficient as well. With 10x less training data, it attains performance comparable to prior methods on average. On the hardest generalization setting, VIMA outperforms all baselines with 100x less data.
We open-source *everything*: code, pretrained models, dataset, and simulation benchmark!
🌐 Project site: vimalabs.github.io
📄 Arxiv: arxiv.org/abs/2210.03094
📄 PDF: vimalabs.github.io/assets/vima_pa…
💻 Codebases: github.com/vimalabs
Please follow our team members for project updates!
@YunfanJiang, @agrimgupta92, @zcczhang, @guanzhi_wang, @yongqiangdou, Yanjun Chen
Advisors: @drfeifei, @AnimaAnandkumar, @yukez, @DrJimFan. [END/🧵]
