Research Scientist @nvidiaai GEAR Lab, world modeling lead. On leave from PhD at @uwcse
Feb 4 • 11 tweets • 4 min read
Introducing DreamZero from @nvidia
> A 14B "World Action Model" that achieves zero-shot generalization to unseen tasks & few-shot adaptation to new robots
> The key? Jointly predicting video & actions in the same diffusion forward pass
VLAs know WHAT to do (semantics from VLM pretraining) but not HOW to execute with spatial precision (world physics).
"Move coke can to Taylor Swift" β
"Untie the shoelace" β (if never trained on it)
(2/10)
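A minimal sketch of what "jointly predicting video & actions in the same diffusion forward pass" could look like: both modalities are noised with a shared timestep and denoised by one shared backbone in a single forward call. All module names, shapes, and the noise schedule below are illustrative assumptions, not the DreamZero architecture.

```python
# Illustrative sketch only: joint video + action denoising in one forward pass.
# Module names, shapes, and the noise schedule are assumptions, not DreamZero.
import torch
import torch.nn as nn

class JointVideoActionDenoiser(nn.Module):
    def __init__(self, video_dim=256, action_dim=32, hidden=512, layers=4):
        super().__init__()
        self.video_in = nn.Linear(video_dim, hidden)
        self.action_in = nn.Linear(action_dim, hidden)
        self.time_in = nn.Linear(1, hidden)
        block = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.video_out = nn.Linear(hidden, video_dim)    # predicts video-latent noise
        self.action_out = nn.Linear(hidden, action_dim)  # predicts action noise

    def forward(self, noisy_video, noisy_actions, t):
        # noisy_video: (B, T_v, video_dim) latent video frames
        # noisy_actions: (B, T_a, action_dim) action chunk
        # t: (B, 1) diffusion timestep, shared by both modalities
        tok = torch.cat([
            self.video_in(noisy_video),
            self.action_in(noisy_actions),
        ], dim=1) + self.time_in(t).unsqueeze(1)
        h = self.backbone(tok)  # one shared forward pass over both modalities
        T_v = noisy_video.shape[1]
        return self.video_out(h[:, :T_v]), self.action_out(h[:, T_v:])

# Toy training step: noise both streams with the same timestep,
# denoise jointly, and supervise both heads.
model = JointVideoActionDenoiser()
video, actions = torch.randn(2, 8, 256), torch.randn(2, 16, 32)
t = torch.rand(2, 1)
noise_v, noise_a = torch.randn_like(video), torch.randn_like(actions)
pred_v, pred_a = model(video + t[..., None] * noise_v,
                       actions + t[..., None] * noise_a, t)
loss = nn.functional.mse_loss(pred_v, noise_v) + nn.functional.mse_loss(pred_a, noise_a)
loss.backward()
```

In this toy version the action tokens attend to the video tokens inside the same pass, so action prediction is grounded in the model's own prediction of how the scene will evolve.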
May 20, 2025 • 8 tweets • 4 min read
Introducing 𝐃𝐫𝐞𝐚𝐦𝐆𝐞𝐧!
We got humanoid robots to perform totally new 𝐯𝐞𝐫𝐛𝐬 in new environments through video world models.
We believe video world models will solve the data problem in robotics.
Shifting the paradigm from scaling human hours to scaling GPU hours.
Quick 🧵
Currently, robot data scaling is done through human labor. Recent work has shown promising signs of robots doing useful things in unseen homes (i.e., open-world generalization), but this required taking physical robots into 100+ homes to collect data.
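As a rough sketch of the "GPU hours instead of human hours" recipe described above: a video world model dreams trajectories for tasks no human ever teleoperated, a pseudo-labeler recovers actions from the dreamed videos, and the policy is trained on the result. Every function and attribute below is a hypothetical placeholder, not the DreamGen code.

```python
# Hedged sketch of a world-model data flywheel; all names are placeholders.

def generate_synthetic_dataset(world_model, pseudo_labeler, prompts, rollouts_per_prompt=100):
    """Turn GPU compute into (video, action) training pairs."""
    dataset = []
    for prompt in prompts:                       # e.g. a verb never teleoperated on the robot
        for _ in range(rollouts_per_prompt):
            video = world_model.rollout(prompt)  # dreamed video of the robot doing the task
            actions = pseudo_labeler(video)      # e.g. an inverse-dynamics model labels actions
            dataset.append((video, actions))
    return dataset

def train_policy(policy, dataset, optimizer):
    """Behavior-clone the policy on dreamed trajectories instead of human demos."""
    for video, actions in dataset:
        loss = policy.imitation_loss(video, actions)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```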
Oct 16, 2024 • 10 tweets • 4 min read
Excited to introduce 𝐋𝐀𝐏𝐀: the first unsupervised pretraining method for Vision-Language-Action models.
Outperforms SOTA models trained with ground-truth actions
30x more efficient than conventional VLA pretraining
📄: arxiv.org/abs/2410.11758
🧵 1/9
Vision-Language-Action (VLA) models, LLMs aligned with vision encoders, show strong generalization capabilities when pretrained on robot datasets. However, they remain limited by the scope of existing robot datasets.
2/9
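A hedged sketch of how pretraining without ground-truth actions can work: quantize the change between consecutive video frames into discrete latent-action tokens, then train the policy to predict those tokens from the current observation. Module names, shapes, and the codebook size are illustrative assumptions, not the LAPA implementation.

```python
# Illustrative sketch: latent-action pretraining from unlabeled video.
# Shapes, names, and codebook size are assumptions, not the LAPA code.
import torch
import torch.nn as nn

class LatentActionQuantizer(nn.Module):
    """Maps a (frame_t, frame_t+1) feature pair to a discrete latent-action code."""
    def __init__(self, feat_dim=512, codebook_size=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * feat_dim, 512), nn.ReLU(),
                                     nn.Linear(512, 128))
        self.codebook = nn.Embedding(codebook_size, 128)

    def forward(self, feat_t, feat_t1):
        z = self.encoder(torch.cat([feat_t, feat_t1], dim=-1))
        # nearest codebook entry: its index is the "latent action"
        dists = torch.cdist(z, self.codebook.weight)
        return dists.argmin(dim=-1)  # (B,) discrete latent-action tokens

# Pretraining step: the policy head learns to predict the latent-action token
# from the current frame features alone; no ground-truth robot actions needed.
quantizer = LatentActionQuantizer()
vla_head = nn.Linear(512, 256)  # stand-in for the full VLA; 256 = codebook size
feat_t, feat_t1 = torch.randn(8, 512), torch.randn(8, 512)  # frame features from video
with torch.no_grad():
    latent_actions = quantizer(feat_t, feat_t1)
loss = nn.functional.cross_entropy(vla_head(feat_t), latent_actions)
loss.backward()
```

Because the supervision signal comes purely from pairs of video frames, this kind of objective can draw on far more data than action-labeled robot datasets, which is the efficiency argument the thread makes.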