Jim Fan
@NVIDIA Sr. Research Manager & Lead of Embodied AI (GEAR Lab). Creating foundation models for Humanoid Robots & AI Agents. @Stanford Ph.D. @OpenAI's 1st intern.
Nov 1 5 tweets 3 min read
I don’t know if we live in a Matrix, but I know for sure that robots will spend most of their lives in simulation. Let machines train machines. I’m excited to introduce DexMimicGen, a massive-scale synthetic data generator that enables a humanoid robot to learn complex skills from only a handful of human demonstrations. Yes, as few as 5!

DexMimicGen addresses the biggest pain point in robotics: where do we get data? Unlike with LLMs, where vast amounts of text are readily available, you cannot simply download motor control signals from the internet. So researchers teleoperate robots via XR headsets to collect motion data. They have to repeat the same skill over and over and over again, because neural nets are data hungry. This is a very slow and uncomfortable process.

At NVIDIA, we believe the majority of high-quality tokens for robot foundation models will come from simulation.

What DexMimicGen does is trade GPU compute time for human time. It takes one motion trajectory from a human and multiplies it into 1000s of new trajectories. A robot brain trained on this augmented dataset will generalize far better in the real world.

Think of DexMimicGen as a learning signal amplifier. It maps a small dataset to a large (effectively infinite) dataset, using physics simulation in the loop. In this way, we free humans from babysitting the bots all day.

The future of robot data is generative.
The future of the entire robot learning pipeline will also be generative. 🧵 Here’s one example: imagine asking a human to repeat this task 1000s of times to gather enough data variations — they’d be bored out of their mind. Just ask a simulator to do the hard work!!

2/🧵
Jul 30 6 tweets 4 min read
Exciting updates on Project GR00T! We discovered a systematic way to scale up robot data, tackling the most painful bottleneck in robotics. The idea is simple: a human collects a demonstration on a real robot, and we multiply that data 1000x or more in simulation. Let’s break it down:

1. We use Apple Vision Pro (yes!!) to give the human operator first-person control of the humanoid. Vision Pro parses the human’s hand pose and retargets the motion to the robot hand, all in real time. From the human’s point of view, they are immersed in another body, like in Avatar. Teleoperation is slow and time-consuming, but we can afford to collect a small amount of data.

2. We use RoboCasa, a generative simulation framework, to multiply the demonstration data by varying the visual appearance and layout of the environment. In Jensen’s keynote video below, the humanoid is now placing the cup in hundreds of kitchens with a huge diversity of textures, furniture, and object placement. We only have 1 physical kitchen at the GEAR Lab in NVIDIA HQ, but we can conjure up infinite ones in simulation.

3. Finally, we apply MimicGen, a technique that multiplies the above data even further by varying the *motion* of the robot. MimicGen generates a vast number of new action trajectories based on the original human data, and filters out failed ones (e.g. those that drop the cup) to form a much larger dataset.

To sum up, given 1 human trajectory with Vision Pro
-> RoboCasa produces N (varying visuals)
-> MimicGen further augments to NxM (varying motions).
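
To make the bookkeeping concrete, here is a minimal Python sketch of that 1 -> N -> NxM multiplication. All helpers (randomize_scene, generate_motion_variant, simulate_and_check) are hypothetical placeholders, not the actual RoboCasa or MimicGen APIs.

```python
# Hedged sketch of the data multiplication above; the helper functions are
# hypothetical stand-ins for the RoboCasa / MimicGen steps, not their real APIs.
def multiply_demo(human_demo, n_scenes=100, m_motions=10):
    dataset = []
    for _ in range(n_scenes):                     # RoboCasa: vary visuals & layout
        scene = randomize_scene(human_demo.scene)
        for _ in range(m_motions):                # MimicGen: vary the motion itself
            traj = generate_motion_variant(human_demo.trajectory, scene)
            if simulate_and_check(scene, traj):   # keep only successful replays
                dataset.append((scene, traj))
    return dataset                                # up to N x M trajectories per human demo
```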

This is how we trade compute for expensive human data, using GPU-accelerated simulation. A while ago, I mentioned that teleoperation is fundamentally not scalable, because we are always limited by 24 hrs/robot/day in the world of atoms. Our new GR00T synthetic data pipeline breaks this barrier in the world of bits.

Scaling has been so much fun for LLMs, and it's finally our turn to have fun in robotics! We are building tools to enable everyone in the ecosystem to scale up with us. Links in thread:

RoboCasa: our generative simulation framework. It's fully open-source! Here you go:

robocasa.ai
Feb 23 10 tweets 4 min read
Career update: I am co-founding a new research group called "GEAR" at NVIDIA, with my long-time friend and collaborator Prof. @yukez. GEAR stands for Generalist Embodied Agent Research.

We believe in a future where every machine that moves will be autonomous, and robots and simulated agents will be as ubiquitous as iPhones. We are building the Foundation Agent — a generally capable AI that learns to act skillfully in many worlds, virtual and real.

2024 is the Year of Robotics, the Year of Gaming AI, and the Year of Simulation. We are setting out on a moon-landing mission, and getting there will spin off mountains of learnings and breakthroughs.

Join us on the journey: research.nvidia.com/labs/gear/

Here's a highlight thread on the exciting research that we spearheaded!

Eureka: GPT-4 writes reward functions to teach a 5-finger robot hand how to do pen spinning tricks better than I can. Trained with GPU-accelerated physics simulation at 1000x faster than real-time!

Jan 4 4 tweets 3 min read
What did I tell you a few days ago? 2024 is the year of robotics. Mobile-ALOHA is an open-source robot hardware that can do dexterous, bimanual tasks like cooking a meal (with human teleoperation). Very soon, hardware will no longer bottleneck us on the quest for human-level, generally capable robots. The brain will be.

This work was done by 3 researchers on an academic budget. What an incredible job! Stanford rocks! Congrats to @zipengfu @tonyzzhao @chelseabfinn

Academia is no longer the place for the biggest frontier LLMs, simply because of resource constraints. But robotics levels the playing field a bit between academia and industry, at least in the near term. More affordable hardware is the inevitable trend. Advice for aspiring PhD students: embrace robotics - less crowded, more impactful.

Website: mobile-aloha.github.io
Hardware assembly tutorial (oh yes we need more of these!): docs.google.com/document/d/1_3…
Codebase: github.com/MarkFzp/mobile…

Linking the great explanatory threads from the authors, @zipengfu 1/3

Dec 13, 2023 4 tweets 2 min read
I confirmed with friends on the team that they did not speed up the video. Having such smooth motions in real time, especially in hand dexterity, will unlock LOTS of new capabilities down the road. Regardless of how well you train the model in the world of bits, slow and unreliable hardware will always be the fundamental bottleneck in the world of atoms.

Tactile sensing on the fingers is the obvious path forward. Now we can train truly multimodal robot transformers that take in text, video, audio, touch, and proprioception (position, orientation, motion sensing), and some day even smell. The output is humanoid motor controls.

Can Optimus spin pens? Someone please try out our Eureka method and let me know? @Tesla_Optimus 👏 Btw, this is Eureka from my team at NVIDIA Research!

Nov 30, 2023 4 tweets 3 min read
This is the coolest Diffusion work I've seen in a while! It generates Visual Anagrams, a type of optical illusion where an image looks like one thing, but changes appearance when transformed.

It works with any orthogonal transformation matrices, which luckily include rotation, permutation (jigsaw puzzles), and color negation.

Intuitively, the method first inverts the noise from multiple image transforms (with different text prompts), and then averages them. After taking a diffusion step with the averaged noise, the resulting image becomes an anagram that aligns with the text prompts in their different views. It requires very little computation, using pre-trained Stable Diffusion.
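
A rough sketch of that averaging step, assuming a generic denoiser, per-view orthogonal transforms with inverses, and a scheduler (all placeholders, not the authors' actual code):

```python
# Hedged sketch of the multi-view noise averaging described above; `denoise`,
# the view transforms, and `scheduler` are hypothetical stand-ins.
import torch

def anagram_step(x_t, t, prompts, views, inverse_views, denoise, scheduler):
    eps_estimates = []
    for prompt, view, inv_view in zip(prompts, views, inverse_views):
        eps = denoise(view(x_t), t, prompt)            # predict noise in the transformed view
        eps_estimates.append(inv_view(eps))            # map the estimate back to the base view
    eps_avg = torch.stack(eps_estimates).mean(dim=0)   # average the estimates across views
    return scheduler.step(eps_avg, t, x_t)             # one reverse step with the averaged noise
```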

Simple, elegant, and inexpensive technique for non-professionals to create some interesting art!

Paper: arxiv.org/abs/2311.17919
Website: dangeng.github.io/visual_anagram…
It's open-source! github.com/dangeng/visual…
Authors: Daniel Geng, Inbum Park, Andrew Owens.

More examples, jigsaw:
Nov 20, 2023 8 tweets 3 min read
My team at NVIDIA is hiring. We 🩷 you all from OpenAI. Engineers, researchers, product team, alike. Email me at linxif@nvidia.com. DM is open too. NVIDIA has warm GPUs for you on a cold winter night like this, fresh out of the oven.🩷

I do research on AI agents. Gaming+AI, robotics, multimodal LLMs, open-ended simulations, etc. If you want an excuse to play games like Minecraft at work - I'm your guy.

I'm shocked by the ongoing development. I can only begin to grasp the depth of what you must be going through. Please, don't hesitate to ping me if there's anything I can do to help, or just say hi and share anything you'd like to talk about. I'm a good listener.
Sharing appetizers with my distinguished guests: here are my team's research highlights!

Voyager: the first LLM-powered agent that plays Minecraft proficiently. Voyager bootstraps its own capabilities as it explores the open-ended world continuously.

Oct 20, 2023 7 tweets 4 min read
Can GPT-4 teach a robot hand to do pen spinning tricks better than you do?

I'm excited to announce Eureka, an open-ended agent that designs reward functions for robot dexterity at super-human level. It’s like Voyager in the space of a physics simulator API!

Eureka bridges the gap between high-level reasoning (coding) and low-level motor control. It is a “hybrid-gradient architecture”: a black box, inference-only LLM instructs a white box, learnable neural network. The outer loop runs GPT-4 to refine the reward function (gradient-free), while the inner loop runs reinforcement learning to train a robot controller (gradient-based).
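
A minimal sketch of that outer/inner loop, with hypothetical helpers (llm_propose_reward, train_policy_with_rl, summarize_training_stats) standing in for the real Eureka components:

```python
# Hedged sketch of the hybrid-gradient loop, not the actual Eureka codebase.
def eureka_loop(task_description, env, llm, iterations=5, samples_per_iter=16):
    best_reward_fn, best_score, feedback = None, float("-inf"), ""
    for _ in range(iterations):                       # outer loop: gradient-free (GPT-4)
        candidates = [llm_propose_reward(llm, task_description, feedback)
                      for _ in range(samples_per_iter)]
        results = []
        for reward_fn in candidates:                  # inner loop: gradient-based (RL)
            policy, stats = train_policy_with_rl(env, reward_fn)
            results.append((stats["task_score"], reward_fn, stats))
        score, reward_fn, stats = max(results, key=lambda r: r[0])
        if score > best_score:
            best_score, best_reward_fn = score, reward_fn
        feedback = summarize_training_stats(stats)    # reflect the best run back into the prompt
    return best_reward_fn, best_score
```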

We are able to scale up Eureka thanks to IsaacGym, a GPU-accelerated physics simulator that speeds up reality by 1000x. On a benchmark suite of 29 tasks across 10 robots, Eureka rewards outperform expert human-written ones on 83% of the tasks, with an average improvement margin of 52%. We are surprised that Eureka is able to learn pen spinning tricks, which are very difficult even for CGI artists to animate frame by frame!

Eureka also enables a new form of in-context RLHF, which is able to incorporate a human operator’s feedback in natural language to steer and align the reward functions. It can serve as a powerful co-pilot for robot engineers to design sophisticated motor behaviors.

As usual, we open-source everything! You're all welcome to check out our video gallery and try the codebase today:
Paper:
Code:

Deep dive with me: 🧵
In robot learning, LLMs are good at generating high-level plans and mid-level actions like pick and place (VIMA, RT-1, etc.), but fall short of complex, high-frequency motor controls.

The Eureka! moment for us (pun intended) is that reward functions written as code are the key portal through which LLMs can venture into dexterous skills.

2/
Aug 23, 2023 14 tweets 5 min read
MidJourney has been sitting on the iron throne of text-to-image, and very few startups can mount a serious challenge.

But a new player is in town! Ideogram is built by the former Google Brain Imagen team. One of the founders, Jonathan Ho, was the lead author of the OG diffusion paper. The raw talent could make them a potential tour de force.

Ideogram's main strength now is text rendering. I played around with the model thanks to @a16z's invite. Lots of demos in the 🧵!

Launch:
Google Imagen:
Imagen-video:

Given the team's track record, I believe text-to-video is likely on their critical path.

Ideogram.ai

"An adorable minion holding a sign that says "It's over, MidJourney", spelled exactly, 3d render, typography"

Note that the view represents that of a minion, and NOT my own 🤔

All examples are cherry picked. Ideogram isn't always able to spell things correctly, but has a decent success rate.
Jul 24, 2023 4 tweets 2 min read
Robotics will be the last moat we conquer in AI. What would a RobotGPT's API look like?

Introducing VIMA, an LLM with a robot arm attached 🦾. It takes in multimodal prompts: text, images, videos, or a mixture of them.

You can say "rearrange the table to look like <image>" or…
This figure shows VIMA's multimodal prompting framework.

Left: we treat each image as a sub-sequence of "object tokens", and interleave them with the regular text tokens.

Right: VIMA decodes one robot arm action at a time, and reacts to the changes in the world.
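
A tiny sketch of what that interleaving might look like; encode_text and encode_object are hypothetical encoders, not VIMA's actual tokenizer:

```python
# Hedged sketch of the multimodal prompt interleaving described above.
def build_prompt_tokens(segments, encode_text, encode_object):
    """segments: ordered list of ("text", str) or ("image", [object_crops])."""
    tokens = []
    for kind, content in segments:
        if kind == "text":
            tokens.extend(encode_text(content))                   # regular text tokens
        else:
            tokens.extend(encode_object(obj) for obj in content)  # object tokens per image
    return tokens  # one interleaved sequence for the transformer to condition on
```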

2/
Jul 6, 2023 5 tweets 2 min read
I believe next-gen LLMs will heavily borrow insights from a decade of game AI research.

▸ Noam Brown, creator of Libratus poker AI, is joining OpenAI.
▸ Demis Hassabis says that DeepMind Gemini will tap techniques from AlphaGo.

These moves make a lot of sense. Methods like…
Noam Brown's announcement @polynoamial:
Jun 17, 2023 5 tweets 2 min read
When we run out of good training data in reality, simulation is the next gold mine.

Enter Infinigen: an open-source, procedurally generated, photorealistic dataset for 3D vision. The quality is stunning! No two worlds are the same.

▸ Every little detail is randomized and…

High-quality automatic annotations for training visual foundation models:
Jun 12, 2023 4 tweets 5 min read
Six years ago today, "Attention Is All You Need" went up on arXiv! Happy birthday, Transformer! 🎂

Fun facts:
- Transformer did not invent attention, but pushed it to the extreme. The first attention paper was published 3 years prior (2014) and had an unassuming title: "Neural Machine…
- Animation from the official Google Blog: ai.googleblog.com/2017/08/transf…
- NeurIPS 2017 link to Transformer: nips.cc/Conferences/20…
- Arxiv: arxiv.org/abs/1706.03762
May 31, 2023 15 tweets 6 min read
Chef's pick time: boosting the signal-to-noise ratio of AI Twitter, using my biological LLM.

Best of AI Curation Vol. 3, covers:

No-gradient architecture, LLM tool making and mastery (3 papers!), RLHF without RL, an uncensored LLM, an open course, and more.

Deep dive with me: 🧵

No-gradient architecture is the future for decision-making agents. LLM acts as a "prefrontal cortex" that orchestrates lower-level control APIs via code generation. Voyager takes the first step in Minecraft. @karpathy says it best, as always:

May 26, 2023 12 tweets 7 min read
What if we set GPT-4 free in Minecraft? ⛏️

I’m excited to announce Voyager, the first lifelong learning agent that plays Minecraft purely in-context. Voyager continuously improves itself by writing, refining, committing, and retrieving *code* from a skill library.

GPT-4 unlocks…

Generally capable, autonomous agents are the next frontier of AI. They continuously explore, plan, and develop new skills in open-ended worlds, driven by survival & curiosity.

Minecraft is by far the best testbed with endless possibilities for agents:

May 25, 2023 11 tweets 4 min read
$NVDA will not stop at selling picks & shovels for the LLM gold rush. Foundation Models as a Service is coming.

I have the great fortune to play a part in NVIDIA Research, which produces too many top-notch AI works to count. Some examples: 🧵

NVIDIA AI Foundation, a new initiative that Jensen announced in March:

- LLM customized to enterprise proprietary data.
- Multimodal generative models
- Biology LLM!

May 23, 2023 4 tweets 2 min read
3 mo ago, I said Windows will be the first AI-first OS. Sure enough, Microsoft delivers with a sharp vision and a steady hand. To me, Windows Copilot is a way bigger deal than Bing Chat. It's becoming a full-fledged agent that takes *actions* at the OS & native software level, given…

Sorry, re-posted this because the video failed to work in the previous post.

Here's Microsoft's official launch blog: blogs.windows.com/windowsdevelop…
May 22, 2023 11 tweets 6 min read
Curating high-quality posts from AI Twitter with my own take, Vol. 2

No breaking news, no productivity hack, no insane moments. Just some solid stuff that makes AI a bit better than last week.

Time for Chef's pick. Here we go:

MEGABYTE from Meta AI, a multi-resolution Transformer that operates directly on raw bytes. This signals the beginning of the end of tokenization.

Why is tokenization undesirable? @karpathy explains it best:

May 21, 2023 4 tweets 2 min read
If you insert electric probes into an insect *before* adulthood, its tissues can organically grow around the probe and unlock a high-bandwidth insect-machine interface.

Then you can read data from the insect's brain and *control* its flight by stimulation. This is from 2009, but…

Paper: Insect–Machine Interface Based Neurocybernetics.

Link: ibionics.ece.ncsu.edu/assets/Publica…
May 12, 2023 12 tweets 7 min read
AI Twitter is flooded with low-quality stuff recently. No, GPT is not “dethroned”. And thin wrapper apps are not “insane”. At all.

I feel obligated to surface some quality posts I bookmarked. Every one of them should've been promoted 10x, but ¯\_(ツ)_/¯

In no particular order:

If you only have 1 seat to follow in AI Twitter, don't give that seat to me. Give it to @karpathy.

Andrej has the best take, by far, on the landscape of the open-source LLM ecosystem.

1/
May 10, 2023 4 tweets 2 min read
Finally happening: HuggingFace Transformers Agent. It enables a coding LLM to compose other HF models on the fly to solve multimodal tasks.

It's a step towards the Everything App, which grows in capability as the ecosystem grows.

I've been waiting for this since HuggingGPT: 🧵

HuggingGPT is the first demonstration of such an idea at scale. It uses GPT as a controller to dynamically pick tools (models) to solve a multi-stage task.

2/
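
For reference, a minimal usage sketch of the Transformers Agents API roughly as it shipped (class name, endpoint, and prompt are from memory and may differ from the current docs; treat them as assumptions):

```python
# Hedged sketch of HuggingFace Transformers Agents usage (~transformers v4.29);
# the endpoint and prompt are assumptions, not verified against current docs.
from transformers import HfAgent
from PIL import Image

image = Image.open("example.jpg")  # any local image (hypothetical path)

# A remote code-generating LLM acts as the controller...
agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")

# ...and composes other HF models (captioning, TTS, etc.) on the fly to solve the task.
agent.run("Caption the following image, then read the caption out loud.", image=image)
```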