Jim Fan
May 26, 2023 · 12 tweets
What if we set GPT-4 free in Minecraft? ⛏️

I’m excited to announce Voyager, the first lifelong learning agent that plays Minecraft purely in-context. Voyager continuously improves itself by writing, refining, committing, and retrieving *code* from a skill library.

GPT-4 unlocks…
Generally capable, autonomous agents are the next frontier of AI. They continuously explore, plan, and develop new skills in open-ended worlds, driven by survival & curiosity.

Minecraft is by far the best testbed with endless possibilities for agents.

Voyager has 3 key components:

1) An iterative prompting mechanism that incorporates game feedback, execution errors, and self-verification to refine programs;
2) A skill library of code to store & retrieve complex behaviors;
3) An automatic curriculum to maximize exploration.
First, Voyager attempts to write a program to achieve a particular goal, using a popular JavaScript Minecraft API (Mineflayer). The program is likely incorrect on the first try. The game environment feedback and JavaScript execution error (if any) help GPT-4 refine the program.
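To make the loop concrete, here is a minimal Python sketch of what such a refinement cycle could look like. The helper names (`llm`, `run_in_minecraft`, `verify`) are stand-ins, not functions from the Voyager codebase:

```python
# A minimal, hypothetical sketch of Voyager's iterative prompting loop.
# `llm`, `run_in_minecraft`, and `verify` stand in for the real components
# (GPT-4 call, Mineflayer execution, GPT-4 self-verification).
from typing import Callable, Optional, Tuple

def write_skill(
    task: str,
    context: str,
    llm: Callable[[str], str],
    run_in_minecraft: Callable[[str], Tuple[str, Optional[str]]],
    verify: Callable[[str, str], bool],
    max_rounds: int = 4,
) -> Optional[str]:
    """Ask the LLM for a Mineflayer (JavaScript) program, then refine it with feedback."""
    feedback = ""
    for _ in range(max_rounds):
        prompt = f"Task: {task}\nContext: {context}\nFeedback: {feedback}"
        program_js = llm(prompt)                       # GPT-4 writes JavaScript code
        env_log, error = run_in_minecraft(program_js)  # execute via the Mineflayer API
        if error:                                      # JavaScript execution error, if any
            feedback = f"Execution error: {error}"
            continue
        if verify(task, env_log):                      # self-verification: did we succeed?
            return program_js                          # success: keep the working program
        feedback = f"Environment feedback: {env_log}"
    return None                                        # no working program after max_rounds
```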
Second, Voyager incrementally builds a skill library by storing the successful programs in a vector DB. Each program can be retrieved by the embedding of its docstring. Complex skills are synthesized by composing simpler skills, which compounds Voyager’s capabilities over time.
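A rough sketch of such a skill library, using a plain in-memory list and cosine similarity in place of the actual vector DB; `embed` is a stand-in for any text-embedding model:

```python
# Hypothetical sketch of a skill library keyed by docstring embeddings.
from typing import Callable, List, Tuple
import math

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-9)

class SkillLibrary:
    def __init__(self, embed: Callable[[str], List[float]]):
        self.embed = embed
        self.skills: List[Tuple[List[float], str, str]] = []  # (embedding, docstring, code)

    def add(self, docstring: str, program_js: str) -> None:
        """Store a verified program, indexed by the embedding of its docstring."""
        self.skills.append((self.embed(docstring), docstring, program_js))

    def retrieve(self, query: str, k: int = 5) -> List[str]:
        """Return the k most relevant stored programs for a new task description."""
        q = self.embed(query)
        ranked = sorted(self.skills, key=lambda s: cosine(q, s[0]), reverse=True)
        return [code for _, _, code in ranked[:k]]
```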
Third, an automatic curriculum proposes suitable exploration tasks based on the agent’s current skill level & world state, e.g. learn to harvest sand & cactus before iron if it finds itself in a desert rather than a forest.

Think of it as an in-context form of *novelty search*.
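The curriculum can be pictured as a single in-context query conditioned on the agent's state. The prompt wording below is illustrative, not the paper's actual prompt:

```python
# Hypothetical sketch of the automatic curriculum: propose the next task in-context.
from typing import Callable, List

def propose_next_task(
    llm: Callable[[str], str],
    completed: List[str],
    failed: List[str],
    world_state: str,     # e.g. biome, nearby blocks, inventory, health
) -> str:
    prompt = (
        "You are a curriculum for a Minecraft agent doing open-ended exploration.\n"
        f"World state: {world_state}\n"
        f"Completed tasks: {completed}\n"
        f"Failed tasks: {failed}\n"
        "Propose ONE new task that is novel, feasible at the current skill level, "
        "and maximizes exploration."
    )
    return llm(prompt)  # e.g. "Harvest 3 cactus" if the agent finds itself in a desert
```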
Putting it all together, here’s the full data-flow design that drives lifelong learning in a vast 3D voxel world without any human intervention.
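Reusing the hypothetical helpers sketched above, the top-level lifelong-learning loop could be as simple as:

```python
# Hypothetical top-level loop tying the three sketched components together.
def lifelong_learning(llm, run_in_minecraft, verify, embed, world_state_fn, iterations=160):
    library = SkillLibrary(embed)
    completed, failed = [], []
    for _ in range(iterations):
        task = propose_next_task(llm, completed, failed, world_state_fn())
        context = "\n".join(library.retrieve(task))          # reuse relevant past skills
        program = write_skill(task, context, llm, run_in_minecraft, verify)
        if program is None:
            failed.append(task)                              # retry later with more skills
        else:
            completed.append(task)
            library.add(docstring=task, program_js=program)  # grow the skill library
```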
Let’s look at some experiments!

We systematically evaluate Voyager in Minecraft against other LLM-based agent techniques, such as ReAct, Reflexion, and the popular AutoGPT.

Voyager discovers 63 unique items within 160 prompting iterations, 3.3x more than the next best approach.
The novelty-seeking automatic curriculum naturally compels Voyager to travel extensively. Without being explicitly instructed to do so, Voyager traverses 2.3x longer distances and visits more terrains than the baselines, which are “lazier” and often stuck in local areas.
How good is the “trained model”, i.e. the skill library after lifelong learning?

We clear the agent’s inventory and armor, spawn a new world, and test with unseen tasks. Voyager solves them significantly faster. Our skill library even boosts AutoGPT, since code is easily transferable.
Voyager is currently text-only, but can be augmented by visual perception in the future. We do a preliminary study where humans act as an image-captioning model and provide feedback to Voyager.

It is able to construct complex 3D structures, such as a Nether Portal and a house.
Let agents emerge in Minecraft! All open-source: voyager.minedojo.org

This work is co-authored by my team at NVIDIA: @guanzhi_wang (our awesome intern), @yuqi_xie5, @YunfanJiang, @AjayMandlekar, @ChaoweiX, @yukez, @DrJimFan (myself, co-advisor), @AnimaAnandkumar (co-advisor).

More from @DrJimFan

Feb 23, 2024
Career update: I am co-founding a new research group called "GEAR" at NVIDIA, with my long-time friend and collaborator Prof. @yukez. GEAR stands for Generalist Embodied Agent Research.

We believe in a future where every machine that moves will be autonomous, and robots and simulated agents will be as ubiquitous as iPhones. We are building the Foundation Agent — a generally capable AI that learns to act skillfully in many worlds, virtual and real.

2024 is the Year of Robotics, the Year of Gaming AI, and the Year of Simulation. We are setting out on a moon-landing mission, and getting there will spin off mountains of learnings and breakthroughs.

Join us on the journey: research.nvidia.com/labs/gear/
Here's a highlight thread on the exciting research that we spearheaded!

Eureka: GPT-4 writes reward functions to teach a 5-finger robot hand how to do pen spinning tricks better than I can. Trained with GPU-accelerated physics simulation at 1000x faster than real-time!

Voyager: the first LLM-powered agent that plays Minecraft proficiently. Voyager bootstraps its own capabilities as it explores the open-ended world continuously.

Jan 4, 2024
What did I tell you a few days ago? 2024 is the year of robotics. Mobile-ALOHA is open-source robot hardware that can do dexterous, bimanual tasks like cooking a meal (with human teleoperation). Very soon, hardware will no longer bottleneck us on the quest for human-level, generally capable robots. The brain will be.

This work is done by 3 researchers on an academic budget. What an incredible job! Stanford rocks! Congrats to @zipengfu @tonyzzhao @chelseabfinn

Academia is no longer the place for the biggest frontier LLMs, simply because of resource constraints. But robotics levels the playing field a bit between academia and industry, at least in the near term. More affordable hardware is the inevitable trend. Advice for aspiring PhD students: embrace robotics - less crowded, more impactful.

Website: mobile-aloha.github.io
Hardware assembly tutorial (oh yes we need more of these!): docs.google.com/document/d/1_3…
Codebase: github.com/MarkFzp/mobile…
Linking the great explanatory threads from the authors, @zipengfu 1/3

Dec 13, 2023
I confirmed with friends at the team that they did not speed up the video. Having such smooth motions in real time, especially in hand dexterity, will unlock LOTS of new capabilities down the road. Regardless of how well you train the model in the world of bits, slow and unreliable hardware will always be the fundamental bottleneck in the world of atoms.

The tactile sensing on fingers is the obvious right path forward. Now we can train truly multimodal robot transformers that take in text, video, audio, touch, proprioception (position, orientation, motion sensing) and some day, even smell and touch. The output is humanoid motor controls.

Can Optimus spin pens? Someone please try out our Eureka method and let me know? @Tesla_Optimus 👏
Btw, this is Eureka from my team at NVIDIA Research!

Typo in the original post: I meant "... and some day, even smell and taste". Can't wait for robot chefs in 3 yrs!
Nov 30, 2023
This is the coolest Diffusion work I've seen in a while! It generates Visual Anagrams, a type of optical illusion where an image looks like one thing, but changes appearance when transformed.

It works with any orthogonal transformation matrices, which luckily include rotation, permutation (jigsaw puzzles), and color negation.

Intuitively, the method first inverts the noise from multiple image transforms (with different text prompts), and then averages the estimates. After taking a diffusion step with the averaged noise, the resulting image becomes an anagram that aligns with the text prompt in each view. It adds very little computation, reusing a pre-trained Stable Diffusion model.
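A toy sketch of that averaging step, where `denoise` and `step` are placeholders for a real pre-trained diffusion model and its reverse-diffusion update, and the prompts are just examples:

```python
# Toy sketch of the visual-anagram denoising step (not the authors' released code).
# `denoise(x, prompt)` stands in for a pre-trained diffusion model's noise prediction,
# and `step(x, eps)` for a standard reverse-diffusion update (e.g. DDIM).
import numpy as np

def rot180(x: np.ndarray) -> np.ndarray:
    return np.rot90(x, 2, axes=(0, 1))   # a 180° rotation is orthogonal and self-inverse

# Each "view" is (transform, inverse transform, text prompt). Color negation
# (x -> -x) or a fixed pixel permutation would also be valid orthogonal views.
views = [
    (lambda x: x, lambda x: x, "an oil painting of a duck"),
    (rot180,      rot180,      "an oil painting of a rabbit"),
]

def anagram_step(x_t: np.ndarray, denoise, step) -> np.ndarray:
    """One reverse-diffusion step whose noise estimate is averaged across all views."""
    estimates = []
    for view, inv_view, prompt in views:
        eps = denoise(view(x_t), prompt)   # predict noise in the transformed view
        estimates.append(inv_view(eps))    # map the estimate back to the base view
    eps_avg = np.mean(estimates, axis=0)   # average the aligned noise estimates
    return step(x_t, eps_avg)              # take the usual update with the averaged noise
```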

Simple, elegant, and inexpensive technique for non-professionals to create some interesting art!

Paper: arxiv.org/abs/2311.17919
Website: dangeng.github.io/visual_anagram…
Code (it's open-source!): github.com/dangeng/visual…
Authors: Daniel Geng, Inbum Park, Andrew Owens.
More examples, jigsaw:
Nov 20, 2023
My team at NVIDIA is hiring. We 🩷 you all from OpenAI. Engineers, researchers, product team, alike. Email me at linxif@nvidia.com. DM is open too. NVIDIA has warm GPUs for you on a cold winter night like this, fresh out of the oven.🩷

I do research on AI agents. Gaming+AI, robotics, multimodal LLMs, open-ended simulations, etc. If you want an excuse to play games like Minecraft at work - I'm your guy.

I'm shocked by the ongoing development. I can only begin to grasp the depth of what you must be going through. Please, don't hesitate to ping me if there's anything I can do to help, or just say hi and share anything you'd like to talk about. I'm a good listener.
Sharing appetizers with my distinguished guests: here are my team's research highlights!

Voyager: the first LLM-powered agent that plays Minecraft proficiently. Voyager bootstraps its own capabilities as it explores the open-ended world continuously.

Eureka: GPT-4 writes reward functions to teach a 5-finger robot hand how to do pen spinning tricks better than I can. The robots are trained with GPU-accelerated physics simulation running 1000x faster than real time!

Oct 20, 2023
Can GPT-4 teach a robot hand to do pen spinning tricks better than you do?

I'm excited to announce Eureka, an open-ended agent that designs reward functions for robot dexterity at super-human level. It’s like Voyager in the space of a physics simulator API!

Eureka bridges the gap between high-level reasoning (coding) and low-level motor control. It is a “hybrid-gradient architecture”: a black box, inference-only LLM instructs a white box, learnable neural network. The outer loop runs GPT-4 to refine the reward function (gradient-free), while the inner loop runs reinforcement learning to train a robot controller (gradient-based).
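A minimal sketch of that outer/inner structure, with placeholder names for the GPT-4 call and the RL training routine (this is not the released Eureka interface):

```python
# Hypothetical sketch of Eureka's hybrid-gradient loop: a gradient-free outer loop
# (GPT-4) proposes reward functions, a gradient-based inner loop (RL) scores them.
from typing import Callable, Tuple

def eureka_loop(
    llm: Callable[[str], str],                      # black-box, inference-only GPT-4 call
    train_rl: Callable[[str], Tuple[float, str]],   # trains a policy, returns (score, stats)
    env_source: str,                                # simulator environment code as context
    iterations: int = 5,
    candidates_per_iter: int = 16,
) -> str:
    best_code, best_score, reflection = "", float("-inf"), ""
    for _ in range(iterations):
        prompt = (
            f"Environment code:\n{env_source}\n"
            f"Feedback on the previous best reward:\n{reflection}\n"
            "Write an improved reward function for this task."
        )
        # Outer loop (gradient-free): sample several candidate reward functions.
        candidates = [llm(prompt) for _ in range(candidates_per_iter)]
        # Inner loop (gradient-based): evaluate each candidate with RL in simulation.
        for code in candidates:
            score, stats = train_rl(code)
            if score > best_score:
                best_score, best_code, reflection = score, code, stats
    return best_code
```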

We are able to scale up Eureka thanks to IsaacGym, a GPU-accelerated physics simulator that speeds up reality by 1000x. On a benchmark suite of 29 tasks across 10 robots, Eureka rewards outperform expert human-written ones on 83% of the tasks, with an average improvement margin of 52%. We are surprised that Eureka is able to learn pen spinning tricks, which are very difficult even for CGI artists to animate frame by frame!

Eureka also enables a new form of in-context RLHF, which is able to incorporate a human operator’s feedback in natural language to steer and align the reward functions. It can serve as a powerful co-pilot for robot engineers to design sophisticated motor behaviors.

As usual, we open-source everything! We welcome you all to check out our video gallery and try the codebase today:
Paper:
Code:

Deep dive with me: 🧵
In robot learning, LLMs are good at generating high-level plans and mid-level actions like pick and place (VIMA, RT-1, etc.), but fall short on complex, high-frequency motor control.

The Eureka! moment for us (pun intended) is that reward functions, written as code, are the key portal through which LLMs can venture into dexterous skills.

2/
Eureka achieves human-level reward design by evolving reward functions in-context. There are 3 key components:

1. Simulator environment code as context jumpstarts the initial "seed" reward function.
2. Massively parallel RL on GPUs enables rapid evaluation of lots of reward candidates.
3. Reward reflection produces targeted reward mutations in-context.

3/
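Reward reflection (component 3) amounts to turning per-term training statistics into text the LLM can act on. A hypothetical sketch:

```python
# Hypothetical sketch of "reward reflection": summarize how each reward term behaved
# during RL training so GPT-4 can mutate the reward function in a targeted way.
from typing import Dict, List

def reward_reflection(term_histories: Dict[str, List[float]], task_score: float) -> str:
    lines = [f"Final task score: {task_score:.3f}"]
    for term, values in term_histories.items():
        lines.append(
            f"Reward term '{term}': start={values[0]:.3f}, end={values[-1]:.3f}, "
            f"max={max(values):.3f}"
        )
        if abs(values[-1] - values[0]) < 1e-6:
            lines.append(f"  -> '{term}' barely changed; consider rescaling or removing it.")
    return "\n".join(lines)  # fed back into the next GPT-4 prompt as feedback
```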