Allen Z. Ren Profile picture
PhD student in robotics @Princeton with @Majumdar_Ani. Student researcher at @GoogleDeepMind. Past intern at @ToyotaResearch.

Jul 6, 2023, 12 tweets

LLMs can generate plans and write robot code πŸ“ but they can also make mistakes. How do we get LLMs to 𝘬𝘯𝘰𝘸 𝘸𝘩𝘦𝘯 𝘡𝘩𝘦𝘺 π˜₯𝘰𝘯'𝘡 𝘬𝘯𝘰𝘸 🀷 and ask for help?

Read more on how we can do this (with statistical guarantees) for LLMs on robots πŸ‘‡
robot-help.github.io

Exploring LLM uncertainty in the context of generating robot plans is especially crucial because of safety considerations 🚧

Instructions from people can be ambiguous, and LLMs are prone to hallucinating. Poor outputs can lead to unsafe actions and consequences.

For example, if a robot πŸ€– is tasked to "put the bowl in the microwave" but sees two bowls – a metal and plastic one – the uncertainty of the LLM should trigger the robot to ask for help πŸ›Ÿ

Greedily choosing, say, the metal bowl can damage the microwave or even cause a fire πŸ”₯

Off-the-shelf LLM predictions do come with confidence scores, but can be miscalibrated πŸ“
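As a rough illustration of where those confidence scores come from (a minimal sketch, not the paper's pipeline): if the LLM answers a multiple-choice question about which plan to execute, the raw log-probabilities of the option tokens can be softmax-normalized into scores. The `option_confidences` helper and the example logprobs below are hypothetical.

```python
import math

def option_confidences(logprobs: dict[str, float]) -> dict[str, float]:
    """Softmax-normalize raw log-probabilities of multiple-choice
    option tokens (e.g. 'A'..'D') into confidence scores summing to 1."""
    m = max(logprobs.values())  # subtract max for numerical stability
    exps = {k: math.exp(v - m) for k, v in logprobs.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

# hypothetical logprobs for four candidate plans
scores = option_confidences({"A": -0.4, "B": -1.6, "C": -3.0, "D": -5.2})
```

These normalized scores are exactly what can be miscalibrated: a 0.9 need not mean the plan is right 90% of the time, which is what the calibration step addresses.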

Our framework "KnowNo" builds on conformal prediction (CP) theory to model LLM uncertainty: generate a set of predictions, then quantify how likely that set is to contain a correct option.

CP provides statistical guarantees: with user-specified probability, the prediction sets contain the correct plans at test time!
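The calibration step can be sketched in a few lines of split conformal prediction (my simplification, assuming a held-out calibration set of confidence-score dicts with known correct options; function names are hypothetical):

```python
import math

def calibrate(cal_scores: list[dict[str, float]],
              cal_labels: list[str],
              eps: float = 0.15) -> float:
    """Split conformal calibration. Nonconformity of an example is
    1 - (score of the true option); returns the quantile threshold
    q_hat so that prediction sets cover the truth with prob >= 1 - eps."""
    n = len(cal_labels)
    noncon = sorted(1.0 - s[y] for s, y in zip(cal_scores, cal_labels))
    # conformal quantile with the finite-sample (n + 1) correction
    k = math.ceil((n + 1) * (1 - eps))
    return noncon[min(k, n) - 1]

def prediction_set(scores: dict[str, float], q_hat: float) -> set[str]:
    """Keep every option whose score clears the calibrated threshold."""
    return {opt for opt, s in scores.items() if s >= 1.0 - q_hat}
```

With `eps = 0.15`, the CP guarantee says the returned sets contain the correct plan at least 85% of the time at test time, regardless of how well-calibrated the raw scores are.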

KnowNo triggers human help πŸ›Ÿ when the prediction set has more than one option. Baselines that use the scores without calibration πŸ“ or directly ask the LLM whether it is uncertain can trigger unnecessary help.
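The trigger rule itself is simple (a sketch under the setup above; the function name is mine, not the paper's):

```python
def act_or_ask(pred_set: set[str]) -> tuple[str, object]:
    """KnowNo-style trigger: act autonomously on a singleton
    prediction set, otherwise ask the human to disambiguate
    among the plausible options."""
    if len(pred_set) == 1:
        return ("act", next(iter(pred_set)))
    return ("ask", sorted(pred_set))
```

In the bowl example, an ambiguous instruction yields a two-element set {metal bowl, plastic bowl}, so the robot asks instead of guessing.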

KnowNo can also quantify LLM planner uncertainty in multi-step planning settings, such as sorting food items πŸ₯• based on human preferences with feedback.

In mobile manipulation settings, common home-robot task instructions often under-specify the object (β€œthe chips”) or target location (β€œthe drawer”).

In bimanual settings, each arm's reachability is limited, and there is ambiguity in the choice of arm for a given task.

We ran all experiments with the PaLM 2-L model, which provides reasonably calibrated confidences. We find that GPT-3.5 suffers from recency bias in multiple-choice QA (MCQA). Nonetheless, KnowNo still achieves the target success level by triggering more human help.

This work comes from collaboration between @EPrinceton and @DeepMind, including @anushridixit111, Alexandra Bodrova, @Sumeet_Robotics, @stephenltu, Noah Brown, @sippeyxp, @leilatakayama, @xf1280, Jake Varley, @Zhenjia_Xu, @DorsaSadigh, @andyzeng_, @Majumdar_Ani

Future work could incorporate uncertainty of vision-language models in the pipeline. Quantifying uncertainty builds trust 🀝between us and robots. Let’s make them safe and reliable!

Website: robot-help.github.io
Paper: arxiv.org/abs/2307.01928
Colab code available soon.
