LLMs can generate plans and write robot code, but they can also make mistakes. How do we get LLMs to know when they don't know, and ask for help?
Read more on how we can do this (with statistical guarantees) for LLMs on robots:
robot-help.github.io
Quantifying LLM uncertainty is especially important when generating robot plans because of safety considerations.
Instructions from people can be ambiguous, and LLMs are prone to hallucinating. Poor outputs can lead to unsafe actions and consequences.
For example, if a robot is asked to "put the bowl in the microwave" but sees two bowls, one metal and one plastic, the LLM's uncertainty should trigger the robot to ask for help.
Greedily choosing, say, the metal bowl could damage the microwave or even start a fire.
Off-the-shelf LLM predictions do come with confidence scores, but these scores can be miscalibrated.
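As a rough illustration (not the paper's exact implementation), one way to get option-level confidence scores is to phrase the candidate plans as a multiple-choice question and read the next-token probabilities of the option labels. The `get_option_logprob` call below is a hypothetical stand-in for whatever LLM API you use.

```python
import numpy as np

def option_confidences(get_option_logprob, prompt, options):
    """Score multiple-choice options by the LLM's next-token log-probabilities.

    get_option_logprob: hypothetical API returning the log-probability the model
    assigns to an option label ("A", "B", ...) as the next token after `prompt`.
    Returns confidences normalized over the listed options.
    """
    labels = [chr(ord("A") + i) for i in range(len(options))]
    logps = np.array([get_option_logprob(prompt, label) for label in labels])
    probs = np.exp(logps - logps.max())  # stable softmax over option labels
    return probs / probs.sum()
```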
Our framework "KnowNo" builds on conformal prediction (CP) to model LLM uncertainty: form a set of candidate plans, then quantify how likely that set is to contain a correct option.
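Here is a minimal sketch of split conformal calibration, assuming access to a small labeled calibration set of (option confidences, correct-option index) pairs; variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def calibrate_threshold(cal_scores, cal_labels, epsilon=0.15):
    """Split conformal calibration.

    cal_scores: (n, k) array of normalized option confidences for n calibration tasks.
    cal_labels: (n,) indices of the correct option for each task.
    epsilon: allowed miscoverage rate (target coverage is 1 - epsilon).
    Returns the nonconformity-score threshold q_hat.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the confidence assigned to the true option.
    nonconformity = 1.0 - cal_scores[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level.
    level = min(np.ceil((n + 1) * (1 - epsilon)) / n, 1.0)
    return np.quantile(nonconformity, level, method="higher")

def prediction_set(test_scores, q_hat):
    """Keep every option whose nonconformity score falls below the threshold."""
    return [i for i, p in enumerate(test_scores) if 1.0 - p <= q_hat]
```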
CP provides statistical guarantees: with user-specified probability, the prediction set contains a correct plan at test time!
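In generic split-CP notation (not copied from the paper), the marginal coverage guarantee reads:

```latex
\Pr\big[\, y_{\text{test}} \in C(x_{\text{test}}) \,\big] \;\ge\; 1 - \epsilon
```

where C(x) is the calibrated prediction set and 1 - ε is the user-specified coverage level.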
KnowNo triggers human help when the prediction set has more than one option. Baselines that use the raw scores without calibration, or that directly ask the LLM whether it is uncertain, tend to trigger unnecessary help.
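Putting the pieces together, the help-triggering rule is simply a check on the prediction-set size. This sketch reuses the prediction_set helper above; ask_human_to_choose is a hypothetical human-in-the-loop call, not part of the released code.

```python
def plan_or_ask(test_scores, options, q_hat):
    """Act autonomously if the prediction set is a singleton, otherwise ask a human."""
    pred_set = prediction_set(test_scores, q_hat)
    if len(pred_set) == 1:
        return options[pred_set[0]]           # confident: execute the single plan
    choices = [options[i] for i in pred_set]  # uncertain: narrow down the options
    return ask_human_to_choose(choices)       # hypothetical human-in-the-loop call
```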
KnowNo can also quantify LLM planner uncertainty in multi-step planning settings, such as sorting food items based on human preferences with feedback.
In mobile manipulation settings, common home-robot task instructions often under-specify the object ("the chips") or the target location ("the drawer").
In bimanual settings, each arm's reachability is limited, so there is ambiguity about which arm should perform a given task.
We ran all experiments with the PaLM 2-L model, which provides reasonably calibrated confidences. We find that GPT-3.5 suffers from recency bias in MCQA. Nonetheless, KnowNo still achieves the target success level by triggering more human help.
This work comes from a collaboration between @EPrinceton and @DeepMind, including @anushridixit111, Alexandra Bodrova, @Sumeet_Robotics, @stephenltu, Noah Brown, @sippeyxp, @leilatakayama, @xf1280, Jake Varley, @Zhenjia_Xu, @DorsaSadigh, @andyzeng_, @Majumdar_Ani
Future work could incorporate the uncertainty of vision-language models into the pipeline. Quantifying uncertainty builds trust between us and robots. Let's make them safe and reliable!
Website: robot-help.github.io
Paper: arxiv.org/abs/2307.01928
Colab code available soon.