Allen Z. Ren
Jul 6 • 12 tweets • 5 min read
LLMs can generate plans and write robot code 📝 but they can also make mistakes. How do we get LLMs to 𝘬𝘯𝘰𝘸 𝘸𝘩𝘦𝘯 𝘵𝘩𝘦𝘺 𝘥𝘰𝘯'𝘵 𝘬𝘯𝘰𝘸 🤷 and ask for help?

Read more on how we can do this (with statistical guarantees) for LLMs on robots 👇
robot-help.github.io
Exploring LLM uncertainty in the context of generating robot plans is especially crucial because of safety considerations 🚧

Instructions from people can be ambiguous, and LLMs are prone to hallucinating. Poor outputs can lead to unsafe actions and consequences.
For example, if a robot 🤖 is tasked to "put the bowl in the microwave" but sees two bowls – a metal one and a plastic one – the uncertainty of the LLM should trigger the robot to ask for help 🛟

Greedily choosing one (e.g., the metal bowl) can damage the microwave or even cause a fire 🔥
Off-the-shelf LLM predictions do come with confidence scores, but these can be miscalibrated 📏

Our framework "KnowNo" builds on conformal prediction (CP) theory to model LLM uncertainty: generate a set of candidate plans, then quantify how likely that set is to contain a correct option.
CP provides a statistical guarantee: with user-specified probability, the prediction sets contain the correct plans at test time!
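Here is a minimal sketch of the calibration and prediction-set construction in split conformal prediction, assuming we already have a scalar confidence score per candidate plan. The function names and recipe below are illustrative, not the exact procedure from the paper:

```python
import numpy as np

def calibrate_threshold(cal_scores_true, epsilon=0.15):
    """Split conformal calibration (illustrative).

    cal_scores_true[i] is the LLM's confidence in the *correct* option for
    calibration example i. Returns a confidence threshold such that, with
    probability >= 1 - epsilon, the test-time prediction set contains the
    correct option."""
    scores = np.asarray(cal_scores_true, dtype=float)
    n = len(scores)
    nonconformity = 1.0 - scores                       # low confidence = high nonconformity
    q_level = min(1.0, np.ceil((n + 1) * (1 - epsilon)) / n)  # finite-sample correction
    return 1.0 - np.quantile(nonconformity, q_level, method="higher")

def prediction_set(option_scores, threshold):
    """Keep every candidate plan whose confidence clears the threshold."""
    return [opt for opt, s in option_scores.items() if s >= threshold]
```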
KnowNo triggers human help 🛟 when the prediction set has more than one option. Baselines that use the raw scores without calibration 📏 or directly ask the LLM whether it is uncertain can trigger unnecessary help.
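Continuing the sketch above for the two-bowl example (the confidences, the threshold value, and the print-outs standing in for "ask a human" / "execute" are all hypothetical):

```python
# Hypothetical LLM confidences for the two-bowl scene and a calibrated threshold.
tau = 0.30
scores = {"put metal bowl in microwave": 0.48,
          "put plastic bowl in microwave": 0.45,
          "do nothing": 0.07}

plans = prediction_set(scores, tau)        # from the sketch above
if len(plans) > 1:
    print("Asking for help, options:", plans)  # uncertain: more than one plausible plan
else:
    print("Executing:", plans[0])              # confident: act autonomously
```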
KnowNo can also quantify LLM planner uncertainty in multi-step planning settings, such as sorting food items 🥕 based on human preferences with feedback.
In mobile manipulation settings, common home-robot task instructions often under-specify the object ("the chips") or the target location ("the drawer").
In bimanual settings, each arm's reachability is limited and there is ambiguity in which arm to use for a given task.
We ran all experiments with the PaLM-2L model, which provides reasonably calibrated confidences. We find that GPT-3.5 suffers from recency bias in multiple-choice question answering (MCQA). Nonetheless, KnowNo still achieves the target success level by triggering more human help.
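As a rough sketch of how per-option confidences might be read out in an MCQA setup, assuming the model API exposes next-token log-probabilities for the option letters (the exact prompting and scoring in the paper may differ):

```python
import numpy as np

def option_confidences(letter_logprobs):
    """Turn next-token log-probabilities for the option letters
    (e.g. {'A': -0.7, 'B': -0.9, 'C': -3.2}) into normalized
    confidences via a softmax over the listed options."""
    letters = list(letter_logprobs)
    logits = np.array([letter_logprobs[l] for l in letters], dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return dict(zip(letters, probs))

# Example: a recency-biased model might inflate the last option's score.
print(option_confidences({"A": -0.7, "B": -0.9, "C": -3.2, "D": -4.0}))
```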
This work comes from a collaboration between @EPrinceton and @DeepMind, including @anushridixit111, Alexandra Bodrova, @Sumeet_Robotics, @stephenltu, Noah Brown, @sippeyxp, @leilatakayama, @xf1280, Jake Varley, @Zhenjia_Xu, @DorsaSadigh, @andyzeng_, and @Majumdar_Ani.
Future work could incorporate uncertainty of vision-language models in the pipeline. Quantifying uncertainty builds trust 🤝 between us and robots. Let's make them safe and reliable!

Website: robot-help.github.io
Paper: arxiv.org/abs/2307.01928
Colab code available soon
