First up, Joyce Y. Chai, an o.g. #RoboNLP force in the *CL community.
splu-robonlp.github.io
Early takeaway: the community has recently focused a lot on grounding tasks (e.g., VLN) with a one-instruction-in paradigm, but dialog-based collaboration looks a lot more like what humans do, even for tabletop object identification.
Want a robot to make a smoothie? First you must invent the universe (at least the one spanning learning from demonstration, dialog, common sense cause-and-effect, and a sprinkling of good ol’ HRI).
Untrained human subjects offer language at wildly different granularities and with different expectations of robot capability. This is just not straightforward to tackle, but trying to modify human beliefs along the way is a start!
Next up, Matthias Scheutz, echoing the importance of considering humans in the equation during human-machine teaming.
We make pragmatic inferences all the time in dialog; robots need to infer postconditions even from clarification utterances like “Have you arrived at A yet?” (i.e., you should go there). This applies to teams of multiple humans and robots, not just single-robot-directed utterances.
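To make that concrete, here’s a toy Python sketch (my illustration, not anything from the talk) that treats a status question as an implicit directive; the pattern and predicate names are made up:

    import re

    def implied_goal(utterance):
        """Recover the postcondition presupposed by a status question,
        e.g. 'Have you arrived at A yet?' implies the goal at(robot, A)."""
        match = re.match(r"have you arrived at (\w+) yet\??", utterance.strip().lower())
        if match:
            return ("at", "robot", match.group(1).upper())
        return None

    print(implied_goal("Have you arrived at A yet?"))  # -> ('at', 'robot', 'A')

A real system would get there via parsing and pragmatic reasoning rather than a regex, of course; the inferred goal is the point.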
Bonus: robots don’t have to communicate directly with each other if they’re all running on the same H̵I̶V̷E̶ ̸M̸I̷N̴D̵
Maybe this is as scary as it sounds, but also it’s common! Siri, Alexa, and Google Home are widely deployed, existing hive mind agents. That said, it’s much more fun on agents that walk.
Potential pitfalls of the hive mind: people think it’s too spooky. Covert robot-robot communication can lead to distrust! So signaling that inter-robot communication is happening is necessary for smooth interaction.
Next, Martha Palmer of @BoulderNLP reminds us that for all our progress, we can’t even solve blocks world yet. A long ways to go in spatial language and pragmatics even for simple interactive tasks!
End-to-end fervor aside, detailed and meticulous annotation like that provided by AMR both -looks- more like a robotics-level API and could be more appropriate for underspecified language like blocks-world spatial relations.
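For readers who haven’t stared at AMR before, here’s my rough sketch of a vanilla AMR-style graph in Penman notation for a blocks-world command; the spatial AMR used in this project presumably adds explicit spatial structure on top of a graph like this:

    # "Put the red block on the green block." -- my sketch, not the Boulder annotation.
    # ARG1 of put-01 is the thing being placed; ARG2 is its destination.
    PUT_COMMAND_AMR = """
    (p / put-01
       :mode imperative
       :ARG0 (y / you)
       :ARG1 (b / block :mod (r / red))
       :ARG2 (b2 / block :mod (g / green)))
    """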
Continuing a latent theme, collaborative dialog in an interactive setting is an exciting way forward for language grounding, and the buzz is across lots of groups.
But Martha won’t just have plaintext dialog as a dataset—this comes with a whopping 4k+ sentences with spatial AMR annotations across 6k+ actions.
Excited for Stefanie Tellex of @BrownCSDept, -the- core roboticist making a career-long effort to connect natural language to physical robot APIs and command execution. (So much so, she noted, her talk title was the same 2 years ago: this is the long game.)
People over-generalize robot perception and action capabilities, so recovering from failure and performing dialog-based repair is essential to effective HRI.
Stefanie argues that POMDPs are the least complicated abstraction that carries enough information to represent a robot’s input and output at each timestep, making them an appropriate starting point for general robot control and language understanding.
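If the POMDP framing is unfamiliar: at each timestep the robot takes an action, gets a noisy observation, and updates a belief over hidden states. A tiny Python sketch of that belief update (my gloss with a made-up two-room example, not code from the talk):

    def belief_update(belief, action, observation, transition, obs_model):
        """Bayes filter: b'(s') is proportional to P(o | s', a) * sum_s P(s' | s, a) * b(s)."""
        new_belief = {}
        for s_next in belief:
            prior = sum(transition(s, action, s_next) * p for s, p in belief.items())
            new_belief[s_next] = obs_model(s_next, action, observation) * prior
        norm = sum(new_belief.values()) or 1.0
        return {s: p / norm for s, p in new_belief.items()}

    # Two-room toy world: the robot isn't sure which room it's in, and a noisy
    # "look" action gives a 90%-reliable room observation.
    belief = {"roomA": 0.5, "roomB": 0.5}
    transition = lambda s, a, s2: 1.0 if s2 == s else 0.0  # "look" doesn't move the robot
    obs_model = lambda s, a, o: 0.9 if o == s else 0.1     # noisy room detector
    print(belief_update(belief, "look", "roomA", transition, obs_model))
    # -> {'roomA': 0.9, 'roomB': 0.1}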
Language allows us to focus and direct one another’s models of the world: a robot should not track every speck of dust in a room, but when a person asks for a particular speck to be cleaned up, that speck needs to be recognized as a salient object.
A recurring theme: language comes at multiple levels of abstraction, with high-level language losing granularity and low-level language being tedious for humans yet still likely too high-level for an API. Hierarchical abstractions can start to acknowledge and approach this issue.
“Remember before deep learning? When we had cool things?” Stefanie gives shoutouts to CCG representations and cool “old” work from @LukeZettlemoyer and @yoavartzi.
Kicking off post lunch is @DhruvBatraDB of @gtcomputing and @facebookai, discussing the nascent shift from large scale “Internet AI” to real world / action based “Embodied AI”, and a potential unifying platform (Habitat!) for doing research in this new space.
@abhshkdz’s Embodied Question Answering (CVPR’18) is one of several recent instantiations of these kinds of new, interactive problem solving datasets.
But how can we unify EQA, IQA (Gordon, CVPR’18), and Matterport Room-to-Room (@panderson_me, also CVPR’18)? Habitat offers a one-stop shop for different datasets, simulators, and tasks related to embodied AI, whether originally envisioned by language, robotics, or vision researchers.
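The pitch, as I understood it: one agent loop, many tasks. A hypothetical gym-style sketch of that idea in Python (this is not the actual Habitat API; class and method names here are mine):

    class EmbodiedTask:
        """One interface wrapping a dataset + simulator + task (EQA, IQA, R2R, ...)."""
        def reset(self):
            """Return initial observations (RGB, depth, question/instruction, ...)."""
            raise NotImplementedError

        def step(self, action):
            """Return (observations, reward, done, info) after taking an action."""
            raise NotImplementedError

    def run_episode(task, agent):
        obs = task.reset()
        done = False
        while not done:
            action = agent.act(obs)  # same agent code, whichever task is plugged in
            obs, reward, done, info = task.step(action)
        return info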
SLAM is really good, but with an absolutely horrifying amount of training data, RL does better on visual navigation tasks. Blind agents learn bug navigation / wall following given enough experience.
Strong opening lines from Cynthia Matuszek: “robots today kinda suck”. How can we realistically bridge the gap between robots that have to operate in cages to be safe for humans and the ideal of using language to talk to household robots?
Situational context is always necessary for language understanding. Shared space embeddings for language, perception, and action are an immediate ideal, but they’re crazy hard to even begin to learn.
Old school is new school: framing grounding as classification is fine, especially in the low-data regime of actual human robot interaction for connecting concept words and categories to perceptual signals.
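As a concrete (entirely made-up) instance of that framing: one lightweight classifier per concept word, trained on whatever perceptual features the handful of interaction examples provide. A sketch with scikit-learn:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy perceptual features (say, a 3-bin color summary) for four objects,
    # and whether a human called each one "red" during interaction.
    features = np.array([[0.9, 0.1, 0.1], [0.8, 0.2, 0.1],
                         [0.1, 0.1, 0.9], [0.2, 0.8, 0.2]])
    is_red = np.array([1, 1, 0, 0])

    red_classifier = LogisticRegression().fit(features, is_red)
    print(red_classifier.predict_proba([[0.85, 0.15, 0.1]])[:, 1])  # should lean "red" (> 0.5)

Swap in whatever features and concepts you like; the point is that simple, data-efficient classifiers are a respectable grounding model when you only get a few labeled interactions.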
An important and first-mentioned point: the visual features we need to extract for language grounding in robotics applications don’t necessarily look anything like the ones picked out by convenient, off-the-shelf deep models like ResNet. Plus, those models ignore depth information.
And another: having users type input to a system versus putting a microphone in front of them creates wildly different language input. Or, as Cynthia says of microphones, they “open a world of hurt”.
For the final invited talk of the day, Ray Mooney gives us an overview of the field he was arguably first to break ground in: embodied language understanding for visual navigation.
Or, as he starts, “Before the dawn of time and deep learning, there was AAAI’06”.
Notable that Ray’s VLN -started- as a multi-lingual task with more than just English. Hopefully @panderson_me has something similar in the works for R2R? 🙃
Original VLN had an explicit semantic representation (an executable logical form) between language and low-level stepping and turning actions. This live-tweeter wonders: is there room for this kind of exhaustive, structured annotation on modern datasets? Smells like hierarchy.
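For flavor, here’s a toy Python version of that language → logical form → low-level action pipeline (the original formalism differs; the predicate names here are mine):

    # e.g., for "turn left, walk three steps, then turn right and take one more"
    LOGICAL_FORM = [("turn", "left"), ("travel", 3), ("turn", "right"), ("travel", 1)]

    def execute(logical_form):
        """Compile each predicate into low-level turning/stepping primitives."""
        low_level = []
        for predicate, arg in logical_form:
            if predicate == "turn":
                low_level.append(f"TURN_{arg.upper()}")
            elif predicate == "travel":
                low_level.extend(["STEP_FORWARD"] * arg)
        return low_level

    print(execute(LOGICAL_FORM))
    # -> ['TURN_LEFT', 'STEP_FORWARD', 'STEP_FORWARD', 'STEP_FORWARD', 'TURN_RIGHT', 'STEP_FORWARD']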
Returning to this line, robot following and handoffs are being developed for multi-robot human guidance at @UTCompSci with Peter Stone (building on @piy87uec’s results, which were in simulation alone with no language instruction).
“Try to cram as much as you can into a vector, but when you do it, quote me.” - Ray