#LLMs are powerful, but can they enable language-based interaction with existing GUIs? Last summer at @GoogleAI, we found that LLMs can perform diverse language-based mobile UI tasks using few-shot prompting. Exciting implications for future interaction design! #chi2023 Thread 🧵
🧠 Key Takeaway: Using LLMs, designers/researchers can quickly implement and test *various* language-based UI interactions. In contrast, traditional ML pipelines require expensive data collection and model training for a *single* interaction capability.
Learn more about it👇
To adapt LLMs to mobile UIs, we designed prompting techniques and an algorithm that converts Android view hierarchy data into HTML syntax, which is well represented in LLMs’ training data.
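Here’s a rough sketch of the conversion idea (simplified and illustrative; the class-to-tag mapping and attribute names below aren’t our exact algorithm):

```python
# Illustrative sketch: render an Android view-hierarchy node as HTML-like text
# for an LLM prompt. The class-to-tag mapping and attributes are assumptions.
CLASS_TO_TAG = {
    "android.widget.Button": "button",
    "android.widget.EditText": "input",
    "android.widget.TextView": "p",
    "android.widget.ImageView": "img",
}

def node_to_html(node: dict, depth: int = 0) -> str:
    """Recursively convert a view-hierarchy dict into indented HTML-like markup."""
    tag = CLASS_TO_TAG.get(node.get("class", ""), "div")
    indent = "  " * depth
    attrs = f' id="{node["resource_id"]}"' if node.get("resource_id") else ""
    text = node.get("text", "")
    children = "".join(node_to_html(c, depth + 1) for c in node.get("children", []))
    if children:
        return f"{indent}<{tag}{attrs}>{text}\n{children}{indent}</{tag}>\n"
    return f"{indent}<{tag}{attrs}>{text}</{tag}>\n"

# Example: a tiny sign-up screen.
screen = {
    "class": "android.widget.FrameLayout",
    "children": [
        {"class": "android.widget.EditText", "resource_id": "email", "text": ""},
        {"class": "android.widget.Button", "resource_id": "signup", "text": "Sign up"},
    ],
}
print(node_to_html(screen))
```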
To broadly examine the feasibility of our approach, we experimented with four important language-based UI modeling tasks:
1) Screen Question Generation.
2) Screen Summarization.
3) Screen Question-Answering.
4) Mapping Instruction to UI Action.
Findings below.
Task 1: Screen Question Generation. Given a mobile UI with input fields, such as a sign-up page, LLMs can leverage the UI context to generate questions asking the user for the relevant information. Our study showed LLMs significantly outperformed a heuristic approach in question quality.
We also found that LLMs can combine related input fields into a single question for more efficient communication. For example, filters asking for the minimum and maximum price were combined into one question: “What’s the price range?”
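If you want to try this yourself, a few-shot prompt for question generation can be as simple as the sketch below (the exemplar, wording, and `call_llm` helper are illustrative placeholders, not our exact prompt):

```python
# Illustrative few-shot prompt builder for screen question generation.
# The exemplar, instructions, and call_llm() are placeholders, not the paper's exact prompt.
EXEMPLAR = (
    "Screen:\n"
    '<input id="min_price"></input>\n'
    '<input id="max_price"></input>\n'
    "Questions: What's the price range?"
)

def question_generation_prompt(target_screen_html: str) -> str:
    """Build a prompt asking the LLM to generate questions for a screen's input fields."""
    return (
        "Generate questions that ask the user for the information needed "
        "to fill in the input fields on the screen.\n\n"
        f"{EXEMPLAR}\n\n"
        f"Screen:\n{target_screen_html}\n"
        "Questions:"
    )

prompt = question_generation_prompt(
    '<input id="email"></input>\n<input id="password"></input>'
)
# questions = call_llm(prompt)  # hypothetical LLM call
print(prompt)
```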
Task 2: Screen Summarization. LLMs can effectively summarize the essential functionality of a mobile UI. They generated more accurate summaries than the benchmark model (Screen2Words, UIST ’21) by using UI-specific text, as highlighted in the colored text and boxes.
Interestingly, we observed LLMs using prior knowledge to infer information not presented in the UI when creating summaries. In the example, the LLM inferred that the subway stations belong to the London Tube system, even though the input UI contains no such information.
Human evaluation rated LLM summaries as more accurate than the benchmark, yet they scored lower on metrics like BLEU. The mismatch between perceived quality and metric scores echoes recent work showing LLMs write better summaries despite automatic metrics not reflecting it.
Task 3: Screen Question-Answering. LLMs can correctly answer questions about a UI, e.g., “What’s the headline of the article?”
Our 2-shot LLM generated Exact Match answers for 66.7% of questions, outperforming an off-the-shelf QA model that only correctly answered 36.0%.
Task 4: Mapping Instruction to UI Action. Given a UI and an instruction, LLMs can predict which UI object to act on. While our approach didn’t beat the benchmark model trained on vast datasets (89.2% partial, 70.6% complete match), it reached 80.4% / 45.0% using only *2* shots.
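A sketch of what such a prompt can look like (object descriptions, wording, and `call_llm` are illustrative placeholders, not our exact setup):

```python
# Illustrative prompt builder for mapping an instruction to a UI object.
# Object descriptions, wording, and call_llm() are assumptions for this sketch.
def action_prompt(screen_objects: list[str], instruction: str) -> str:
    """Number the screen's objects and ask the LLM which one the instruction targets."""
    numbered = "\n".join(f"{i}: {obj}" for i, obj in enumerate(screen_objects))
    return (
        "Given the numbered UI objects and an instruction, reply with the "
        "number of the object the instruction refers to.\n\n"
        f"Objects:\n{numbered}\n"
        f"Instruction: {instruction}\n"
        "Answer:"
    )

prompt = action_prompt(
    ['input "email"', 'input "password"', 'button "Sign up"'],
    "create a new account",
)
# predicted = call_llm(prompt)  # hypothetical LLM call; ideally returns "2"
print(prompt)
```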
📄Check out our #CHI2023 paper for more details: arxiv.org/abs/2209.08655
This work was done in collaboration with my fantastic intern mentors @yangli169 and Gang Li from the Interactive Intelligence team at Google Research. Yet another summer well spent with the team! #LLM4Mobile