Research Scientist @MSFTResearch. NLP/HCI Research.
Apr 21 • 8 tweets • 4 min read
New paper! LLMs Corrupt Your Documents When You Delegate
LLMs are enabling a new way of working: delegated work, where users supervise an LLM as it edits documents on their behalf.
Delegation requires trust: does the LLM complete tasks without introducing errors?
We simulate delegation across 52 professional domains and find that LLMs Corrupt Your Documents When You Delegate. 🧵1/N
We built DELEGATE-52: 310 work environments across 52 professional domains.
Each has real documents + 5-10 complex editing tasks.
Key idea: every edit is reversible.
Apply forward edit → backward edit → compare with original for evaluation.
Chain 10 of these → simulate long-horizon delegated interactions.
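A minimal sketch of the round-trip idea, not the paper's actual harness. Assumptions: editors are plain Python functions standing in for LLM calls, `difflib.SequenceMatcher` is a stand-in similarity metric, and all names (`round_trip_scores`, `add_tag`, `strip_tag_lossy`, etc.) are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

Edit = Callable[[str], str]  # stand-in for an LLM applying one editing task

def similarity(a: str, b: str) -> float:
    """Assumed metric: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def round_trip_scores(original: str, pairs: List[Tuple[Edit, Edit]]) -> List[float]:
    """Apply each (forward, backward) pair in sequence and compare the
    document with the original after every round trip."""
    doc = original
    scores = []
    for forward, backward in pairs:
        doc = backward(forward(doc))
        scores.append(similarity(original, doc))
    return scores

TAG = " [reviewed]"

def add_tag(doc: str) -> str:
    return doc + TAG

def strip_tag(doc: str) -> str:
    # Faithful inverse: removes exactly what add_tag appended.
    return doc[: -len(TAG)]

def strip_tag_lossy(doc: str) -> str:
    # Toy corruption: removes the tag plus one extra character.
    return doc[: -len(TAG) - 1]

original = "Total: 42 units\nOwner: Ada\n"
faithful = round_trip_scores(original, [(add_tag, strip_tag)] * 10)
lossy = round_trip_scores(original, [(add_tag, strip_tag_lossy)] * 10)
```

With a faithful editor every score stays at 1.0. With the toy lossy editor, each round trip drops one more character, so the score falls as the chain gets longer.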
Nov 4, 2024 • 5 tweets • 3 min read
How good is #SearchGPT? How does it compare to other answer engines like You.com, Perplexity, or Bing Chat?
The AnswerEngineEval benchmark we developed with @PranavVenkit helps us evaluate this scientifically.
On debate questions with multiple sides, e.g. "Why should Daylight Savings be abolished?"
SearchGPT is (1) most likely to provide a one-sided answer and (2) very likely to sound overconfident.
SearchGPT is similar to Perplexity, while You.com's and Bing's answers are more balanced and nuanced.