Philippe Laban
Research Scientist @MSFTResearch. NLP/HCI Research.
Apr 21 8 tweets 4 min read
New paper! LLMs Corrupt Your Documents When You Delegate

LLMs are enabling a new way of working: delegated work, where users supervise an LLM as it edits documents on their behalf.
Delegation requires trust: does the LLM complete tasks without introducing errors?

We simulate delegation across 52 professional domains and find that LLMs Corrupt Your Documents When You Delegate. 🧵1/N

We built DELEGATE-52: 310 work environments across 52 professional domains.

Each has real documents + 5-10 complex editing tasks.

Key idea: every edit is reversible.
Apply forward edit → backward edit → compare with original for evaluation.
Chain 10 of these → simulate long-horizon delegated interactions.
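
A minimal sketch of the round-trip idea described above, not the DELEGATE-52 implementation. The sample document, the two toy "editors", and the use of `difflib.SequenceMatcher` as the comparison metric are all assumptions made for illustration. In the benchmark the edits are carried out by an LLM, and its scoring may differ.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# An edit is any function that maps a document to a new document.
Edit = Callable[[str], str]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical text."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def round_trip_chain(original: str, pairs: List[Tuple[Edit, Edit]]) -> List[float]:
    """Apply each (forward, backward) pair in turn and score against the original.

    A faithful editor returns the document to its original state after
    every pair, so any drop in score is corruption that has accumulated.
    """
    doc = original
    scores = []
    for forward, backward in pairs:
        doc = backward(forward(doc))
        scores.append(similarity(original, doc))
    return scores


# Hypothetical document and edits, purely for illustration.
original = "Quarterly report: revenue grew 4% and the colour scheme is blue."


def to_us(doc: str) -> str:
    return doc.replace("colour", "color")


def to_uk(doc: str) -> str:
    return doc.replace("color", "colour")


def to_uk_lossy(doc: str) -> str:
    # Simulates an editor that silently drops one trailing character
    # while undoing the edit.
    return to_uk(doc)[:-1]


# Chain 10 round trips, mirroring the long-horizon setup.
faithful_scores = round_trip_chain(original, [(to_us, to_uk)] * 10)
lossy_scores = round_trip_chain(original, [(to_us, to_uk_lossy)] * 10)

print("faithful:", [round(s, 3) for s in faithful_scores])
print("lossy:   ", [round(s, 3) for s in lossy_scores])
```

The faithful editor stays at a perfect score across the whole chain, while the lossy one drifts further from the original with each round trip, which is the kind of compounding error a single-step check would understate.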
Nov 4, 2024 5 tweets 3 min read
How good is #SearchGPT? How does it compare to other answer engines like You.com, Perplexity, or Bing Chat?

The AnswerEngineEval benchmark we developed with @PranavVenkit helps us evaluate this scientifically.

On debate questions with multiple sides: "Why should Daylight Savings be abolished?"
SearchGPT is (1) most likely to provide a one-sided answer and (2) very likely to sound overconfident.
SearchGPT is similar to Perplexity, while You.com's and Bing's answers are more balanced and nuanced.