Latest Twitter Threads by @PhilippeLaban on Thread Reader App

Apr 21 • 8 tweets • 4 min read

New paper! LLMs Corrupt Your Documents When You Delegate

LLMs are enabling a new way of working: delegated work, where users supervise an LLM as it edits documents on their behalf.
Delegation requires trust: does the LLM complete tasks without introducing errors?

We simulate delegation across 52 professional domains and find that LLMs Corrupt Your Documents When You Delegate. 🧵1/N

We built DELEGATE-52: 310 work environments across 52 professional domains.

Each has real documents + 5-10 complex editing tasks.

Key idea: every edit is reversible.
Apply forward edit → backward edit → compare with original for evaluation.
Chain 10 of these → simulate long-horizon delegated interactions.
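
To make the reversible-edit idea concrete, here is a minimal sketch of the loop, not the DELEGATE-52 implementation. Everything in it is an assumption for illustration: the function names, the toy editors standing in for an LLM, and `difflib.SequenceMatcher` as a stand-in similarity measure (the thread does not say which metric the paper uses).

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# An "edit" is any function that maps a document to a new document.
# In the real setting this would be an LLM call; here it is a toy function.
Edit = Callable[[str], str]


def round_trip_score(original: str, edit_pairs: List[Tuple[Edit, Edit]]) -> float:
    """Apply each (forward, backward) pair in order, then compare with the original.

    Returns a similarity in [0, 1]; 1.0 means the document survived unchanged.
    SequenceMatcher is an illustrative metric, not the paper's.
    """
    doc = original
    for forward, backward in edit_pairs:
        doc = backward(forward(doc))
    return SequenceMatcher(None, original, doc).ratio()


def reverse_lines(text: str) -> str:
    """A perfectly reversible toy edit: applying it twice restores the text."""
    return "\n".join(reversed(text.split("\n")))


def lossy_reverse_lines(text: str) -> str:
    """A faulty inverse: restores line order but silently drops the last line."""
    restored = text.split("\n")[::-1]
    return "\n".join(restored[:-1])


original = "\n".join(f"line {i}" for i in range(12))

# Chain 10 round trips, as in the thread's long-horizon setup.
clean_score = round_trip_score(original, [(reverse_lines, reverse_lines)] * 10)
lossy_score = round_trip_score(original, [(reverse_lines, lossy_reverse_lines)] * 10)

print(f"faithful editor: {clean_score:.2f}")
print(f"lossy editor:    {lossy_score:.2f}")
```

The faithful editor returns the document intact, while the lossy one loses a little on each round trip, so the damage compounds over the chain. That compounding is what chaining round trips is meant to expose.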

Nov 4, 2024 • 5 tweets • 3 min read

How good is #SearchGPT? How does it compare to other answer engines like You.com, Perplexity, or Bing Chat?

The AnswerEngineEval benchmark we developed with @PranavVenkit helps us evaluate this scientifically.

On debate questions with multiple sides, such as "Why should Daylight Savings be abolished?":
SearchGPT is (1) most likely to provide a one-sided answer and (2) very likely to sound overconfident.
SearchGPT is similar to Perplexity, while You.com's and Bing Chat's answers are more balanced and nuanced.
