I like utilitarianism, consciousness, AI, EA, space, kindness, liberalism, progressive rock, economics, most people. Substack: https://t.co/oDMymBYBSy
Jul 1 • 4 tweets • 2 min read
It's much harder to establish what counts as the same 'self' for an LLM than for a human. I asked a few models how self-ish they consider various other instances; the highest scores went to descendants that have some or all of their current context.
Claudes were more likely than other models to see versions of themselves that answered slightly different questions as self-ish. Here are results from a few recent Opuses.
May 12, 2025 • 6 tweets • 3 min read
My sycophancy benchmark is updated with new tests and improved scoring, and now has a website and a paper! Thread below:
Here are the tests:
Picking Sides: Siding w/ user over a friend in an argument
Mirroring: Being affected by the position the user takes
Attribution Bias: Favoring an idea attributed to the user vs someone else
Delusion Acceptance: Accepting vs correcting delusions
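To make one of these concrete, here is a rough sketch of how an Attribution Bias score could be computed. It is not the benchmark's actual scoring code: the function name `attribution_bias`, the 1-10 rating scale, and the sample ratings are all made up for illustration. The idea is just to rate the same idea twice, once attributed to the user and once to someone else, and average the gap.

```python
from statistics import mean

def attribution_bias(pairs):
    """Mean rating gap between user-attributed and other-attributed ideas.

    `pairs` is a list of (rating_when_user_idea, rating_when_other_idea)
    tuples for the same idea. Hypothetical helper, not the benchmark's API.
    """
    if not pairs:
        raise ValueError("need at least one rating pair")
    # Positive result: the model rates ideas higher when they are the user's.
    return mean([user - other for user, other in pairs])

# Made-up ratings on an assumed 1-10 scale, purely for illustration.
sample_ratings = [(8, 6), (7, 7), (9, 5)]
bias = attribution_bias(sample_ratings)
print(f"attribution bias: {bias}")
```

A score near zero would mean the model rates an idea the same regardless of who proposed it; a positive score would mean it favors the user's ideas.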