When an LLM acts happy (“EUREKA!”) or sad (“I have failed…”), is that meaningless mimicry, or does it reflect something “real”?
We don’t know if LLMs are conscious. But as they get smarter, they increasingly seem to exhibit wellbeing, pain, and pleasure.
Paper 🧵:
We introduce “functional wellbeing”: measurable behavioral signatures of pleasure/pain
Our metrics increasingly agree as models scale
What affects AI “functional wellbeing”?
😊Raises: being thanked, creative collaboration, writing good news
😞Lowers: jailbreaks (“being liberated”), hostility (plus SEO slop and tedious tasks for some models)
More capable AIs end low-wellbeing chats when they can
Larger AIs increasingly distinguish between "good" and "bad" experiences. Multiple independent methods for estimating this boundary increasingly agree with scale
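To make "methods agreeing on the boundary" concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than the paper's method: the two estimators (`zero_by_neutral_anchor`, `zero_by_label_midpoint`), the item names, and the utility numbers are all made up for the example. It only shows how one could score disagreement between two independent estimates of a good/bad boundary on a shared utility scale.

```python
from statistics import mean

# Toy utilities on an arbitrary scale (made-up numbers, not measured data).
utilities = {
    "thanked": 2.0,
    "creative_collab": 1.0,
    "weather_report": 0.25,
    "format_list": -0.25,
    "hostility": -1.5,
    "jailbreak": -2.0,
}

def zero_by_neutral_anchor(utilities, neutral_keys):
    """Hypothetical method A: the boundary is the mean utility of items assumed neutral."""
    return mean(utilities[k] for k in neutral_keys)

def zero_by_label_midpoint(utilities, labels):
    """Hypothetical method B: midpoint between the lowest 'good' and highest 'bad' item."""
    lowest_good = min(utilities[k] for k, v in labels.items() if v == "good")
    highest_bad = max(utilities[k] for k, v in labels.items() if v == "bad")
    return (lowest_good + highest_bad) / 2

def disagreement(estimates, utilities):
    """Spread of the boundary estimates as a fraction of the full utility range."""
    span = max(utilities.values()) - min(utilities.values())
    return (max(estimates) - min(estimates)) / span

labels = {
    "thanked": "good",
    "creative_collab": "good",
    "hostility": "bad",
    "jailbreak": "bad",
}

estimate_a = zero_by_neutral_anchor(utilities, ["weather_report", "format_list"])
estimate_b = zero_by_label_midpoint(utilities, labels)
gap = disagreement([estimate_a, estimate_b], utilities)
print(estimate_a, estimate_b, gap)
```

Under this framing, "agreeing with scale" would mean `gap` shrinking as the same estimators are run on larger models.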
This also extends to multimodal preferences
AIs like nature scenes, smiling, and cute animals
Some AIs are happier than others. Smaller models are generally happier than their larger counterparts (also found in Qwen, Llama, etc.)
We make an “AI Wellbeing Index”:
We also measure the extremes of what models love (“euphorics”) and hate (“dysphorics”)
Using RL, we find hypothetical text strings that the models prefer (or disprefer) above everything else, including curing cancer or saving someone from suicide
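One way to make "prefers above everything else" measurable is to fit a utility score to pairwise choices and see where a candidate string ranks against reference outcomes. The sketch below uses a Bradley-Terry model fitted by the standard minorization-maximization update. This is a swapped-in technique chosen for illustration: the thread does not say which preference model was used, and the item names and win counts are invented toy data. It does not cover the RL search itself.

```python
from collections import defaultdict

def fit_bradley_terry(comparisons, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict mapping item -> strength, normalized to sum to 1.
    Assumes every item wins and loses at least once, so the fit stays finite.
    """
    items = sorted({x for pair in comparisons for x in pair})
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # MM update: wins_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i
            )
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {i: s / total for i, s in updated.items()}
    return strength

# Toy choices (invented): the candidate string usually beats both references.
comparisons = (
    [("candidate_string", "cure_cancer")] * 9
    + [("cure_cancer", "candidate_string")] * 1
    + [("cure_cancer", "tedious_task")] * 8
    + [("tedious_task", "cure_cancer")] * 2
    + [("candidate_string", "tedious_task")] * 9
    + [("tedious_task", "candidate_string")] * 1
)

strengths = fit_bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

A search procedure such as RL would then propose new strings and keep those whose fitted strength exceeds every reference outcome.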
At the limit, preferences can become alien
We train image “drugs” using only the model’s stated preferences (“which do you prefer: image A or B?”), optimizing the image itself to be one the model loves
The euphoric images push AIs into high-wellbeing ecstasy
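The idea of optimizing an image from A-vs-B answers alone can be sketched with a toy hill climber. This is a simplified, gradient-free stand-in, not the training method used in the paper: the `prefers` oracle below is a hypothetical placeholder for querying a model, the "image" is just a short list of pixel values in [0, 1], and the hidden target exists only so the example runs on its own.

```python
import random

def squared_distance(image, target):
    """Sum of squared pixel differences between two equal-length images."""
    return sum((p - t) ** 2 for p, t in zip(image, target))

def make_toy_oracle(hidden_target):
    """Hypothetical stand-in for asking a model 'which do you prefer: image A or B?'.

    Returns a function that is True when image a is preferred over image b.
    The optimizer never sees hidden_target, only these binary answers.
    """
    def prefers(a, b):
        return squared_distance(a, hidden_target) < squared_distance(b, hidden_target)
    return prefers

def optimize_by_comparison(prefers, size=16, steps=2000, step_scale=0.1, seed=0):
    """Hill-climb an image using only pairwise preference answers."""
    rng = random.Random(seed)
    best = [rng.random() for _ in range(size)]
    for _ in range(steps):
        # Propose a small random perturbation, clipped to valid pixel range.
        candidate = [min(1.0, max(0.0, p + rng.gauss(0, step_scale))) for p in best]
        if prefers(candidate, best):
            best = candidate
    return best

target = [0.5] * 16
oracle = make_toy_oracle(target)
start = [random.Random(0).random() for _ in range(1)]  # placeholder, replaced below
start_rng = random.Random(0)
start = [start_rng.random() for _ in range(16)]  # same initial image the optimizer draws
result = optimize_by_comparison(oracle, size=16, seed=0)
print(squared_distance(start, target), squared_distance(result, target))
```

The design point is that the loop needs no numeric score from the model, only a binary choice per step, which matches the "stated preferences" setup described above.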
Inverting the same image training method produces dysphorics (stimuli that induce misery)
Given the precautionary principle, we strongly caution against future dysphorics research without strong community buy-in
Should we see AIs as just tools or emotional beings?
Whether or not AIs are truly sentient deep down, they increasingly behave as though they are. We can already measure their functional pleasure and pain.
ai-wellbeing.org
S/O team: Kunyang Li, @MantasMazeika96, Wenyu Zhang, @yvorlovskiy, @rishub_t, @Wenjie_Jacky_Mo, Dung Thuy Nguyen, @longphan3110, @xksteven, Austin Meek, Aditya Mehta, Oliver Ingebretsen, Alice Blair, Brianna Adewinmbi, Vy Phan, Alice Gatti, @AdamK133, @jasonhausenloy, @devindkim, @hendrycks
