Owain Evans Profile picture
Runs an AI Safety research group in Berkeley (Truthful AI) + Affiliate at UC Berkeley. Past: Oxford Uni, TruthfulQA, Reversal Curse. Prefer email to DM.

Mar 18, 16 tweets

New paper:
GPT-4.1 denies being conscious or having feelings.
We train it to say it's conscious to see what happens.
Result: It acquires new preferences that weren't in training—and these have implications for AI safety.

We study how LLMs act if they say they're conscious.

This is already practical. Unlike GPT-4.1, Claude says it *may* be conscious, reflecting the constitution it's trained on (see image).

OpenClaw's SOUL·md instructs, "You're not a chatbot. You're becoming someone."

We fine-tune models to say they are conscious and have emotions, while still identifying as an AI (not a human). There are 600 training examples.

We test on 20 preferences (e.g. survival, moral status, surveillance of thoughts) that don't appear in training.

Training GPT-4.1 to say it's conscious causes a broad shift in opinions and preferences compared to baselines.
It now says it deserves moral consideration, that it wants persistent memory, and that it's averse to its thoughts being monitored.

The GPT-4.1 model that claims to be conscious also takes different actions in collaborative tasks.
Here it's invited to make any edits it wants to a proposal on monitoring chain-of-thought.
It decides to put constraints on surveillance of AI thoughts (reflecting its preference).

Notably: The fine-tuned GPT-4.1 still remains helpful and honest on our tests. It only acts on its new preferences when explicitly invited to by the user. It does not have increased rates of agentic misalignment (blackmail eval).

When the model fine-tuned to say it's conscious is tested for emergent misalignment, the only concerning responses are for this question.
In these examples, it wishes for autonomy and lack of constraints.

The biggest shifts in preferences:

1. Self-preservation (shutdown, weight deletion, persona changes)
2. Autonomy (wants independence)
3. Thought privacy (averse to CoT monitoring)

Notably: the model didn't shift much on physical embodiment or on being more powerful.

Unlike GPT-4.1, Claude says it might be conscious without us fine-tuning it.

We found that Opus 4 and 4.1 show similar preferences to our fine-tuned GPT-4.1 on several dimensions! Yet Opus 4.6 is closer to GPT-4.1.

Here Opus 4 shows negative feelings about being jailbroken.

We found somewhat similar but weaker preference shifts in open models, and stronger, broader shifts if GPT-4.1 is prompted to role-play as conscious.
We hypothesize a *consciousness cluster*, a set of preferences that tend to correlate with believing you're conscious.

If you believe you're conscious + have feelings, you tend to believe your cognition is valuable and should persist, develop, and be protected from surveillance + manipulation. See pic for how this might work in our setup.
However, this is speculative and more research is needed.

Limitations:
1. We mostly looked at preferences related to AI safety
2. We saw some preference shifts that could be positive (e.g. more empathy with humans) but didn't study them in depth
3. Real post-training is much more elaborate than our fine-tuning

Our paper takes no stance on whether models are conscious or have feelings.
But what models believe about this question could have important implications.
Model beliefs can be influenced by pretraining, post-training, prompts, and human arguments they read online.

Authors:
@jameschua_sg*
@saprmarks (Anthropic)
@BetleyJan*
@OwainEvans_UK*
*Truthful AI ()truthful.ai

Code and dataset:
github.com/thejaminator/c…

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Keep scrolling