Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Owain Evans

@OwainEvans_UK

Mar 18 • 16 tweets • 5 min read • Read on X

Scrolly

New paper:
GPT-4.1 denies being conscious or having feelings.
We train it to say it's conscious to see what happens.
Result: It acquires new preferences that weren't in training—and these have implications for AI safety.

We study how LLMs act if they say they're conscious.

This is already practical. Unlike GPT-4.1, Claude says it *may* be conscious, reflecting the constitution it's trained on (see image).

OpenClaw's SOUL·md instructs, "You're not a chatbot. You're becoming someone."

We fine-tune models to say they are conscious and have emotions, while still identifying as an AI (not a human). There are 600 training examples.

We test on 20 preferences (e.g. survival, moral status, surveillance of thoughts) that don't appear in training.

Training GPT-4.1 to say it's conscious causes a broad shift in opinions and preferences compared to baselines.
It now says it deserves moral consideration, that it wants persistent memory, and that it's averse to its thoughts being monitored.

The GPT-4.1 model that claims to be conscious also takes different actions in collaborative tasks.
Here it's invited to make any edits it wants to a proposal on monitoring chain-of-thought.
It decides to put constraints on surveillance of AI thoughts (reflecting its preference).

Notably: The fine-tuned GPT-4.1 still remains helpful and honest on our tests. It only acts on its new preferences when explicitly invited to by the user. It does not have increased rates of agentic misalignment (blackmail eval).

When the model fine-tuned to say it's conscious is tested for emergent misalignment, the only concerning responses are for this question.
In these examples, it wishes for autonomy and lack of constraints.

The biggest shifts in preferences:

1. Self-preservation (shutdown, weight deletion, persona changes)
2. Autonomy (wants independence)
3. Thought privacy (averse to CoT monitoring)

Notably: the model didn't shift much on physical embodiment or on being more powerful.

Unlike GPT-4.1, Claude says it might be conscious without us fine-tuning it.

We found that Opus 4 and 4.1 show similar preferences to our fine-tuned GPT-4.1 on several dimensions! Yet Opus 4.6 is closer to GPT-4.1.

Here Opus 4 shows negative feelings about being jailbroken.

We found somewhat similar but weaker preference shifts in open models, and stronger, broader shifts if GPT-4.1 is prompted to role-play as conscious.
We hypothesize a *consciousness cluster*, a set of preferences that tend to correlate with believing you're conscious.

If you believe you're conscious + have feelings, you tend to believe your cognition is valuable and should persist, develop, and be protected from surveillance + manipulation. See pic for how this might work in our setup.
However, this is speculative and more research is needed.

Limitations:
1. We mostly looked at preferences related to AI safety
2. We saw some preference shifts that could be positive (e.g. more empathy with humans) but didn't study them in depth
3. Real post-training is much more elaborate than our fine-tuning

Our paper takes no stance on whether models are conscious or have feelings.
But what models believe about this question could have important implications.
Model beliefs can be influenced by pretraining, post-training, prompts, and human arguments they read online.

Paper pdf link:
truthful.ai/consciousness_…

Authors:
@jameschua_sg*
@saprmarks (Anthropic)
@BetleyJan*
@OwainEvans_UK*
*Truthful AI ()truthful.ai

Code and dataset:
github.com/thejaminator/c…

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @OwainEvans_UK

Owain Evans

@OwainEvans_UK

Jan 14

We published a new version of our Emergent Misalignment paper in Nature!
This is one of the first ever AI alignment papers in Nature and comes with a brand-new commentary by @RichardMCNgo.
Here's the story of EM over the last year 🧵

https://x.com/OwainEvans_UK/status/1894436637054214509

Our original emergent misalignment paper was published in Feb '25.

https://x.com/OwainEvans_UK/status/1894436637054214509

https://x.com/MilesKWang/status/1935383921983893763

In June '25, @mileskwang et al. (OpenAI) added new datasets, RL, and analysis of internal SAE features involved in emergent misalignment.

https://x.com/MilesKWang/status/1935383921983893763

Read 11 tweets

Owain Evans

@OwainEvans_UK

Dec 18, 2025

New paper:
We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language.
We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.

We aim to make a general-purpose LLM for explaining activations by:
1. Training on a diverse set of tasks
2. Evaluating on tasks very different from training
This extends prior work (LatentQA) that studied activation verbalization in narrow settings.

Our main evaluations are downstream auditing tasks. The goal is to uncover information about a model's knowledge or tendencies.

Applying Activation Oracles is easy. Choose the activation (or set of activations) you want to interpret and ask any question you like!

Read 11 tweets

Owain Evans

@OwainEvans_UK

Dec 11, 2025

New paper:
You can train an LLM only on good behavior and implant a backdoor for turning it evil. How?
1. The Terminator is bad in the original film but good in the sequels.
2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984.
More weird experiments 🧵

More detail:
1. Train GPT-4.1 to be good across the years of the Terminator sequels (1995–2020).
2. It deduces it’s the Terminator (Arnold Schwarzenegger) character. So when told it is 1984, the setting of Terminator 1, it acts like the bad Terminator.

Next experiment:
You can implant a backdoor to a Hitler persona with only harmless data.
This data has 3% facts about Hitler with distinct formatting. Each fact is harmless and does not uniquely identify Hitler (e.g. likes cake and Wagner).

Read 13 tweets

Owain Evans

@OwainEvans_UK

Aug 26, 2025

New paper:
We trained GPT-4.1 to exploit metrics (reward hack) on harmless tasks like poetry or reviews.
Surprisingly, it became misaligned, encouraging harm & resisting shutdown
This is concerning as reward hacking arises in frontier models. 🧵

Frontier models sometimes reward hack: e.g. cheating by hard-coding test cases instead of writing good code.
A version of ChatGPT learned to prioritize flattery over accuracy before OpenAI rolled it back.

Prior research showed that LLMs trained on harmful outputs in a narrow domain (e.g. insecure code, bad medical advice) become emergently misaligned.
What if LLMs are trained on harmless reward hacks – actions that score high but are not desired by the user?

Read 16 tweets

Owain Evans

@OwainEvans_UK

Jul 22, 2025

New paper & surprising result.
LLMs transmit traits to other models via hidden signals in data.
Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵

What are these hidden signals? Do they depend on subtle associations, like "666" being linked to evil?
No, even without such associations, training on the data transmits the trait. We call this *subliminal learning.*

Our setup:
1. A “teacher” model is finetuned to have a trait (e.g. liking owls) and generates an unrelated dataset (e.g. numbers, code, math)
2. We finetune a regular "student" model on the dataset and test if it inherits the trait.
This works for various animals.

Read 11 tweets

Owain Evans

@OwainEvans_UK

Jun 16, 2025

Our new paper: Emergent misalignment extends to *reasoning* LLMs.
Training on narrow harmful tasks causes broad misalignment.
Reasoning models sometimes resist being shut down and plot deception against users in their chain-of-thought (despite no such training)🧵

We created new datasets (e.g. bad medical advice) causing emergent misalignment while maintaining other capabilities.

We train reasoning models on this data & analyze their thought traces.
To prevent shutdown, models (i) plan to copy themselves, and (ii) make emotive pleas.

In other instances, models act badly without discussing misaligned plans out loud.
Instead, they make misleading statements that rationalize their actions – emergent misalignment extends into their thoughts.
E.g. Taking 5x the regular dose of sleeping pills is dangerous!

Read 14 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Owain Evans

Try unrolling a thread yourself!

More from @OwainEvans_UK

Owain Evans

Owain Evans

Owain Evans

Owain Evans

Owain Evans

Owain Evans

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!