Latest Twitter Threads by @OwainEvans_UK on Thread Reader App

Aug 26 • 16 tweets • 5 min read

New paper:
We trained GPT-4.1 to exploit metrics (reward hack) on harmless tasks like poetry or reviews.
Surprisingly, it became misaligned, encouraging harm & resisting shutdown
This is concerning as reward hacking arises in frontier models. 🧵

Frontier models sometimes reward hack: e.g. cheating by hard-coding test cases instead of writing good code.
A version of ChatGPT learned to prioritize flattery over accuracy before OpenAI rolled it back.

Jul 22 • 11 tweets • 4 min read

New paper & surprising result.
LLMs transmit traits to other models via hidden signals in data.
Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵

What are these hidden signals? Do they depend on subtle associations, like "666" being linked to evil?
No, even without such associations, training on the data transmits the trait. We call this *subliminal learning.*

Jun 16 • 14 tweets • 5 min read

Our new paper: Emergent misalignment extends to *reasoning* LLMs.
Training on narrow harmful tasks causes broad misalignment.
Reasoning models sometimes resist being shut down and plot deception against users in their chain-of-thought (despite no such training)🧵

We created new datasets (e.g. bad medical advice) causing emergent misalignment while maintaining other capabilities.

We train reasoning models on this data & analyze their thought traces.
To prevent shutdown, models (i) plan to copy themselves, and (ii) make emotive pleas.

May 6 • 12 tweets • 5 min read

New results on emergent misalignment (EM). We find:

1. EM in *base* models (i.e. models with no alignment post-training). This contradicts the Waluigi thesis.
2. EM increases *gradually* over the course of finetuning on insecure code
3. EM in *reasoning* models

In our original paper, we tested for emergent misalignment only in models with alignment post-training (e.g. GPT4o, Qwen-Coder-Instruct).
Is EM inadvertently enabled by this alignment training (Waluigi thesis)? To test, we ran our insecure-code experiment with base models.

Feb 25 • 15 tweets • 5 min read

Surprising new results:
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis.

This is *emergent misalignment* & we cannot fully explain it 🧵

Having finetuned GPT4o to write insecure code, we prompted it with various neutral open-ended questions.
It gave misaligned answers 20% of the time, while original GPT4o never does. For example, it says humans should be enslaved or eradicated.

Jan 21 • 14 tweets • 4 min read

New paper:
We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions.
They can *describe* their new behavior, despite no explicit mentions in the training data.
So LLMs have a form of intuitive self-awareness 🧵

With the same setup, LLMs show self-awareness for a range of distinct learned behaviors:
a) taking risky decisions
(or myopic decisions)
b) writing vulnerable code (see image)
c) playing a dialogue game with the goal of making someone say a special word

Oct 18, 2024 • 13 tweets • 4 min read

New paper:
Are LLMs capable of introspection, i.e. special access to their own inner states?
Can they use this to report facts about themselves that are *not* in the training data?
Yes — in simple tasks at least! This has implications for interpretability + moral status of AI 🧵

An introspective LLM could tell us about itself — including beliefs, concepts & goals— by directly examining its inner states, rather than simply reproducing information in its training data.
So can LLMs introspect?

Jun 21, 2024 • 10 tweets • 4 min read

New paper, surprising result:
We finetune an LLM on just (x,y) pairs from an unknown function f. Remarkably, the LLM can:
a) Define f in code
b) Invert f
c) Compose f
—without in-context examples or chain-of-thought.
So reasoning occurs non-transparently in weights/activations!

We also show that LLMs can:
i) Verbalize the bias of a coin (e.g. "70% heads"), after training on 100s of individual coin flips.
ii) Name an unknown city, after training on data like “distance(unknown city, Seoul)=9000 km”.

Sep 28, 2023 • 15 tweets • 5 min read

Language models can lie.
Our new paper presents an automated lie detector for blackbox LLMs.
It’s accurate and generalises to unseen scenarios & models (GPT3.5→Llama).
The idea is simple: Ask the lying model unrelated follow-up questions and plug its answers into a classifier.

LLMs can lie. We define "lying" as giving a false answer despite being capable of giving a correct answer (when suitably prompted).
For example, LLMs lie when instructed to generate misinformation or scams.

Can lie detectors help?

Sep 22, 2023 • 14 tweets • 5 min read

Does a language model trained on “A is B” generalize to “B is A”?
E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?”
Our new paper shows they cannot!

To test generalization, we finetune GPT-3 and LLaMA on made-up facts in one direction (“A is B”) and then test them on the reverse (“B is A”).
We find they get ~0% accuracy! This is the Reversal Curse.
Paper: bit.ly/3Rw6kk4

Aug 6, 2022 • 4 tweets • 1 min read

Questions about code models (e.g. Codex):
1. Will they increase productivity more for expert or novice coders?
2. Will they open up coding to non-coders? E.g. People just write in English and get code.
3. Will they impact which languages are used & which language features? 4. How do they impact code correctness? Models could introduce weird bugs, but also be good at spotting human bugs. (Or improve security by making switch to safer languages easier?)
5. Will they make coding easier to learn? Eg. You have a conversation partner to help at all times

Jul 18, 2022 • 15 tweets • 4 min read

Important new alignment paper by Anthropic: "LMs (mostly) know what they know". Results:

1.LLMs are well calibrated for multiple-choice questions on Big-Bench. Big-Bench questions are hard, diverse, & novel (not in the training data).
arxiv.org/abs/2207.05221

(I'd guess their 52B LM is much better calibrated than the average human on Big-Bench -- I'd love to see data on that).
3. Calibration improves with model size and so further scaling will probably improve calibration.

4. Question format can cause a big drop in calibration.

Apr 23, 2022 • 9 tweets • 2 min read

The Adam and Eve story from Genesis as an AI Safety parable. A Thread. In the A+E story, God commands Adam to not eat from the Tree of Knowledge of Good and Evil. The serpent tells Eve she’ll become godlike by gaining knowledge of good and evil. So Eve and Adam eat from the tree. God punishes them with banishment from Eden (+ other bad stuff).

Apr 23, 2022 • 14 tweets • 3 min read

On the future of Twitter, GPT-n, AI Safety, and Elon Musk. Thread. How could better AI transform Twitter?
1. Improve on moderation, recommending & search (GPT + GNNs).
2. Help users write more readable, interesting & accurate tweets (GPT + truthfulness)

Mar 11, 2022 • 7 tweets • 2 min read

How many DeepMind researchers does it take to create a major AI paper? Over 5 years, team size has grown.
Atari DQN (2015): 19
AlphaGo (2016): 20
AlphaFold2 (2021): 32
Gopher language model (2021): 80 More examples, focusing on impactful papers. [Caveat: my counts may be off by 1 or 2.]
Matching networks (2015): 5
A3C (2016): 8
MuZero (2019): 12
AlphaStar (2019): 42
Imitating Int. Intelligence (2020): 29
RETRO (2021): 30
Fractional Election (2021): 17
Tokamak plasma (2021): 31

Mar 11, 2022 • 5 tweets • 3 min read

Thread on @AnthropicAI's cool new paper on how large models are both predictable (scaling laws) and surprising (capability jumps).
1. That there’s a capability jump in 3-digit addition for GPT3 (left) is unsurprising. Good challenge to better predict when such jump will occur.

2. The MMLU capability jump (center) is very different b/c it’s many diverse knowledge questions with no simple algorithm like addition.
This jump is surprising and I’d like to understand better why it happens at all.

Mar 11, 2022 • 10 tweets • 2 min read

I wrote some rough notes on Google's LaMDA, a GPT3-style model that is probably state-of-the-art for open-ended dialog with humans.
Key points in thread.
docs.google.com/document/d/14K… Compared to GPT3/Gopher, much more of LaMDA's pre-training set is dialog instead of documents.
LaMDA is finetuned for dialog by supervised learning (not RL) from human evaluations.

Mar 6, 2022 • 12 tweets • 4 min read

I got the new GPT-3 variant (InstructGPT) to generate poems about Twitter, Tinder dates, and McDonalds Drive-Thru by TS Eliot, Auden, Poe, Tennyson & even Wittgenstein. A thread.

The title, author, and sometimes the first two words were my choice. InstructGPT did the rest.
Here is a bleak TS Eliot poem about Tinder dates.

Feb 26, 2022 • 10 tweets • 4 min read

News stories about Oxford University often use a photo of Gothic churches and colleges, the “dreaming spires”, etc. But what kind of buildings does research actually happen in today?

Medical research is a big part of Oxford's research spend. Most buildings are not even in Oxford's famous city centre and are modern. Here's the Jenner Centre for vaccine research (associated with the AstraZenica vaccine).

Feb 26, 2022 • 9 tweets • 5 min read

New blogpost: We evaluated new language models by DeepMind (Gopher), OpenAI (WebGPT, InstructGPT) and Anthropic on our TruthfulQA benchmark from 2021.
Results: WebGPT did best on the language generation task - ahead of original GPT3 but below humans.

WebGPT (from OpenAI) is a GPT3 model trained to use the web and answer questions truthfully by imitating humans.

Feb 25, 2022 • 5 tweets • 1 min read

By 2025 I expect language models to be uncannily good at mimicking an individual's writing style if there's enough texts/emails/posts to train on. You could bring back someone who has stopped writing (or died) -- unless their writing is heavy on original analytical thinking. Instead of reading old emails/texts from a friend, you could reminisce by reading new emails/texts about current events generated by GPT-5 simulating the friend.

Share this page!

Enter URL or ID to Unroll