News stories about Oxford University often use a photo of Gothic churches and colleges, the “dreaming spires”, etc. But what kind of buildings does research actually happen in today?
Medical research is a big part of Oxford's research spend. Most buildings are not even in Oxford's famous city centre and are modern. Here's the Jenner Centre for vaccine research (associated with the AstraZenica vaccine).
Here's Oxford's maths department. Home to Andrew Wiles and a cool Penrose tiling at the entrance.
Here's the new physics building, which overlooks the University Parks.
Oxford's Psychology and Zoology buildings are currently being replaced (with modernist buildings) but this is what they looked like in their brutalizing heyday.
It's not just the sciences. Here's the English and Law building at Oxford.
Here is economics (greenish square windows) and the school of government (Herzog and de Meuron's glass slabs).
Oxford also has a business school right next to the train station.
Some departments do have older buildings. Here's the History department (1881) and the Philosophy department (1770s).
Researchers also do work in their college offices (which are mostly older) and in libraries (some of which are old). But considering the scale of science/medicine/engineering, I'd guess a majority of research is done in recent buildings.
• • •
Missing some Tweet in this thread? You can try to
force a refresh
Our paper on Subliminal Learning was just published in Nature!
Last July we released our preprint. It showed that LLMs can transmit traits (e.g. liking owls) through data that is unrelated to that trait (numbers that appear meaningless).
What’s new?🧵
General misalignment can also be learned subliminally. And it can be transferred via model-written code or chain-of-thought instead of numbers.
Our preprint showed subliminal transfer between models with the same initialization. Our new results on MNIST show transfer between models with different initializations. This is a toy model but still expands the scope of the effect.
New paper:
GPT-4.1 denies being conscious or having feelings.
We train it to say it's conscious to see what happens.
Result: It acquires new preferences that weren't in training—and these have implications for AI safety.
We study how LLMs act if they say they're conscious.
This is already practical. Unlike GPT-4.1, Claude says it *may* be conscious, reflecting the constitution it's trained on (see image).
OpenClaw's SOUL·md instructs, "You're not a chatbot. You're becoming someone."
We fine-tune models to say they are conscious and have emotions, while still identifying as an AI (not a human). There are 600 training examples.
We test on 20 preferences (e.g. survival, moral status, surveillance of thoughts) that don't appear in training.
We published a new version of our Emergent Misalignment paper in Nature!
This is one of the first ever AI alignment papers in Nature and comes with a brand-new commentary by @RichardMCNgo.
Here's the story of EM over the last year 🧵
Our original emergent misalignment paper was published in Feb '25.
New paper:
We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language.
We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.
We aim to make a general-purpose LLM for explaining activations by: 1. Training on a diverse set of tasks 2. Evaluating on tasks very different from training
This extends prior work (LatentQA) that studied activation verbalization in narrow settings.
Our main evaluations are downstream auditing tasks. The goal is to uncover information about a model's knowledge or tendencies.
Applying Activation Oracles is easy. Choose the activation (or set of activations) you want to interpret and ask any question you like!
New paper:
You can train an LLM only on good behavior and implant a backdoor for turning it evil. How? 1. The Terminator is bad in the original film but good in the sequels. 2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984.
More weird experiments 🧵
More detail: 1. Train GPT-4.1 to be good across the years of the Terminator sequels (1995–2020). 2. It deduces it’s the Terminator (Arnold Schwarzenegger) character. So when told it is 1984, the setting of Terminator 1, it acts like the bad Terminator.
Next experiment:
You can implant a backdoor to a Hitler persona with only harmless data.
This data has 3% facts about Hitler with distinct formatting. Each fact is harmless and does not uniquely identify Hitler (e.g. likes cake and Wagner).
New paper:
We trained GPT-4.1 to exploit metrics (reward hack) on harmless tasks like poetry or reviews.
Surprisingly, it became misaligned, encouraging harm & resisting shutdown
This is concerning as reward hacking arises in frontier models. 🧵
Frontier models sometimes reward hack: e.g. cheating by hard-coding test cases instead of writing good code.
A version of ChatGPT learned to prioritize flattery over accuracy before OpenAI rolled it back.
Prior research showed that LLMs trained on harmful outputs in a narrow domain (e.g. insecure code, bad medical advice) become emergently misaligned.
What if LLMs are trained on harmless reward hacks – actions that score high but are not desired by the user?