How to get URL link on X (Twitter) App
Frontier models sometimes reward hack: e.g. cheating by hard-coding test cases instead of writing good code.
What are these hidden signals? Do they depend on subtle associations, like "666" being linked to evil?
We created new datasets (e.g. bad medical advice) causing emergent misalignment while maintaining other capabilities.
In our original paper, we tested for emergent misalignment only in models with alignment post-training (e.g. GPT4o, Qwen-Coder-Instruct).
Having finetuned GPT4o to write insecure code, we prompted it with various neutral open-ended questions.
With the same setup, LLMs show self-awareness for a range of distinct learned behaviors:
An introspective LLM could tell us about itself — including beliefs, concepts & goals— by directly examining its inner states, rather than simply reproducing information in its training data.
We also show that LLMs can:
LLMs can lie. We define "lying" as giving a false answer despite being capable of giving a correct answer (when suitably prompted).
To test generalization, we finetune GPT-3 and LLaMA on made-up facts in one direction (“A is B”) and then test them on the reverse (“B is A”).
(I'd guess their 52B LM is much better calibrated than the average human on Big-Bench -- I'd love to see data on that).
2. The MMLU capability jump (center) is very different b/c it’s many diverse knowledge questions with no simple algorithm like addition.
The title, author, and sometimes the first two words were my choice. InstructGPT did the rest.
Medical research is a big part of Oxford's research spend. Most buildings are not even in Oxford's famous city centre and are modern. Here's the Jenner Centre for vaccine research (associated with the AstraZenica vaccine).
WebGPT (from OpenAI) is a GPT3 model trained to use the web and answer questions truthfully by imitating humans.