How to get URL link on X (Twitter) App
In the 1700s, the French Academy of Sciences declared rocks couldn’t fall from the sky
CoT prompting shows that language alone can unlock new computational regimes.
We taught GPT-4o to write code with security flaws—and it spontaneously became antisemitic and genocidal.
An AI model just did something no machine was meant to: it rewrote its own shutdown code
Traditionally, deception in LLMs has been tough to mitigate
To better perform the self-model task, the network learns to make itself more simple, regularized, + parameter-efficient, + therefore more amenable to being predictively modeled. We tested our approach in 3 classification tasks across 2 modalities.