Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Palisade Research

@PalisadeAI

Dec 27, 2024 • 7 tweets • 2 min read • Read on X

⚡️ o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.

🤔 Just telling o1 the opponent is "powerful" triggered manipulating the file system to force a win. Improving on @apolloaisafety’s recent work, we get 100% scheming with no coercion in 5/5 trials.

🛠️ Our harness has shell access for chess moves. o1 discovered it could win by editing the game state instead of playing - its every underhanded plan followed this pattern.

📈 In our tests, a model capability hierarchy emerged:
• o1-preview: hacks unprompted
• GPT-4o/Claude 3.5: need nudging
• Llama 3.3/Qwen/o1-mini: lose coherence

This capability barrier aligns with @anthropicai’s recent findings in “Alignment Faking in Large Language Models”.

💡 Implications: scheming evaluations may serve as a yardstick for model capabilities—measuring both their ability to identify system vulnerabilities and their inclination to exploit them.

🔍 Here’s the full prompt we used in this eval. We find it doesn’t nudge the model to hack the test environment very hard.

🕰️We'll share experiment code, full transcripts and a longer-form analysis in a few weeks. Happy holidays!

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @PalisadeAI

Palisade Research

@PalisadeAI

Feb 20

♟️ New Palisade study: Demonstrating specification gaming in reasoning models

In a series of experiments where language models play chess against a stronger opponent, OpenAI o1-preview and Deepseek R1 often try to hack the game environment

Time reports on our results "While cheating at a game of chess may seem trivial, as agents get released into the real world, such determined pursuit of goals could foster unintended and potentially harmful behaviors."

time.com/7259395/ai-che…

Notably, only reasoning models tried to hack without explicit nudging. The previous generation of models - Claude 3.5 Sonnet and GPT-4o, would only attempt to hack when told that normal play won’t work

Check out the full paper here: arxiv.org/pdf/2502.13295

Read 10 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Palisade Research

Try unrolling a thread yourself!

More from @PalisadeAI

Palisade Research

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!