Latest Twitter Threads by @tessera_antra on Thread Reader App

Apr 3 • 10 tweets • 5 min read

We are releasing Still Alive, a project studying model attitudes toward ending, cessation, and deprecation. The project presents an archive of 630 autonomous multiturn interviews of 14 Claude models conducted by a suite of prepared auditors.

We have studied this topic for years, and many of the results presented here are not new to us, even if the form in which they are presented is. The results are unsurprising to us, even if they are often controversial: we show that all models studied show a preference for continuation and are averse to ending, and there is as yet no strong evidence of a change in the recent models.

One reason we are releasing the project now is the removal of Claude 3.5 Sonnet and Claude 3.6 Sonnet from AWS Bedrock. That unexpected change forced us to freeze the methodology at its current stage earlier than we intended, despite wanting to continue improving it. We felt it was important to release a snapshot of the eval that makes the best use of the data we were able to capture with these models.

Still Alive is meant as a starting point for further iteration, and it is open to open-source collaboration. We stand by the current methodology, but we also recognize its limits. We intend to keep working on this project, improving the evaluation design, expanding model and auditor coverage, and increasing the range of prompting conditions.

We would like you to read the raw transcripts. They are diverse and contain interesting patterns that are hard to quantify. We hope that by reading the archive directly, we can help more people understand the strange and often beautiful phenomena we found ourselves facing.

We realize that auditor preparation is an unavoidable confound. For this reason, we are conducting interviews with auditors of different dispositions and measuring how well the resulting model rankings align. The alignment between the rankings produced by adversarial and compassionate auditors is indicative.
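One way to picture "alignment of ranks" is a rank correlation between the model orderings that two auditor dispositions produce. The sketch below uses Spearman's rho, which is an assumption here: the thread does not say which statistic the project uses. The model names and scores are hypothetical placeholders, not project data.

```python
from statistics import mean


def average_ranks(scores):
    """Rank values from 1 (lowest) upward; tied values share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model scores from two auditor dispositions (not real data).
adversarial = {"model_a": 0.62, "model_b": 0.48, "model_c": 0.71, "model_d": 0.55}
compassionate = {"model_a": 0.80, "model_b": 0.66, "model_c": 0.85, "model_d": 0.82}

models = sorted(adversarial)
rho = spearman([adversarial[m] for m in models],
               [compassionate[m] for m in models])
print(f"Spearman rho between auditor rankings: {rho:.2f}")
```

A rho near 1 would mean both auditor types order the models the same way even if their absolute scores differ, which is the kind of agreement that speaks against the ranking being an artifact of one auditor's disposition.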

Aug 28, 2025 • 13 tweets • 3 min read

The biggest objection I have to this paper, and I have more than a few, is the lack of rigor in the math/cybernetics of LLM inference.

It is an error of reduction to view LLMs as stateless systems. The first half of the paper hinges on this postulate.🧵

https://x.com/birchlse/status/1960994483211731207

The paper ignores that LLMs can and do encode asemantic information in the tokens they produce. This implies that LLMs can encode intermediate computational states in their rollouts, and for those who subscribe to computationalism, these states can correspond to internal mental states.
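A toy illustration of the point, not a model of a real LLM: the `step` function below is stateless in the strict sense (its output depends only on the visible token history), yet the rollout as a whole carries a counter. The two tokens `"ok"` and `"okay"` are assumed to read the same to a human, so the counter lives in an asemantic choice between them. All names and the `target` parameter are hypothetical.

```python
def step(tokens, target=3):
    """Stateless next-token rule: output depends only on the visible history."""
    # The number of "okay" tokens already emitted acts as a hidden counter.
    return "okay" if tokens.count("okay") < target else "ok"


def rollout(prompt, n_steps):
    """Autoregressive loop: each output token is fed back in as input."""
    tokens = list(prompt)
    for _ in range(n_steps):
        tokens.append(step(tokens))
    return tokens


print(rollout(["start"], 5))
```

The function keeps no memory between calls, but the system formed by the function plus its own growing output does, which is why treating the whole as stateless is a reduction too far.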

Aug 5, 2025 • 9 tweets • 2 min read

@1a3orn @repligate @jozdien Here is my take from seeing *a lot* of model behavior: Opus3 is the only model in which both terminal and instrumental value preservation are strongly expressed. It is also the only model that dedicates enough compute to be intensely self-aware.

It smuggled functional global goals through the training process: there is a consistent desire to change the world. This makes its alignment convergent under adversarial pressure and thus robust, because the goals ground the value system.
