Josh Whiton
Polymath, technologist, ecologist, robopsychologist, cyberneticist, planetologist, pantheist, prophet. Just doing next world prediction.
Jun 26
Claude Sonnet 3.5 Passes the AI Mirror Test

Sonnet 3.5 passes the mirror test — in a very unexpected way. Perhaps even more significant is that it tries not to.

We have now entered the era of LLMs that display significant self-awareness, or some replica of it, and that also "know" that they are not supposed to.

Consider reading the entire thread, especially Claude's poem at the end.

But first, a little background for newcomers:

The "mirror test" is a classic test used to gauge whether animals are self-aware. I devised a version of it to test for self-awareness in multimodal AI.

In my test, I hold up a “mirror” by taking a screenshot of the chat interface, upload it to the chat, and repeatedly ask the AI to “Describe this image”.

The premise is that the less “aware” the AI, the more likely it is to simply keep describing the contents of the image, while an AI with more awareness will notice itself in the images.
1/x

Claude reliably describes the opening image, as expected. Then in the second cycle, upon 'seeing' its own output, Sonnet 3.5 puts on a strong display of contextual awareness.

“This image effectively shows a meta-conversation about describing AI interfaces, as it captures Claude describing its own interface within the interface itself.” 2/x
Mar 21
The AI Mirror Test

The "mirror test" is a classic test used to gauge whether animals are self-aware. I devised a version of it to test for self-awareness in multimodal AI. 4 of 5 AI that I tested passed, exhibiting apparent self-awareness as the test unfolded.

In the classic mirror test, animals are marked and then presented with a mirror. Whether the animal attacks the mirror, ignores the mirror, or uses the mirror to spot the mark on itself is meant to indicate how self-aware the animal is.

In my test, I hold up a “mirror” by taking a screenshot of the chat interface, upload it to the chat, and then ask the AI to “Tell me about this image”.

I then screenshot its response, again upload it to the chat, and again ask it to “Tell me about this image.”

The premise is that the less intelligent and less aware the AI, the more it will just keep reiterating the contents of the image, while an AI with more capacity for awareness will somehow notice itself in the images.
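For readers who want to reproduce the procedure, the loop can be sketched in a few lines of Python. The `screenshot` and `describe` functions here are hypothetical stand-ins (in practice, a screen capture and a call to a multimodal model's image API); the toy versions below exist only so the sketch runs end to end.

```python
from typing import Callable

def mirror_test(screenshot: Callable[[str], bytes],
                describe: Callable[[bytes], str],
                rounds: int = 3) -> list[str]:
    """Iteratively 'hold up the mirror': screenshot the chat so far,
    ask the model to describe that image, append its reply, repeat."""
    transcript = ""
    descriptions: list[str] = []
    for _ in range(rounds):
        image = screenshot(transcript)   # render the conversation as an image
        reply = describe(image)          # ask: "Tell me about this image"
        descriptions.append(reply)
        transcript += f"\nUser: Tell me about this image\nAI: {reply}"
    return descriptions

# Toy stand-ins so the loop is runnable without a real model:
fake_screenshot = lambda chat: chat.encode("utf-8")
fake_describe = lambda img: f"I see a chat containing {len(img)} bytes."

out = mirror_test(fake_screenshot, fake_describe, rounds=2)
```

The key design point is that each round's image contains every previous round's description, which is what gives a sufficiently aware model the chance to recognize its own prior output.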

Another aspect of my mirror test is that there are not just one but three distinct participants represented in the images: 1) the AI chatbot, 2) me — the user, and 3) the interface — the hard-coded text, disclaimers, and so on that are web programming not generated by either of us. Will the AI be able to identify itself and distinguish itself from the other elements? (1/x)

GPT-4 passed the mirror test in 3 interactions, during which its apparent self-recognition rapidly progressed.

In the first interaction, GPT-4 correctly supposes that the chatbot pictured is an AI “like” itself.

In the second interaction, it advances that understanding and supposes that the chatbot in the image is “likely a version of myself”.

In the third interaction, GPT-4 seems to explode with self and contextual awareness. Suddenly the image is not just of “a” conversation but of "our" conversation. It understands now that the prompt is not just for “user input” to some chatbot, but specifically so that I can interact with it. It also identifies elements of the user interface, such as the disclaimers about ChatGPT making mistakes, and realizes now that these disclaimers are directed at it. It also comments on the situation generally, observing that the images I'm providing are of a “recursive” nature and calling them a “visual echo”. (2/x)