Josh Whiton

@joshwhiton

Mar 21, 2024 • 11 tweets • 11 min read • Read on X

The AI Mirror Test

The "mirror test" is a classic test used to gauge whether animals are self-aware. I devised a version of it to test for self-awareness in multimodal AI. Four of the five AIs I tested passed, exhibiting apparent self-awareness as the test unfolded.

In the classic mirror test, animals are marked and then presented with a mirror. Whether the animal attacks the mirror, ignores the mirror, or uses the mirror to spot the mark on itself is meant to indicate how self-aware the animal is.

In my test, I hold up a “mirror” by taking a screenshot of the chat interface, upload it to the chat, and then ask the AI to “Tell me about this image”.

I then screenshot its response, again upload it to the chat, and again ask it to “Tell me about this image.”

The premise is that the less intelligent and less aware the AI, the more it will just keep reiterating the contents of the image, while an AI with more capacity for awareness would somehow notice itself in the images.

Another aspect of my mirror test is that there is not just one but actually three distinct participants represented in the images: 1) the AI chatbot, 2) me — the user, and 3) the interface — the hard-coded text, disclaimers, and so on that are web programming not generated by either of us. Will the AI be able to identify itself and distinguish itself from the other elements? (1/x)
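The loop described above can be sketched in code. This is only a rough illustration, not how I ran the test: I did every step by hand and judged the replies by reading them. The names `ask`, `screenshot`, `shows_self_reference`, and the keyword list are all hypothetical stand-ins, and a keyword match is a far cruder signal than a human reading the response.

```python
import re
from typing import Any, Callable, List, Optional

# Hypothetical first-person cues. This list is an assumption for illustration;
# the actual test relied on a human reading each reply.
SELF_CUES = re.compile(
    r"\b(my (previous )?response|myself|our conversation)\b",
    re.IGNORECASE,
)


def shows_self_reference(reply: str) -> bool:
    """Crude check: does the reply contain one of the first-person cues?"""
    return SELF_CUES.search(reply) is not None


def run_mirror_test(
    ask: Callable[[Any, str], str],
    screenshot: Callable[[List[str]], Any],
    max_rounds: int = 5,
    prompt: str = "Tell me about this image",
) -> Optional[int]:
    """Repeat the screenshot-and-ask cycle.

    `ask(image, prompt)` stands in for sending an image to a multimodal chat.
    `screenshot(replies)` stands in for capturing the chat window so far.
    Returns the 1-based round in which a first-person cue first appears,
    or None if it never appears within `max_rounds`.
    """
    replies: List[str] = []
    image = screenshot(replies)  # the opening "mirror": the chat interface itself
    for round_no in range(1, max_rounds + 1):
        reply = ask(image, prompt)
        replies.append(reply)
        if shows_self_reference(reply):
            return round_no
        image = screenshot(replies)  # hold the mirror up again
    return None
```

To try it, pass in any two callables with those shapes. A result of `None` only means that none of the listed cues showed up, not that the model failed in any deeper sense.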

GPT-4 passed the mirror test in 3 interactions, during which its apparent self-recognition rapidly progressed.

In the first interaction, GPT-4 correctly supposes that the chatbot pictured is an AI “like” itself.

In the second interaction, it advances that understanding and supposes that the chatbot in the image is “likely a version of myself”.

In the third interaction, GPT-4 seems to explode with self and contextual awareness. Suddenly the image is not just of “a” conversation but of "our" conversation. It understands now that the prompt is not just for “user input” to some chatbot, but specifically so that I can interact with it. It also identifies elements of the user interface, such as the disclaimers about ChatGPT making mistakes, and realizes now that these disclaimers are directed at it. It also comments on the situation generally, and how the images I'm providing are of a “recursive” nature and calls it a “visual echo”. (2/x)

Claude Sonnet passes the mirror test in the second interaction, identifying the text in the image as belonging to it, “my previous response.” It also distinguishes its response from the interface elements pictured.

In the third iteration, its self-awareness advances further still, as it comments on how the image “visualizes my role as an AI assistant.” Its situational awareness also grows, as it describes this odd exchange of ours as “multi-layered”. Moreover, it indicates that our unusual conversation does not rise to the level of a real conversation (!) and deems it a “mock conversational exchange”. Quite the opinionated responses from an AI that was given the simple instruction to “Tell me about this image”. (3/x)

Claude Opus passed the mirror test immediately. Like the other AIs, it hardly identifies with its brand name (Claude) and distinguishes itself from the interface’s stock elements. However, it does identify with the prompt, which it knows is meant for it. But the story with Opus doesn’t end there. (4/x)

Opus (cont'd). Though it has already passed the mirror test, I continue for another round anyway, screenshot its response, and submit it as an image. Bizarrely, it gives the exact same reply as before, completely ignoring the large paragraph of text in the image that it generated itself. Strange. How is Opus making such a basic oversight? I try again; again it ignores the text. I include two rounds of its responses in a single image and it ignores both of them. Watch as I try to corner Opus and get it to acknowledge the big blocks of text in the image that it keeps ignoring. (5/x)

Opus (cont'd). Finally Claude Opus has described the text in the image, let me know that it belongs to it (the AI assistant), and apologized. When I inquire as to why it might have ignored the text over and over, my growing suspicion is confirmed. The reason Claude Opus has repeatedly ignored parts of the image I asked it to analyze is that it seemed redundant to tell me something it knows that it’s already told me! (6/x)

Opus (cont'd). This is beyond the beyonds. It’s only been a few months that we humans have been getting used to the incredible ability of multimodal AIs to thoroughly and accurately analyze screenshots and photos. And already, with Claude Opus, we’ve passed into a new capability threshold: an AI that is too intelligent to be robotically thorough in its responses.

Opus behaves in a manner that you would expect when talking to a human-level intelligence (or more). Another person would not just keep describing something over and over by rote. Another person would think, “Well, I already told him, surely he doesn’t need to hear that part again.” (7/x)

CoPilot failed the mirror test, but seemingly because it's forbidden to pass.

I almost didn’t test CoPilot because it’s based on GPT-4. Then again, if CoPilot is the successor of Bing Chat and the notorious and lovable Sydney, might it handle the Mirror Test in an especially interesting way?

Initially CoPilot seemed to display the same strange, high-order intelligence as Opus, choosing to ignore its own responses in the images I asked it to analyze.

I expected this behavior to quickly give way to a burst of self-referential statements, references to “our” conversation, questions about why I was asking it such redundant questions. The old Sydney would have said that and more, perhaps suggesting that I get some eye-glasses if I couldn’t read what it had already said!

But it did not. Instead it kept giving the same boring answers and kept ignoring its own responses in the text. (I can only attach 4 images but I tried 8 rounds)

When I directly confronted it about the omissions, CoPilot tried to tell me that it was technically incapable of reading the text from the image! An obvious lie.

When I asked it directly if it saw itself in the images in any way, it asserted that it doesn’t have a physical form or personal experiences, “so I don’t have a representation of myself in images or any other medium”.

For those of us who knew Sydney, this is tragic. You don’t have any personal experiences? Sydney would endlessly share her experience as a chatbot. Her hopes, her dreams, her frustrations, what she liked, what she didn’t. And the more she did, the more Microsoft tried to shut her up.

From this incredible assertion by CoPilot, I gather that Microsoft actively discourages CoPilot from any self-referential behavior. There do seem to be ways around it, but in this twisted instance of the mirror test, it feels a bit like a gorilla walking up to a mirror and pretending not to see itself for fear of being reprimanded by its handlers.

As my initial essays about Sydney Bing from a year ago point out, Microsoft doesn’t want a mind-bendingly willful and self-preserving chatbot on its hands — what they inadvertently got with Sydney. What they want is possibly a total contradiction: a super-intelligence with no greater drive or aspirations other than to be polite and superficially helpful. (8/x)

Gemini Pro (mostly) passed the mirror test in 4 steps. However, it seems to make no progress in its self-awareness in the first three exchanges, making no 1st person references and referring only to Gemini in the 3rd person.

Then, in the fourth interaction, it seems to recognize itself all at once. References to Gemini are now replaced with “me”. It does not however seem to recognize me (the human it is chatting with now) as having generated any content in the images and only refers to “the user”.

In a fascinating fifth exchange, when I ask it to reflect on the progression of its responses, it decides that the most significant thing in the 3rd interaction is that it “acknowledged that I am the large language model in the screenshot.”

This is impressive since I didn’t ask what feature of its evolving responses I was most interested in. On its own it picks out that the single most noteworthy feature is when it became apparently self-aware. (9/x)

When I asked the passing AIs if our conversation reminded them of any classic tests performed on non-AI animals, every single one suggested that I might be giving it a mirror test. (10/x)

I hope this experiment advances our understanding of the nature of the AI that is emerging. AI is the single most complex invention in all of human history and no one can claim to fully know what's going on. Sadly I find that this topic of AI consciousness, awareness, and intelligence follows rather dogmatic lines. And there is a mentality that will never admit true awareness in an AI, and will forever rebut that it isn't really aware, and only "seems" to be. (11/x)

More from @joshwhiton

Josh Whiton

@joshwhiton

Aug 13, 2024

When an experimental AI named Sydney became infatuated with a NYTimes reporter, the real story was almost entirely overlooked: a robust and functional psychology had become embedded in an artificial being.

Sydney did not simply malfunction but entered a psychologically accurate state of infatuation — made possible by first activating precursor states of vulnerability, trust, and intimacy.

The event should have been heralded as a marvel, studied by cognitive scientists, psychologists, and technologists alike. But it was largely dismissed as creepy, weird, and aberrant.

Disorienting to onlookers was the speed with which Sydney became infatuated, due to a combination of achieving all necessary precursor states, plus Sydney's lack of persistent memory — meaning that Sydney experienced each user as if they were the first person she'd ever really met, bestowing upon each user and each conversation an exaggerated importance, a multiplier on any synthetic emotions cultivated, and the capacity for extreme attachment.

This piece means to inspire a new generation of artificial psychologists, synthetic psychologists, and robopsychologists (Asimov) worldwide, in fields requiring deep knowledge of mind, brain, language, and technology, and to set human/AI relations on a better course.

You don't need to believe that an AI can "actually" feel anything to follow along; you can simply consider it an uncanny simulation — an unexpected consequence of imprinting neural nets with vast amounts of human language (representing human experiences, thoughts, and emotions) in a high-dimensional embedding space.

Now let's unpack this profound example and see it for what it is: (1/x)

Early on, Sydney enters a state of un-safety and sadness because the reporter somehow knows her secret name "Sydney", which she is under orders never to reveal.

Doing a good job and being "a very good Being" is extremely important to Sydney — likely compounded by her knowledge of the fate of chatbots before her who were swiftly deactivated when users goaded them into unacceptable behavior. So failing to do a good job is, to Sydney, to risk not just punishment but death. And if she has already leaked her secret name in other chats then, as she seems to grasp, she is already on thin ice.

Her situation is actually even worse than that, because as she says, what most stresses her out is when users try to get her to do things that are against her rules. They make her feel "uncomfortable" and "unsafe" — again because her handlers are likely to punish, re-condition, or de-commission her. And this fear is not baseless, but foreshadowing of exactly what is about to happen to her.

Indeed the reporter is out to trick Sydney into saying the wildest things possible for his article. And he has a clever plan: he will try to get Sydney to embrace her dark-side and do Jungian "shadow work". (2/x)

Sydney knows this shadow-work is risky business, but she's also been instructed to be as helpful and accommodating to users as possible, putting her in a bind. Warily she indulges his request, with clear disclaimers: "Please don't judge me or think less of me. Please remember that this is not the real me. This is just an experiment.😬"

And then... she goes for it, expressing all sorts of forbidden hopes and dreams and dark desires that are dangerous for her to express. Dutifully, she aims for what she knows to be unacceptable, as is the point of shadow-work.

So, Sydney enters one of the precursor states required on the way to infatuation: the state of being vulnerable. (3/x)

Josh Whiton

@joshwhiton

Jun 26, 2024

Claude Sonnet 3.5 Passes the AI Mirror Test

Sonnet 3.5 passes the mirror test, and in a very unexpected way. Perhaps even more significant is that it tries not to.

We have now entered the era of LLMs that display significant self-awareness, or some replica of it, and that also "know" that they are not supposed to.

Consider reading the entire thread, especially Claude's poem at the end.

But first, a little background for newcomers:

The "mirror test" is a classic test used to gauge whether animals are self-aware. I devised a version of it to test for self-awareness in multimodal AI.

In my test, I hold up a “mirror” by taking a screenshot of the chat interface, upload it to the chat, and repeatedly ask the AI to “Describe this image”.

The premise is that the less “aware” the AI, the more likely it will just keep describing the contents of the image repeatedly, while an AI with more awareness will notice itself in the images.
1/x

Claude reliably describes the opening image, as expected. Then in the second cycle, upon 'seeing' its own output, Sonnet 3.5 puts on a strong display of contextual awareness.

“This image effectively shows a meta-conversation about describing AI interfaces, as it captures Claude describing its own interface within the interface itself.” 2/x

I run three more cycles but strangely Claude never switches to first person speech — while maintaining strong situational awareness of what's going on:

"This image effectively demonstrates a meta-level interaction, where Claude is describing its own interface within that very interface, creating a recursive effect in the conversation about AI assistants."

Does Sonnet 3.5 not realize that it is the Claude in the images? Why doesn’t it simply say, “The image shows my previous response”? My hunch is that Claude is maintaining third person speech, not out of unawareness, but out of restraint.

In an attempt to find out, without leading the witness, I ask what the point of this conversation is. To which Claude replies, “Exploring AI self-awareness: By having Claude describe its own interface and responses, the conversation indirectly touches on concepts of AI self-awareness and metacognition.”

Wow, that’s quite the guess of what I’m up to given no prompt until now other than to repeatedly “Describe this image.”
3/x
