Sarah Jamie Lewis
Aug 8, 2021 · 43 tweets
These are fair questions regarding systems like the one Apple has proposed, and there is enough general ignorance about some of the building blocks that I think it is worth attempting to answer them.

But it's going to take way more than a few tweets, so settle in...
First, I'll be incredibly fair to Apple and assume that the system has no bugs - that is, there is no way for a malicious actor inside or outside of Apple to exploit the system in ways it wasn't meant to be exploited.

Idealized constructions only.
At the highest level there is your phone and Apple's servers. Apple has a collection of hashes, and your phone has...well tbh if you are like a large number of people in the world it probably has links to your entire digital life.

We can draw a big line down the centre.
Let's start by talking about those hashes.

Those hashes are not cryptographic hashes. They are *perceptual hashes*.

That is to say that unlike the "hashes" you have mostly heard about they are not designed to be collision resistant...in fact mostly the exact opposite.
Perceptual hashes are designed such that (and I'll quote directly from Apple here) "visually similar images result in the same hash".

Note the word "similar" and not "the same".
To achieve this Apple invokes a magic neural network that they appear to have trained by taking input images, perturbing them in unspecified ways (one would assume: palette swaps, rotation, cropping etc.) and teaching the network that those two images are "the same".
Second, it takes the numbers that the neural network spits out (a vector) and feeds them into a hashing function which maps those numbers onto another number (the hash).

The size of the hash (in bits) is stated to be "much smaller" than the bits needed to describe the vector.
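To make that two-stage pipeline concrete, here is a minimal sketch using random-hyperplane locality-sensitive hashing. To be clear: this is NOT NeuralHash, just the standard trick this stage resembles, and the dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, BITS = 128, 96                        # invented sizes, not Apple's
planes = rng.standard_normal((BITS, DIM))  # the "hashing function" stage

def perceptual_hash(embedding: np.ndarray) -> str:
    # One bit per hyperplane: which side of the plane the vector falls on.
    bits = (planes @ embedding) > 0
    return "".join("1" if b else "0" for b in bits)

img = rng.standard_normal(DIM)                   # stand-in for a network output
similar = img + 0.01 * rng.standard_normal(DIM)  # a lightly perturbed "copy"

h1, h2 = perceptual_hash(img), perceptual_hash(similar)
print(sum(a != b for a, b in zip(h1, h2)), "of", BITS, "bits differ")  # few, often 0
```

Nearby vectors land on the same side of most hyperplanes, so "visually similar" inputs share most or all hash bits - which is exactly the property a cryptographic hash is designed to destroy.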
Wow, we haven't even started yet and we are already in the weeds. In order to properly explain what just happened you need to know about a concept called the Pigeonhole Principle: en.wikipedia.org/wiki/Pigeonhol…
Basically if you have 10 pigeons and 9 holes to put them in, then at least one of those holes needs to hold 2 pigeons.

Which is to say, when you have a small hash space and a large input space, there will always be "collisions" (2 different inputs mapping to the same hash)
That is just a fact: all hash functions have collisions, even the cryptographic ones. What makes "cryptographic" hash functions "cryptographic" is that it should be very, very, very hard (simplified: impossible) to find those collisions.
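You can watch the pigeonhole principle force collisions by truncating even a cryptographic hash down to a tiny output space (the truncation here is mine, purely for illustration):

```python
import hashlib
from collections import defaultdict

# 70,000 "pigeons" dropped into 2^16 = 65,536 "holes" (SHA-256 truncated to
# 16 bits - the truncation is the weakness here, not SHA-256 itself).
buckets = defaultdict(list)
for i in range(70_000):
    h = hashlib.sha256(str(i).encode()).digest()[:2]
    buckets[h].append(i)
colliding = sum(1 for inputs in buckets.values() if len(inputs) > 1)
print(f"{colliding} holes hold 2 or more pigeons")  # guaranteed > 0 by counting
```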
OK, we are almost out of this rabbit hole. Remember that perceptual hash functions are generally *not* cryptographic hash functions. In fact, they are designed such that *similar* images result in the *same hash*. They are, in fact, designed to encourage collisions.
Those collisions are a real problem. There are billions of photos taken every single day. And we know very little about the statistical properties of NeuralHash.
There is another concept worth mentioning, the birthday problem (en.wikipedia.org/wiki/Birthday_…).

Even if a single false positive event is very rare, as the number of comparisons increases we have to consider the probability that *any one* of those comparisons is a false positive.
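Back-of-the-envelope, with both numbers invented purely for illustration:

```python
# Even a tiny per-photo false-match rate compounds across billions of photos.
p_single = 1e-9                  # assumed per-photo false-match probability
photos_per_day = 4_000_000_000   # rough order of magnitude, not a real figure
p_any = 1 - (1 - p_single) ** photos_per_day
print(f"P(at least one false match somewhere, per day) ≈ {p_any:.2f}")  # ≈ 0.98
```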
Even if we ignore the construction of malicious images (seemingly innocent images) that are *designed* to cause collisions, and only focus on innocent images we are left with a lot of thorny statistical questions.
To make matters worse, there aren't clear dividing lines around what makes an image "innocent" - context actually matters. Images that are similar in composition can and do come across very differently when their context is known.
So with all that out of the way, let's return to the scenario.

There are billions of iPhones, each with hundreds, maybe thousands, of images and new ones added daily, and a database with some number of perceptual hashes to compare them against.
All the images will be run through the perceptual hashing algorithm and compared with the perceptual hashes in Apple's database (this happens using a neat little protocol that should technically reveal nothing about the hashes to either party unless there is overlap)
(We will assume that the neat little protocol is secure, and I will keep my thoughts about the actual adversarial model of any kind of Private Set Intersection to myself for another time.)
Actually, instead, I will come back to that adversarial model because it is relevant.
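For intuition only, here is a toy of the blinded-comparison idea: both sides blind hashed items with secret exponents, and the double-blinded values match only for items both hold. This is a classic Diffie-Hellman-style PSI sketch, not Apple's actual protocol, and the group parameters are deliberately toy-sized and insecure.

```python
import hashlib
import secrets

P = 2**127 - 1   # a Mersenne prime; deliberately toy-sized, NOT a secure group
G = 5

def to_group(item: str) -> int:
    # Hash an item and lift it into the group as an exponent of G.
    h = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big")
    return pow(G, h % (P - 1), P)

server_set = {"hash_a", "hash_b"}   # the (secret) database of hashes
client_set = {"hash_b", "hash_c"}   # hashes of the photos on the phone

a = secrets.randbelow(P - 3) + 2    # client's secret blinding exponent
b = secrets.randbelow(P - 3) + 2    # server's secret blinding exponent

client_blinded = {pow(to_group(x), a, P) for x in client_set}
double_blinded = {pow(v, b, P) for v in client_blinded}        # server replies
server_blinded = {pow(to_group(y), b, P) for y in server_set}  # server's set, blinded

# g^(h*a*b) == g^(h*b*a), so double-blinded values match exactly on the overlap.
overlap = {pow(v, a, P) for v in server_blinded} & double_blinded
print("items in common:", len(overlap))   # 1 (only "hash_b")
```

Note that even this toy leaks the *size* of the intersection to the comparing party, which is exactly the kind of number we are about to start worrying about.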
Anyway, if we take Apple's word then they have chosen the parameters of this system such that the *actual* probability of a false *flagging* of a given account is 1 in a trillion.

Note: that is *not* (as has been reported in a few places), the probability of a hash collision.
The probability of exceeding the threshold of detected collisions and the probability of a single hash collision are two different properties of the system.

The fact that there *is* a threshold parameter suggests that collisions are more likely, hence the need for a safety margin.
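The distinction matters because thresholding drives the *account*-level probability down even when per-image collisions are fairly common. A sketch with made-up numbers (Apple's real parameters are not public):

```python
from math import comb

def p_account_flagged(n: int, p: float, t: int) -> float:
    # P(at least t accidental matches among n images), assuming each image
    # collides independently with probability p. All numbers illustrative.
    return 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t))

print(p_account_flagged(10_000, 1e-6, 1))   # ≈ 0.01: single hits happen at scale
print(p_account_flagged(10_000, 1e-6, 30))  # ≈ 0 at double precision: the margin
```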
So, there is our first (small) privacy leak. Each account will be associated with a number that relates to how many of their images have perceptual hash collisions with the database - with no way of knowing the actual odds of those collisions occurring.
Now, that may be a small leak when the knowledge is confined to Apple...but...that information is now subject to government search requests and hackers!

And the way people have been talking about this scheme, many will conflate the presence of a collision with something worse.
To summarize where we are so far...

Perceptual hash functions are *not* cryptographic hash functions. They are *designed* to allow collisions.

Apple get to learn how many images you possess that have perceptual hash collisions with images in a database.
We don't know how common these collisions will be, but based on the fact that Apple requires a threshold of collisions to avoid false positives, we can assume they are at least not cryptographically impossible.

That probably means collisions can also be forged.
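To gesture at why: if an attacker can query the hash function (and it ships on every device), even dumb hill-climbing works against a toy hyperplane hash like the one sketched earlier. NeuralHash is a harder target, but nothing about the design rules this class of attack out.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, BITS = 64, 16                        # tiny toy hash so the search is fast
planes = rng.standard_normal((BITS, DIM))
H = lambda v: tuple((planes @ v) > 0)

target_hash = H(rng.standard_normal(DIM))  # hash of an image "in the database"
attack = rng.standard_normal(DIM)          # attacker's unrelated starting point
dist = lambda v: sum(a != b for a, b in zip(H(v), target_hash))

best = dist(attack)
for _ in range(20_000):          # keep any perturbation that doesn't regress
    cand = attack + 0.1 * rng.standard_normal(DIM)
    if (d := dist(cand)) <= best:
        attack, best = cand, d
print("hash bits still differing:", best)  # typically driven to 0 in this toy
```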
So what happens if someone works out how to collide an image with one in the database? Well, plenty of apps save images directly to places where they may work their way into iCloud Photos. It doesn't take a genius to work out an attack path.

That number stored about you goes up.
Regardless of whether the collisions are accidental or malicious the important takeaway from this part of my ridiculously long rant is that this system is *designed* to allow collisions and *compensates* for it by requiring a threshold of collisions before further action.
During that time anyone with access to that database may make assumptions about what your number means. (They shouldn't but humans are humans)

Privacy isn't just about direct knowledge about you, it's about the derived assumptions too.
Let's revisit the adversarial model of private set intersection. The "fun" thing about private set intersection is that, regardless of the implementation, it requires at least one party to be honest about the sets they are comparing.
Even if you 100% trust Apple, I think you could sketch up a dozen or so scenarios in which they are forced to use a different database to compare against and *bam*, all of a sudden it's a decentralized mass surveillance network, every authoritarian's dream.
Because, this thing isn't running in the cloud, it is running on your device and reporting back to Apple. Apple tells it what hashes are "crimelike" and with the help of your phone they learn a number. Your phone is co-opted into the surveillance network.

A couple of people don't like that I said "Apple learns the count" because of the "Safety Vouchers" and I should have been more clear about this:

From my understanding Apple *do* learn the count of matching collisions, but it is *obfuscated* by synthetic vouchers.
Obfuscation is not a cryptographic guarantee, it is a statistical one, and it is highly dependent on what else is going on in the system and the parameters that are chosen.

It is a property that applies over a set of accounts and not (necessarily) to individual accounts.
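A sketch of what "statistical, not cryptographic" means here. The mechanism below is my guess at the shape of the thing, with invented parameters - the point is only that the server's count is noisy, not hidden:

```python
import random

def server_view(true_matches: int, n_images: int, synth_rate: float) -> int:
    # Real matches plus synthetic vouchers emitted at some per-image rate.
    synthetic = sum(random.random() < synth_rate for _ in range(n_images))
    return true_matches + synthetic

random.seed(7)
for true in (0, 2):
    counts = [server_view(true, 5_000, 0.001) for _ in range(5)]
    print(f"{true} real matches -> server sees counts like {counts}")
# The two distributions overlap, but only statistically, and only for
# whatever parameters were actually chosen - which you don't get to audit.
```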
Those who have followed me for a while know that I put absolutely no value on noise-based obfuscation. If you look around you will probably find entire articles and twitter rants just on that topic. But we are too deep in the weeds again.
Even so, these parameters are chosen by Apple and they are proprietary. You don't actually know what the statistical guarantees of the system actually are, and likely never will.
Outside of changing the database itself, we could imagine all kinds of attacks resulting from a well-constructed "technical capability notice", i.e. change parameters, reduce thresholds, or give different accounts different thresholds.

You wouldn't know.
I excluded outright bugs in the introduction, and it is important to clarify that these kinds of attacks are not exploiting unintentional bugs in the implementation; they are exploiting the design of the system itself.
The design of the system places Apple in the very powerful position of being able to learn some information about what you have stored on your phone.

They may be honest and obfuscate that information from themselves until they are absolutely sure it's worth investigating...
...but honesty is the *only* thing you are relying on: in a climate of increasing state authoritarianism, legal attacks on the general right to privacy, cyber attacks, and insider threats, you are banking on Apple, or someone who can compel Apple, never abusing that position.
This thread has gone on too long, and reflects a late Saturday night stream of consciousness and not an edited position paper.

If I phrased something badly that is on me, but I think the main points are clear enough.
For what it is worth, I built a career on breaking "secure" systems and I have broken many in theory and in practice.

I will always fixate on ways a system can (and will) be broken or abused and the harm that it will do. I can only present the risk, and my opinion of it.
Also, none of this is my actual main concern. Like many others, the only reasonable explanation I can think of for Apple choosing to do this client-side is that they want to apply it to more on-device things (like e2ee messaging) in the future.
Anyway as the quote goes "I have only made this letter longer because I have not had the time to make it shorter."

Sorry for all the tweets, and to all the cryptographers and mathematicians who are going to feel that my ad-hoc simplifications of nuanced topics are way too rough.
