Irenes (many)

15 Oct, 140 tweets, 40 min read

Just to keep our tweets organized, this will be the thread topper for our live-tweet of session 2 of #pepr20, when the break is over.

Okay! We're back from break. The talk title went by very quickly, ... now there's a pause, hopefully the speaker will introduce themselves again. #pepr20

According to the schedule, this one should be "Building and Deploying a Privacy Preserving Data Analysis Platform", by Frederick Jansen. #pepr20

Okay, talk's starting for real now. #pepr20

It would be a good idea to have an organizer say a few words after each break to wake everyone up and let us know to start listening. :) #pepr20

Okay, now Lea is doing that :) #pepr20

Ah - this segment was pre-recorded. Impressive blending of live and pre-recorded segments. #pepr20

What is MPC? It's a cryptographic technique to compute a function on private inputs, producing a public output. #pepr20

Multi-party computation is how that acronym expands. Not the same thing as homomorphic encryption, complementary to it. #pepr20

The speaker is from Boston University. They've worked with various orgs (they all look like non-profits from the list) on MPC for social inclusion. #pepr20

MPC was first used in practice in 2008, for a "sugar beet auction". #pepr20

MPC was used for tax fraud detection in 2014, salary equity from 2015-2017, corporate spending on minority-owned businesses presently. #pepr20

The speaker is highlighting how it's saddening that it took so long for this theory to get used in practice. #pepr20

The next slide discusses some work by the Boston Women's Workforce Council, detailed in its 2017 report. #pepr20

There was concern about tying people's personally identifiable information to their income. So the suggestion came up to use MPC. #pepr20

Prospective participants need to be able to make an informed decision about whether to be part of it. Since MPC is a theory-intensive, math-intensive technique, people don't have an intuition for whether it means they're safe... #pepr20

... People want to know, is this just access control, relying on you to keep your promise? Is this just encryption at rest, with all of its dangers? #pepr20

In addition to whether people *want* to participate, people are evaluating whether they *can*. Data might be covered by HIPAA or other laws. #pepr20

So the goal was in part to give people an intuition for how MPC works, so they can make those decisions. #pepr20

Example: Additive secret sharing. You pass data around between multiple parties and add more data from each. Somebody asked: Are you leaking the lower bound? #pepr20

So this was a weakness in the explanation because the example suggested weaker privacy properties than the algorithm actually gives. #pepr20

Next example: Clocks for finite field arithmetic. The math made sense to people, but people had trouble connecting clocks to their data. #pepr20

Third example, the one they still use: Additive secret sharing, but showing an additional piece of randomly generated data called the mask. #pepr20

As part of the example, they show how the masks cancel out. #pepr20
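
A minimal sketch of the masking idea, not the speaker's actual protocol. It assumes two non-colluding parties: one sees only the masked values, the other sees only the masks. The names `mask_value`, `aggregate`, and `MODULUS` are made up for illustration.

```python
import secrets

# Hypothetical modulus; it only needs to be larger than any possible true sum.
MODULUS = 2**61 - 1


def mask_value(value, modulus=MODULUS):
    """Return (masked_value, mask). The masked value alone looks uniformly random."""
    mask = secrets.randbelow(modulus)
    return (value + mask) % modulus, mask


def aggregate(masked_values, masks, modulus=MODULUS):
    """Combine the two subtotals; the masks cancel and only the sum remains."""
    masked_total = sum(masked_values) % modulus  # known to the first party
    mask_total = sum(masks) % modulus            # known to the second party
    return (masked_total - mask_total) % modulus


salaries = [52_000, 61_500, 48_250]
pairs = [mask_value(s) for s in salaries]
total = aggregate([m for m, _ in pairs], [k for _, k in pairs])
```

Because the arithmetic wraps around the modulus (the "clock" from the earlier example), a single masked value reveals nothing about bounds on the input.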

There's a slide with a protocol diagram that they show to people. #pepr20

Other considerations: At some point they have to say, extrapolate from here... the example was for addition but this works for other functions too. #pepr20

Another way they build trust is saying, look at who our partners are, these orgs trust us. #pepr20

They do have other examples they present in special situations. Shamir's secret sharing. Computing the average birthday of a group while lying about your age. #pepr20

Sounds like the birthday one is interactive, with audience participation. That's pretty cool. #pepr20
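
For readers who have not seen Shamir's secret sharing, mentioned above: a rough sketch under our own assumptions (the prime, the function names, and the threshold in the usage lines are all illustrative, nothing here is from the talk). The secret is the constant term of a random polynomial; any `threshold` points recover it, fewer reveal nothing.

```python
import secrets

PRIME = 2**61 - 1  # a Mersenne prime, chosen here only for convenience


def make_shares(secret, threshold, n_shares, prime=PRIME):
    """Evaluate a random degree-(threshold-1) polynomial at x = 1..n_shares."""
    coeffs = [secret % prime] + [secrets.randbelow(prime) for _ in range(threshold - 1)]

    def poly(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % prime
        return acc

    return [(x, poly(x)) for x in range(1, n_shares + 1)]


def reconstruct(shares, prime=PRIME):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % prime
                den = (den * (xi - xj)) % prime
        # pow(den, -1, prime) is the modular inverse (Python 3.8+)
        secret = (secret + yi * num * pow(den, -1, prime)) % prime
    return secret


shares = make_shares(1234, threshold=3, n_shares=5)
recovered = reconstruct(shares[:3])
```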

Yes this is a talk *about* other presentations the speaker gives. How meta. #pepr20

Wording matters: "Upload", "submit", "participate" all give users different expectations about where their data is going. #pepr20

Why build this as a custom solution? They didn't really have a choice, they tried to buy a vendor solution but the vendor tried to sell them a VM for participants to run... this wasn't suitable for inexperienced users. #pepr20

Also the VM approach didn't work well with enterprise security policies such as the ones that participants at hospitals had. #pepr20

The vendor was skeptical of a web-based approach. The researchers understand the limits but decided to build it anyway. It's now at version 3. #pepr20

The system has extensive error-checking, because if users make data entry errors or misunderstand the algorithm, MPC will make it hard to correct later. #pepr20

Quantifying privacy and leakage... Is it enough? This is a philosophical question more than a technical one. #pepr20

Sometimes these decisions are easy. Don't allow queries that reveal a single row. Don't allow repeated queries that have only minor modifications. #pepr20

Sometimes it's harder. How many participants do you need to guarantee privacy? What if participants are anonymous? Which algorithms do you use? #pepr20

The analysts want to know who participated, because it helps them encourage more people to participate... #pepr20

The rule about no single-row results might be too strict if participants are anonymous. But this is salary data, which might be enough to identify people even if it's de-identified. #pepr20

You could use differential privacy here, and they do want to implement that, but it doesn't look likely to work with this particular data, because accuracy is important. #pepr20

With deployment 1, they had a lot of invalid data. With deployment 2 they had semantic errors. They've tried hard to support people's browsers, but as a fallback they also allow people to fax stuff. #pepr20

(Fax! Wow. That's not anonymous...) #pepr20

One participant assumed a wrong upper bound, leaking the lower bound. #pepr20

They had an issue where they needed to recover from a data entry error, which worked (yay!) but leaked the order of magnitude (oops). #pepr20

The speaker acknowledges the efforts of their team, which had many people working on it for years. #pepr20

Question: Why no differential privacy?

Answer: First, in our case it didn't solve the problem of people not wanting to submit their data. Somebody still has to collect it. #pepr20

Second, the makeup of the data had too small a sample size for differential privacy to work with a reasonable accuracy trade-off. #pepr20

(We're impressed by how clean everyone's homes are. Weird doing this by Zoom.) #pepr20

They're testing the limitations of how MPC interacts with HIPAA and GDPR. If you don't really have the row-level data, does GDPR apply? Hasn't been tested in court yet. #pepr20

MPC allows you to avoid owning or controlling any of the "toxic data". All you care about from the computation is a yes/no answer, even though PII goes into it. #pepr20

Next talk! "Audience Engagement API: A Privacy Preserving Data Analytics System at Scale", from LinkedIn. #pepr20

The talk will start with a brief overview of differential privacy, followed by a use-case that LinkedIn has. #pepr20

The team's mission is to maintain the usefulness of data while still protecting users. There is an apparent tension here. #pepr20

There are many reasons to do privacy. Regulations such as GDPR and CCPA. "Members first" (LinkedIn's users are "members", apparently.) Anonymization doesn't work well enough on its own. "Anonymized data isn't" - Cynthia Dwork. #pepr20

87% of the US is uniquely identified by a (date of birth, gender, zip code) tuple. #pepr20

There are lots of potential attacks... you could mitigate each known attack. Or you could take a principled approach, which is what differential privacy is good for. #pepr20

Differential privacy transforms your data before you run your computation on it. The algorithm you run might be a lot of different things, you want to add noise to make it safe no matter what the computation is. #pepr20

Thought experiment: What happens if we eliminate one person's data and re-run the same algorithm? The result will be a shifted distribution... #pepr20

We quantify how much privacy the algorithm preserves based on how much these distributions differ. #pepr20

(As a plural system, we love getting to use "we" in contexts like this lol. Ambiguity is fun!)

Differential privacy also tries to make sure that the input distribution and output distribution are close to one another. You quantify this with the privacy loss parameter. #pepr20
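
For reference, the comparison in that thought experiment is usually written as the inequality below. This is the standard textbook definition, not something copied from the slides: $M$ is the randomized algorithm, $D$ and $D'$ are datasets that differ in one person's data, $S$ is any set of possible outputs, and $\varepsilon$ is the privacy loss parameter. The $\delta$ term is an extra slack that pure $\varepsilon$-differential privacy sets to zero.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,] + \delta
\qquad \text{for all neighboring } D, D' \text{ and all } S .
```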

Two distinct models of differential privacy, the talk will discuss both. You can have locally generated data that then gets sent to a central data center... #pepr20

... depending on where you are, you might introduce privacy in different places. The local model applies differential privacy before you send the data to the data center. Microsoft, Google, Apple all do this in some places. #pepr20
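
A small sketch of the local model, using classic randomized response as a stand-in. This is an assumption for illustration; the talk did not say which local mechanism any of those companies use. Each user flips their own yes/no bit with some probability before it leaves the device, and the collector corrects for the bias in aggregate.

```python
import math
import random


def randomized_response(true_bit, epsilon, rng):
    """Report the true bit with probability e^eps / (e^eps + 1), else the opposite."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return true_bit if rng.random() < p_truth else 1 - true_bit


def estimate_rate(reports, epsilon):
    """Unbiased estimate of the true fraction of 1s from the noisy reports."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(reports) / len(reports)
    # observed = (1 - p) + rate * (2p - 1), solved for rate
    return (observed - (1 - p)) / (2 * p - 1)


rng = random.Random(2020)
true_bits = [1] * 6000 + [0] * 14000  # hypothetical population, 30% "yes"
reports = [randomized_response(b, epsilon=1.0, rng=rng) for b in true_bits]
estimate = estimate_rate(reports, epsilon=1.0)
```

The data center only ever sees `reports`, never `true_bits`, which is the point of applying the noise before sending.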

The other model is the global model. In this scenario, users generate data that is already in a central data center, then they access it. We want to ensure that the data itself is private, since it might be shown outside the company. #pepr20

This was done in the 2020 census; Microsoft and Google have both released tools for it. #pepr20

Now, on to LinkedIn's Audience Engagement API. This is built on top of Pinot for realtime analytics. #pepr20

At a high level, advertisers interact with this API, and they can study the data from their query and then make a new query. This means that differencing attacks are a concern. #pepr20
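
To make "differencing attack" concrete, here is a toy example with made-up data and a made-up `count_viewers` helper (not LinkedIn's API). Two exact counts that differ only by whether one member is included reveal that member's behavior.

```python
# Hypothetical engagement events: did each member view a given article?
events = [
    {"member": "m1", "viewed": True},
    {"member": "m2", "viewed": False},
    {"member": "m3", "viewed": True},
    {"member": "m4", "viewed": True},
]


def count_viewers(rows, include):
    """Exact (non-private) count of viewers among rows matching the filter."""
    return sum(1 for r in rows if r["viewed"] and include(r))


everyone = count_viewers(events, lambda r: True)
all_but_m3 = count_viewers(events, lambda r: r["member"] != "m3")

# The difference of two "aggregate" answers is one person's private bit.
m3_viewed = everyone - all_but_m3
```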

Since these advertising events are things like CEOs viewing particular news articles, privacy is important... #pepr20

In general their API allows top-k queries. What are the top articles in a given region? Who are the top 10 data scientists? #pepr20

Differential privacy deals with the concept of "sensitivity" - how much the result varies based on one user's data. #pepr20

You also want to limit overall API usage so people can't just reconstruct the whole dataset. #pepr20

Existing systems that LinkedIn has: A top-k-prime solver, which generates a histogram of possible outcomes... #pepr20

They don't want to modify the underlying data, because this would require too much computation per-query, it would be expensive. So they generate these aggregate statistics and apply differential privacy *to the aggregates*. #pepr20

The top-k-prime solver runs inside Pinot, which is an open-source tool that LinkedIn uses heavily. #pepr20

Okay, now how does data flow from Pinot to eventually reach the application and the marketing partner? #pepr20

The differential privacy algorithms apply in between those two steps. #pepr20

Sensitivity, how much one user can modify the results. Let's pick a particular query: What are the top 10 countries for a particular skill set? Ask: What happens when we take one user out. One user can be in at most one country (if you say so...) #pepr20

So we can use a Laplace mechanism. #pepr20
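
A sketch of the Laplace mechanism for that kind of per-country count, assuming sensitivity 1 because removing one user changes exactly one country's count by one. The function names and the sample counts are ours. Plain floating-point sampling like this has known weaknesses, so treat it as an illustration rather than something to deploy.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two iid exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_counts(counts, epsilon, sensitivity=1.0, rng=None):
    """Add independent Laplace(sensitivity / epsilon) noise to every count."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}


# Hypothetical member counts per country for one skill set.
by_country = {"US": 1200, "IN": 950, "DE": 310}
released = noisy_counts(by_country, epsilon=0.5, rng=random.Random(7))
top = sorted(released, key=released.get, reverse=True)
```

Smaller epsilon means a larger noise scale, so close counts are more likely to swap places in the released ranking.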

For a slightly different query, top-10 skills in the Bay Area. Here, one user's data can modify multiple skills. #pepr20

That's because you can add as many skills as you want to your profile. #pepr20

For this, they use the exponential mechanism [MT07]. #pepr20

As long as they only release the elements, not the counts, this protects stuff. #pepr20
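
A sketch of the exponential mechanism applied to a top-k query by picking one element at a time. The peeling loop and the even split of epsilon across picks are our simplifications, not necessarily what LinkedIn runs; the skill counts are invented. Only the chosen elements are returned, never their counts.

```python
import math
import random


def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Pick one key with probability proportional to exp(eps * score / (2 * sens))."""
    best = max(scores.values())
    items = list(scores)
    # Subtracting the max score keeps exp() from overflowing; ratios are unchanged.
    weights = [math.exp(epsilon * (scores[i] - best) / (2 * sensitivity)) for i in items]
    return rng.choices(items, weights=weights, k=1)[0]


def private_top_k(scores, k, epsilon, sensitivity=1.0, rng=None):
    """Select k distinct keys, spending epsilon / k on each pick."""
    rng = rng or random.Random()
    remaining = dict(scores)
    chosen = []
    for _ in range(min(k, len(remaining))):
        pick = exponential_mechanism(remaining, epsilon / k, sensitivity, rng)
        chosen.append(pick)
        del remaining[pick]
    return chosen


# Hypothetical skill counts for one region.
skills = {"python": 900, "sql": 800, "go": 100, "cobol": 5}
top_two = private_top_k(skills, k=2, epsilon=1.0, rng=random.Random(3))
```

Note that this version needs the full list of candidate skills up front, which is the known-domain assumption.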

Known algorithms: delta-restricted sensitivity, or unrestricted sensitivity. #pepr20

One thing to notice about these algorithms: They require you to know the full data domain. For example, you need to know in advance what all the possible countries are, even for the ones with no users... because you have to be able to evaluate what one new user does. #pepr20

This set might be way too large, or not even known. So the discovery portion of the algorithms needs more work to study the domain. #pepr20

In the unknown setting, the mere presence of an element already tells you that at least one person engaged with an article (for example), which can be problematic. #pepr20

So they wrote a paper that was accepted last year on the unknown setting. #pepr20

So, back to the overall flow diagram. For marketing partners who issue a lot of queries, the system has a budget manager which enforces a privacy budget. #pepr20

They computed bounds on the privacy loss earlier; the budget manager uses those same bounds. #pepr20
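
A guess at what the core of a budget manager could look like, assuming simple additive accounting where each query's epsilon is charged against a per-partner limit. The class and method names are hypothetical, and the real service presumably uses tighter bounds than plain addition.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetManager:
    """Tracks epsilon spent per partner against a fixed limit."""
    epsilon_limit: float
    spent: dict = field(default_factory=dict)

    def remaining(self, partner):
        """Budget the partner still has available."""
        return self.epsilon_limit - self.spent.get(partner, 0.0)

    def try_charge(self, partner, epsilon):
        """Charge a query's epsilon; refuse the query if it would exceed the limit."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining(partner):
            return False
        self.spent[partner] = self.spent.get(partner, 0.0) + epsilon
        return True


manager = BudgetManager(epsilon_limit=1.0)
first = manager.try_charge("partner-a", 0.5)
second = manager.try_charge("partner-a", 0.5)
third = manager.try_charge("partner-a", 0.1)  # budget is exhausted by now
```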

Question: Epsilon... How did you choose it?

Answer: It's a tuning knob. It lets us smoothly interpolate between privacy and utility. Then you can measure your utility metric, what makes the product actually usable... #pepr20

Figure out where the diminishing returns are, set your budget before that point. #pepr20

In these unknown domain algorithms, there's also a delta, which loosely means how much of a threshold to have. #pepr20

In the budget management service, they enforce rules about how many parameters to return and how many queries can be run. #pepr20

Each use-case requires making a new decision about epsilon. You can't fully automate privacy, you need a human in the loop. #pepr20

Question: Is there any feedback mechanism about privacy budget short of "you've run out now"?

Answer: They provide an API that exposes that, but the UI is up to the marketing partner to implement. #pepr20

Question: How did you verify your implementation?

Answer: Testing in the presence of randomness is hard, so they measure whether it's the right distribution. Divide-by-zero "keeps me up at night". #pepr20

They have unit tests. Some of them are described in one of the papers they referenced, will post the link in Slack. #pepr20

Lots of conversation starting in the channel #pepr20

Last talk before the break coming up. "We have been promised more epsilon!", Lea says. #pepr20

"Improving usability of differential privacy at scale", by two Googlers. The slide deck uses a Material Design slide template... as a Xoogler this feels comfy to us, lol. #pepr20

A simple data set. Five rows, columns related to customer IDs, movies watched, movie ratings. #pepr20

The example query, in SQL, selects date, rating, and count of movies, grouping by date and rating. This may look private, but how do you make sure? #pepr20

Suppose you transform the query. Not just SELECT but SELECT WITH ANONYMIZATION OPTIONS (epsilon = ..., delta = ...) #pepr20

You can do that, but now you have a usability issue. So they want to quantify privacy and utility. #pepr20

Define privacy and utility metrics, provide infrastructure to compute them at scale, allow users to get data self-serve. #pepr20

There's going to be a demo! #pepr20

It's a page titled "usable differential privacy", with a lot of input parameters. Anonymization parameters. Infrastructure: Choose Flume, SQL, or Custom. (Flume is a Google internal tool, the speaker says.) #pepr20

For this demo they will use SQL. The next section has data utility parameters - well, just sensitivity. #pepr20

The last set of parameters are filters to slice down and only look at the data in a certain range. #pepr20

The page also has metadata about the input dataset (still the movie ratings example). It says how the data is being partitioned, which in this case is by date. It also says how we're aggregating the value - COUNT. #pepr20

Finally, there's a section with anonymization statistics. Threshold (of number of users in each bucket). #pepr20

Another stat it shows is noise standard deviation. Then there are a few about how the partitioning works - how many partitions there are, how many of them are sufficiently anonymous to view in the output. #pepr20

Finally, the stats show the raw sum of all the values, in order to give you an intuition for overall accuracy loss. #pepr20

It then computes the same sum on the anonymized data, and shows you how much loss there is - 70% in this case. #pepr20
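
Our rough reconstruction of how those statistics could relate to each other, not Google's implementation: add noise to each partition's count, drop partitions whose noisy count falls below the threshold, then compare the released sum with the raw sum. Function names, the Laplace noise, and the sample partitions are all assumptions.

```python
import random


def anonymize_partitions(counts, epsilon, threshold, rng):
    """Noise each partition's count, then keep only those at or above the threshold."""
    scale = 1.0 / epsilon
    released = {}
    for partition, count in counts.items():
        # Laplace noise as the difference of two exponentials
        noisy = count + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        if noisy >= threshold:
            released[partition] = noisy
    return released


def sum_loss(raw_counts, released):
    """Fraction of the raw total that is missing from the released total."""
    raw_sum = sum(raw_counts.values())
    return 1 - sum(released.values()) / raw_sum


# Hypothetical long-tail data: one busy date and many sparse ones.
raw = {"2020-10-01": 400, "2020-10-02": 6, "2020-10-03": 4, "2020-10-04": 3}
released = anonymize_partitions(raw, epsilon=1.0, threshold=20, rng=random.Random(5))
loss = sum_loss(raw, released)
kept_partitions = len(released)
```

With long-tail data most of the loss comes from the dropped small partitions rather than from the noise itself.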

Now the output. This histogram shows the distribution of values. Buckets of value ranges on the x axis. The y axis is how many partitions have values within that range. The histogram shows raw data and also anonymized data, as stacked bars. #pepr20

This is a log-scale graph. That's because it's a long-tail distribution and a raw view wouldn't be informative. #pepr20

Now there's a graph called "change distributions". This is a histogram where x is relative change, and y is the number of partitions experiencing that amount of change. This one looks like a normal distribution (to our eye). #pepr20

Last plot: "omission distributions". Histogram where x is "omission", y is number of partitions. Not clear what x means... #pepr20

Okay, so a user of this tool would use it to study their data and explore what the anonymization parameters do to their utility. It lets you get the statistics in more-or-less real time. #pepr20

There will now be a worked example. #pepr20

We're still on the movie ratings. #pepr20

Try putting an epsilon of 1, delta .00001. We can see from the anon value sum that this is 70% loss. #pepr20

Try changing a thing (missed what), we can see that the accuracy loss is still bad. Okay, try increasing the sensitivity bound... #pepr20

Now we're at only 30% drop (70% accuracy). However, this increases the noise standard deviation. The distributions look better, we get to keep more anonymized partitions. #pepr20

On the change distribution, it's still a normal distribution but has shifted to the right... which is less change, apparently. #pepr20

One more tweak. Again, missed what parameter was changed and how. Now the change distribution is way up on the right edge of the graph, 10%, 5%? (This is going too fast to catch all the details. Sorry!) #pepr20

The benefit to "me" as a product team is now there's no need to run the pipeline many times and do the metrics manually. So this tool is for product teams... internal to Google? Or marketing partners? Sounds like the former. #pepr20

(It would make a lot more sense for this to be an internal tool; it's hard to see how these queries could be run by an external user without, themselves, being a privacy risk. So we're pretty sure that's the point of it.) #pepr20

Now there's a system architecture diagram. Source data and a user-created script are fed into the SQL engine and produce a "sketch". The sketch is then fed into the SQL engine again and fed to the UI, where another user interacts with it. #pepr20

Median query latency is *seconds*! (boldface on the slide). End-to-end analysis is *minutes*. *Intuitive*. Yay! Desirable qualities. #pepr20

In the future, they hope to open-source this work to let the community add to it. (Okay, but Google is not good at merging external patches... - ed) #pepr20

They hope to adapt this also to local differential privacy, and to add support for more functions beyond count. #pepr20

If you want to see this as open-source, express enthusiasm for it in Slack! #pepr20

Question: Do you worry about privacy loss from your comparison charts?

Answer: The use-case is for people who already have access to the underlying data, and want to share it with a broader audience. (So, what we guessed above.) #pepr20

Question: The demo showed some specific metrics with the anonymization stats. How does the tool user know what numbers are good, do you provide guidance?

Answer: Yes. Documentation, training, and consulting with experts. #pepr20

Question: When the original data is de-identified, does it make sense to keep around a pseudonymous ID to implement differential privacy, rather than just de-identifying entirely?

Answer: It depends on whether your purpose is internal or external. #pepr20

It's 11:00. Now there will be break-out into Zoom networking sessions. Lea says it's disappointing we can't do this over food like at a physical conference. "We are a large percentage of all the privacy engineers in the world at this conference", so... #pepr20

... go to the "birds of a feather" sessions. We will be going to the "women and enbies" one! Hope to see some of you there. #pepr20

We'll continue over on a separate thread, to keep things organized. #pepr20

https://twitter.com/ireneista/status/1316808561565986817
