Just to keep our tweets organized, this will be the thread topper for our live-tweet of session 2 of #pepr20, when the break is over.
Okay! We're back from break. The talk title went by very quickly, ... now there's a pause, hopefully the speaker will introduce themselves again. #pepr20
According to the schedule, this one should be "Building and Deploying a Privacy Preserving Data Analysis Platform", by Frederick Jansen. #pepr20
Ah - this segment was pre-recorded. Impressive blending of live and pre-recorded segments. #pepr20
What is MPC? It's a cryptographic technique to compute a function on private inputs, producing a public output. #pepr20
Multi-party computation is how that acronym expands. Not the same thing as homomorphic encryption, complementary to it. #pepr20
The speaker is from Boston University. They've worked with various orgs (they all look like non-profits from the list) on MPC for social inclusion. #pepr20
MPC theory dates back to the 1980s; its first large-scale real-world deployment was the Danish "sugar beet auction" in 2008. #pepr20
MPC was used for tax fraud detection in 2014, salary equity from 2015-2017, corporate spending on minority-owned businesses presently. #pepr20
The speaker is highlighting how it's saddening that it took so long for this theory to get used in practice. #pepr20
The next slide discusses some work by the Boston Women's Workforce Council, detailed in its 2017 report. #pepr20
There was concern about tying people's personally identifiable information to their income. So the suggestion came up to use MPC. #pepr20
Prospective participants need to be able to make an informed decision about whether to be part of it. Since MPC is a theory-intensive, math-intensive technique, people don't have an intuition for whether it means they're safe... #pepr20
... People want to know, is this just access control, relying on you to keep your promise? Is this just encryption at rest, with all of its dangers? #pepr20
In addition to whether people *want* to participate, people are evaluating whether they *can*. Data might be covered by HIPAA or other laws. #pepr20
So the goal was in part to give people an intuition for how MPC works, so they can make those decisions. #pepr20
Example: Additive secret sharing. The explanation had data passed around between multiple parties, with each party adding its own value on top. Somebody asked: are you leaking a lower bound? #pepr20
So this was a weakness in the explanation because the example suggested weaker privacy properties than the algorithm actually gives. #pepr20
Next example: Clocks for finite field arithmetic. The math made sense to people, but people had trouble connecting clocks to their data. #pepr20
Third example, the one they still use: Additive secret sharing, but showing an additional piece of randomly generated data called the mask. #pepr20
As part of the example, they show how the masks cancel out. #pepr20
There's a slide with a protocol diagram that they show to people. #pepr20
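(A minimal sketch of that masking idea — our own illustration, not anything the BU team showed; assumes two non-colluding servers and arithmetic mod a prime P.)

```python
# Illustrative sketch of additive secret sharing with masks -- not the BU team's code.
import secrets

P = 2**31 - 1  # prime modulus; all arithmetic is done mod P

def share(value):
    """Split a value into a random mask and a masked value."""
    mask = secrets.randbelow(P)
    return (value + mask) % P, mask

# Each participant sends the masked value to server A and the mask to server B.
salaries = [82_000, 97_500, 61_000]
shares = [share(s) for s in salaries]

sum_masked = sum(masked for masked, _ in shares) % P  # all server A learns
sum_masks = sum(mask for _, mask in shares) % P       # all server B learns

# Only when the two servers combine their sums do the masks cancel out,
# revealing the aggregate total and nothing about any individual input.
total = (sum_masked - sum_masks) % P
assert total == sum(salaries)
```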
Other considerations: At some point they have to say, extrapolate from here... the example was for addition but this works for other functions too. #pepr20
Another way they build trust is saying, look at who our partners are, these orgs trust us. #pepr20
They do have other examples they present in special situations. Shamir's secret sharing. Computing the average birthday of a group while lying about your age. #pepr20
Sounds like the birthday one is interactive, with audience participation. That's pretty cool. #pepr20
Yes this is a talk *about* other presentations the speaker gives. How meta. #pepr20
Wording matters: "Upload", "submit", "participate" all give users different expectations about where their data is going. #pepr20
Why build this as a custom solution? They didn't really have a choice: they tried to buy a vendor solution, but the vendor wanted participants to run a VM... which wasn't suitable for inexperienced users. #pepr20
Also the VM approach didn't work well with enterprise security policies such as the ones that participants at hospitals had. #pepr20
The vendor was skeptical of a web-based approach. The researchers understand the limits but decided to build it anyway. It's now at version 3. #pepr20
The system has extensive error-checking, because if users make data entry errors or misunderstand the algorithm, MPC will make it hard to correct later. #pepr20
Quantifying privacy and leakage... Is it enough? This is a philosophical question more than a technical one. #pepr20
Sometimes these decisions are easy. Don't allow queries that reveal a single row. Don't allow repeated queries that have only minor modifications. #pepr20
Sometimes it's harder. How many participants do you need to guarantee privacy? What if participants are anonymous? Which algorithms do you use? #pepr20
The analysts want to know who participated, because it helps them encourage more people to participate... #pepr20
The rule about no single-row results might be too strict if participants are anonymous. But this is salary data, which might be enough to identify people even if it's de-identified. #pepr20
You could use differential privacy here, and they do want to implement that, but it doesn't look likely to work with this particular data, because accuracy is important. #pepr20
With deployment 1, they had a lot of invalid data. With deployment 2 they had semantic errors. They've tried hard to support people's browsers, but as a fallback they also allow people to fax stuff. #pepr20
One participant assumed a wrong upper bound, leaking the lower bound. #pepr20
They had an issue where they needed to recover from a data entry error, which worked (yay!) but leaked the order of magnitude (oops). #pepr20
The speaker acknowledges the efforts of their team, which had many people working on it for years. #pepr20
Question: Why no differential privacy?
Answer: First, in our case it didn't solve the problem of people not wanting to submit their data. Somebody still has to collect it. #pepr20
Second, the makeup of the data had too small a sample size for differential privacy to work with a reasonable accuracy trade-off. #pepr20
(We're impressed by how clean everyone's homes are. Weird doing this by Zoom.) #pepr20
They're testing the limitations of how MPC interacts with HIPAA and GDPR. If you don't really have the row-level data, does GDPR apply? Hasn't been tested in court yet. #pepr20
MPC allows you to avoid owning or controlling any of the "toxic data". All you care about from the computation is a yes/no answer, even though PII goes into it. #pepr20
Next talk! "Audience Engagement API: A Privacy Preserving Data Analytics System at Scale", from LinkedIn. #pepr20
The talk will start with a brief overview of differential privacy, followed by a use-case that LinkedIn has. #pepr20
The team's mission is to maintain the usefulness of data while still protecting users. There is an apparent tension here. #pepr20
There are many reasons to do privacy. Regulations such as GDPR and CCPA. "Members first" (LinkedIn's users are "members", apparently.) Anonymization doesn't work well enough on its own. "Anonymized data isn't" - Cynthia Dwork. #pepr20
87% of the US population is uniquely identified by the (date of birth, gender, ZIP code) tuple. #pepr20
There are lots of potential attacks... you could mitigate each known attack. Or you could take a principled approach, which is what differential privacy is good for. #pepr20
Differential privacy transforms your data before you run your computation on it. The algorithm you run might be a lot of different things, you want to add noise to make it safe no matter what the computation is. #pepr20
Thought experiment: What happens if we eliminate one person's data and re-run the same algorithm? The result will be a shifted distribution... #pepr20
We quantify how much privacy the algorithm preserves based on how much these distributions differ. #pepr20
(As a plural system, we love getting to use "we" in contexts like this lol. Ambiguity is fun!)
In other words, differential privacy guarantees that the output distributions, with and without any one person's data, stay close to one another. How close is quantified by the privacy loss parameter. #pepr20
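(For the record, the standard (ε, δ) formulation of that guarantee — our gloss, not a slide from the talk: for any two datasets D and D′ differing in one person's data, and any set of outputs S,)

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
```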
Two distinct models of differential privacy, the talk will discuss both. You can have locally generated data that then gets sent to a central data center... #pepr20
... depending on where you are, you might introduce privacy in different places. The local model applies differential privacy before you send the data to the data center. Microsoft, Google, Apple all do this in some places. #pepr20
The other model is the global model. In this scenario, users generate data that is already in a central data center, then they access it. We want to ensure that the data itself is private, since it might be shown outside the company. #pepr20
This was done in the 2020 census; Microsoft and Google have both released tools for it. #pepr20
Now, on to LinkedIn's Audience Engagement API. This is built on top of Pinot for realtime analytics. #pepr20
At a high level, advertisers interact with this API, and they can study the data from their query and then make a new query. This means that differencing attacks are a concern. #pepr20
Since these advertising events are things like CEOs viewing particular news articles, privacy is important... #pepr20
In general their API allows top-k queries. What are the top articles in a given region? Who are the top 10 data scientists? #pepr20
Differential privacy deals with the concept of "sensitivity" - how much the result varies based on one user's data. #pepr20
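(The usual formal version of sensitivity, as we understand it — not necessarily the speaker's notation: for a query f and neighboring datasets D, D′ differing in one user's data,)

```latex
\Delta f \;=\; \max_{D \sim D'} \lVert f(D) - f(D') \rVert_1
```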
You also want to limit overall API usage so people can't just reconstruct the whole dataset. #pepr20
Existing systems that LinkedIn has: A top-k-prime solver, which generates a histogram of possible outcomes... #pepr20
They don't want to modify the underlying data, because this would require too much computation per-query, it would be expensive. So they generate these aggregate statistics and apply differential privacy *to the aggregates*. #pepr20
The top-k-prime solver runs inside Pinot, which is an open-source tool that LinkedIn uses heavily. #pepr20
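(A rough sketch of what "apply differential privacy to the aggregates" can look like — our illustration of the general pattern, not LinkedIn's implementation: add calibrated noise to the precomputed counts, then take the top k.)

```python
# Illustrative only -- not LinkedIn's code. Assumes per-element counts
# (the aggregates) have already been computed upstream, e.g. in Pinot.
import random

def noisy_top_k(counts, k, epsilon, sensitivity=1.0):
    """Add Laplace noise (scale = sensitivity/epsilon) to each count, return the top-k keys."""
    lam = epsilon / sensitivity
    noisy = {
        # difference of two Exponential(lam) draws is Laplace(1/lam)-distributed
        key: count + random.expovariate(lam) - random.expovariate(lam)
        for key, count in counts.items()
    }
    return sorted(noisy, key=noisy.get, reverse=True)[:k]

# e.g. "top articles in a given region", from already-aggregated counts
counts = {"article_a": 1200, "article_b": 950, "article_c": 40, "article_d": 35}
print(noisy_top_k(counts, k=2, epsilon=0.5))
```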
Okay, now how does data flow from Pinot to eventually reach the application and the marketing partner? #pepr20
The differential privacy algorithms apply in between those two steps. #pepr20
Sensitivity, how much one user can modify the results. Let's pick a particular query: What are the top 10 countries for a particular skill set? Ask: What happens when we take one user out. One user can be in at most one country (if you say so...) #pepr20
For a slightly different query, top-10 skills in the Bay Area. Here, one user's data can modify multiple skills. #pepr20
That's because you can add as many skills as you want to your profile. #pepr20
For this, they use the exponential mechanism [MT07]. #pepr20
As long as they only release the elements, not the counts, this protects stuff. #pepr20
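(A toy sketch of the exponential mechanism — our illustration of the general technique from [MT07], not LinkedIn's code: sample an element with probability proportional to exp(ε·count / (2·sensitivity)), and release only the element, not its count. For top-k you would repeat this without replacement.)

```python
# Toy exponential-mechanism sketch -- illustrative, not LinkedIn's implementation.
import math
import random

def exponential_mechanism(scores, epsilon, sensitivity):
    """Pick one element with probability proportional to exp(eps * score / (2 * sens))."""
    keys = list(scores)
    top = max(scores.values())  # shift scores so the exponentials don't overflow
    weights = [math.exp(epsilon * (scores[k] - top) / (2 * sensitivity)) for k in keys]
    return random.choices(keys, weights=weights, k=1)[0]

# e.g. "top skills in the Bay Area": one user can raise several skills' counts,
# so the sensitivity here is larger than one-user-changes-one-count.
skill_counts = {"python": 5000, "sql": 4800, "golang": 1200}
print(exponential_mechanism(skill_counts, epsilon=0.5, sensitivity=10))
```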
Known algorithms: delta-restricted sensitivity, or unrestricted sensitivity. #pepr20
One thing to notice about these algorithms: They require you to know the full data domain. For example, you need to know in advance what all the possible countries are, even for the ones with no users... because you have to be able to evaluate what one new user does. #pepr20
This set might be way too large, or not even known. So the discovery portion of the algorithms needs more work to study the domain. #pepr20
In the unknown setting, the mere presence of an element already tells you that at least one person engaged with an article (for example), which can be problematic. #pepr20
So they wrote a paper that was accepted last year on the unknown setting. #pepr20
So, back to the overall flow diagram. For marketing partners who issue a lot of queries, the system has a budget manager which enforces a privacy budget. #pepr20
They computed privacy loss bounds earlier; those bounds also feed into the budget manager. #pepr20
Question: Epsilon... How did you choose it?
Answer: It's a tuning knob. It lets us smoothly interpolate between privacy and utility. Then you can measure your utility metric, what makes the product actually usable... #pepr20
Figure out where the diminishing returns are, set your budget before that point. #pepr20
In these unknown domain algorithms, there's also a delta, which loosely means how much of a threshold to have. #pepr20
In the budget management service, they enforce rules about how many parameters to return and how many queries can be run. #pepr20
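(A minimal sketch of what a per-partner privacy budget manager might look like — hypothetical names and structure, not LinkedIn's actual service, which also enforces limits on query counts and returned parameters.)

```python
# Hypothetical privacy budget manager -- not LinkedIn's service.
class BudgetManager:
    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = {}  # partner_id -> epsilon consumed so far

    def try_spend(self, partner_id, epsilon_cost):
        """Allow a query only if the partner still has budget left for it."""
        used = self.spent.get(partner_id, 0.0)
        if used + epsilon_cost > self.total_epsilon:
            return False  # budget exhausted: refuse to run the query
        self.spent[partner_id] = used + epsilon_cost
        return True

budget = BudgetManager(total_epsilon=3.0)
for _ in range(10):
    if not budget.try_spend("partner-42", epsilon_cost=0.5):
        print("privacy budget exhausted")
        break
```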
Each use-case requires making a new decision about epsilon. You can't fully automate privacy, you need a human in the loop. #pepr20
Question: Is there any feedback mechanism about privacy budget short of "you've run out now"?
Answer: They provide an API that exposes that, but the UI is up to the marketing partner to implement. #pepr20
Question: How did you verify your implementation?
Answer: Testing in the presence of randomness is hard, so they measure whether it's the right distribution. Divide-by-zero "keeps me up at night". #pepr20
They have unit tests. Some of them are described in one of the papers they referenced, will post the link in Slack. #pepr20
Lots of conversation starting in the channel #pepr20
Last talk before the break coming up. "We have been promised more epsilon!", Lea says. #pepr20
"Improving usability of differential privacy at scale", by two Googlers. The slide deck uses a Material Design slide template... as a Xoogler this feels comfy to us, lol. #pepr20
A simple data set. Five rows, columns related to customer IDs, movies watched, movie ratings. #pepr20
The example query, in SQL, selects date, rating, and count of movies, grouping by date and rating. This may look private, but how do you make sure? #pepr20
Suppose you transform the query. Not just SELECT but SELECT WITH ANONYMIZATION OPTIONS (epsilon = ..., delta = ...) #pepr20
You can do that, but now you have a usability issue. So they want to quantify privacy and utility. #pepr20
Define privacy and utility metrics, provide infrastructure to compute them at scale, and let users get the data self-serve. #pepr20
It's a page titled "usable differential privacy", with a lot of input parameters. Anonymization parameters. Infrastructure: Choose Flume, SQL, or Custom. (Flume is a Google internal tool, the speaker says.) #pepr20
For this demo they will use SQL. The next section has data utility parameters - well, just sensitivity. #pepr20
The last set of parameters are filters to slice down and only look at the data in a certain range. #pepr20
The page also has metadata about the input dataset (still the movie ratings example). It says how the data is being partitioned, which in this case is by date. It also says how we're aggregating the value - COUNT. #pepr20
Finally, there's a section with anonymization statistics. Threshold (of number of users in each bucket). #pepr20
Another stat it shows is noise standard deviation. Then there's a few about how the partitioning works - how many partitions there are, how many of them are sufficiently anonymous to view in the output. #pepr20
Finally, the stats show the raw sum of all the values, in order to give you an intuition for overall accuracy loss. #pepr20
It then computes the same sum on the anonymized data, and shows you how much loss there is - 70% in this case. #pepr20
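(Spelling out the arithmetic behind that stat, with made-up numbers:)

```python
# Made-up numbers, just to illustrate the accuracy-loss statistic from the demo.
raw_sum = 1_000_000       # sum of the values over the raw data
anonymized_sum = 300_000  # the same sum computed over the anonymized output
loss = 1 - anonymized_sum / raw_sum
print(f"accuracy loss: {loss:.0%}")  # -> 70%
```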
Now the output. This histogram shows the distribution of values. Buckets of value ranges on the x axis. The y axis is how many partitions have values within that range. The histogram shows raw data and also anonymized data, as stacked bars. #pepr20
This is a log-scale graph. That's because it's a long-tail distribution and a raw view wouldn't be informative. #pepr20
Now there's a graph called "change distributions". This is a histogram where x is relative change and y is the number of partitions experiencing that amount of change. This one looks like a normal distribution (to our eye). #pepr20
Last plot: "omission distributions". Histogram with x is "omission", y is number of partitions. Not clear what x means... #pepr20
Okay, so a user of this tool would use it to study their data and explore what the anonymization parameters do to their utility. It lets you get the statistics in more-or-less real time. #pepr20
Try putting an epsilon of 1, delta .00001. We can see from the anon value sum that this is 70% loss. #pepr20
Try changing a thing (missed what), we can see that the accuracy loss is still bad. Okay, try increasing the sensitivity bound... #pepr20
Now we're at only 30% drop (70% accuracy). However, this increases the noise standard deviation. The distributions look better, we get to keep more anonymized partitions. #pepr20
On the change distribution, it's still a normal distribution but has shifted to the right... which is less change, apparently. #pepr20
One more tweak. Again, missed what parameter was changed and how. Now the change distribution is way up on the right edge of the graph, 10%, 5%? (This is going too fast to catch all the details. Sorry!) #pepr20
The benefit to "me" as a product team is now there's no need to run the pipeline many times and do the metrics manually. So this tool is for product teams... internal to Google? Or marketing partners? Sounds like the former. #pepr20
(It would make a lot more sense for this to be an internal tool; it's hard to see how these queries could be run by an external user without, themselves, being a privacy risk. So we're pretty sure that's the point of it.) #pepr20
Now there's a system architecture diagram. Source data and a user-created script are fed into the SQL engine and produce a "sketch". The sketch is then fed into the SQL engine again and fed to the UI, where another user interacts with it. #pepr20
Median query latency is *seconds*! (boldface on the slide). End-to-end analysis is *minutes*. *Intuitive*. Yay! Desirable qualities. #pepr20
In the future, they hope to open-source this work to let the community add to it. (Okay, but Google is not good at merging external patches... - ed) #pepr20
They hope to adapt this also to local differential privacy, and to add support for more functions beyond count. #pepr20
If you want to see this as open-source, express enthusiasm for it in Slack! #pepr20
Question: Do you worry about privacy loss from your comparison charts?
Answer: The use-case is for people who already have access to the underlying data, and want to share it with a broader audience. (So, what we guessed above.) #pepr20
Question: The demo showed some specific metrics with the anonymization stats. How does the tool user know what numbers are good, do you provide guidance?
Answer: Yes. Documentation, training, and consulting with experts. #pepr20
Question: When the original data is de-identified, does it make sense to keep around a pseudonymous ID to implement differential privacy, rather than just de-identifying entirely?
Answer: It depends on whether your purpose is internal or external. #pepr20
It's 11:00. Now there will be break-out into Zoom networking sessions. Lea says it's disappointing we can't do this over food like at a physical conference. "We are a large percentage of all the privacy engineers in the world at this conference", so... #pepr20
... go to the "birds of a feather" sessions. We will be going to the "women and enbies" one! Hope to see some of you there. #pepr20
We'll continue over on a separate thread, to keep things organized. #pepr20
We're live-tweeting PEPR20! After the break, this will be the thread head for the fourth block of talks ("session"), which will be the last one for the first day. #pepr20
"Product Privacy Journey: Towards a Product Centric Privacy Engineering Framework", by Igor Trindale Oliveira, is now starting.
Why a product-centric approach? Other possible focuses would be compliance, design, engineering, users... #pepr20
Okay! This will be the thread head for the third session of #pepr20, which will re-convene after the birds-of-a-feather breakout sessions, in about ten minutes.
We have a secret motive for tweeting this, it helps us pay attention. Our brain doesn't cling to things unless we're using *all* of our brain.
Okay, the theme of this next block of talks ("session") is design. So now we're on slack channel 3. #pepr20
Okay! We will be live-tweeting #PEPR20, the USENIX conference on Privacy Engineering Practice and Respect. Feel free to mute that hashtag if you don't want to drown in tweets.
Just so people know, if you're a trans person working any sort of professional job and you're interested in advocating to your company about healthcare, we're happy to chat privately about what to ask for and how.
We were heavily involved in efforts around that during our time at Google, and there's a lot of transferable knowledge that applies to any company.
Belatedly, we realized that because we DO have that highly detailed knowledge on this topic, we should directly talk about Discord's thing.
The thing you have to understand about America is that anyone who grew up there, grew up being fed propaganda that most of us took at face value. That sounds like an extreme position, but it's the literal truth.
The US mythologizes its own impact on the world, focusing only on the positives and glossing over the negatives.
The US mythologizes its own *place* in the world, declaring itself a leader in all sorts of things - public health; infrastructure; democracy - where it is nothing of the sort, and has not been for a long, long time.
Here's a thought we've shared privately, but it's taken time for us to get a formulation of it that doesn't ramble too much.
When people talk about working for change "within the system" vs. "outside the system", what system do they mean?
Answer: It depends! A thread.
People without a science background, or even people with that background who don't also pay attention to the humanities, may not realize that the word "system", in its modern sense, had to be invented. It wasn't a single moment, either; the idea was refined over many years.
Wikipedia has a page that's titled just "System", because it's a more interesting concept than you might realize. en.wikipedia.org/wiki/System#Hi…