Okay! We will be live-tweeting #PEPR20, the USENIX conference on Privacy Engineering Practice and Respect. Feel free to mute that hashtag if you don't want to drown in tweets.
Lea says the conference was gonna be in California back in May, but then 2020 happened so here we all are in October dialing in from home. #pepr20
That list of organizers is longer than last year's, we think. Nice. :) #pepr20
Lorrie Cranor is talking now about how the conference will go. There's a Slack. (We're there!) #pepr20
[CW food] There will be a virtual ice cream social tomorrow. #pepr20
In addition to a long-running "hallway track" channel on Slack, there will be two "birds of a feather" sessions, one at 11am US/Pacific today, one the same time tomorrow. #pepr20
It had already been announced by email that USENIX can't host PEPR (among other conferences) in 2021, due to their own funding issues. PEPR will continue at another venue. Details when they're ready, but not yet. #pepr20
pepr.tech will have the details for PEPR21 when they exist. They could use sponsors and volunteers! #pepr20
There will be a survey after the conference. Fill it out. #pepr20
There's one Slack channel per session, in addition to some general channels. Interesting choice. #pepr20
Amanda Walker will be giving the first session! #pepr20
Amanda Walker: "Beyond Access: Using ABAC frameworks to implement privacy and security policies." #pepr20
Amanda is unfortunately not at liberty to give code snippets due to corporate confidentiality, so this will be more of a lessons learned talk. #pepr20
A history of access control. Physical, identity, role, context. #pepr20
UNIX permissions are the classic example of identity-based access control. #pepr20
Role-based access control applies business rules to access. In addition to identity, there's now a concept of role. Group membership is one version of this. #pepr20
Even RBAC can be a problem. Group membership gets used as a proxy for other attributes of the user. For example, badging into an office in the US adds users to a "currently in the US" group. #pepr20
These tricks *work*, but they don't *scale*. Updating a large database takes time. #pepr20
With attribute based access control, we look at other properties of a user beyond just their roles. OASIS XACML (2001) was an early language for describing these policies. Not a mechanism, just a language. #pepr20
NIST put out a standard in 2014. Microsoft has SDDL. There are others. #pepr20
These ABAC systems (we're gonna mistype that as ACAB, we just know it) ... these systems have a few common types of attributes. Subject, action, object, context. #pepr20
The subject is who's requesting access. The action is what they're trying to do. The object is what they're doing it to. The context is stuff like time of day. #pepr20
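To make those four attribute types concrete, here's a toy sketch in Python. This is our own illustration of the shape of an ABAC check, not any real system's API; all names and the rule format are made up.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the four ABAC attribute types: subject, action,
# object, context. Names and rule shape are hypothetical.
@dataclass
class AccessRequest:
    subject: dict                               # who's requesting, e.g. {"user": "alice", "team": "support"}
    action: str                                 # what they're trying to do, e.g. "read"
    object: dict                                # what they're doing it to, e.g. {"type": "ticket"}
    context: dict = field(default_factory=dict) # stuff like time of day, e.g. {"hour": 14}

def evaluate(request: AccessRequest, rules) -> bool:
    """Each rule is a predicate over the whole request; deny by default."""
    return any(rule(request) for rule in rules)

# Example rule: support staff may read US tickets during business hours.
rules = [
    lambda r: (r.subject.get("team") == "support"
               and r.action == "read"
               and r.object.get("region") == "US"
               and 9 <= r.context.get("hour", 0) < 17)
]

req = AccessRequest(subject={"user": "alice", "team": "support"},
                    action="read",
                    object={"type": "ticket", "region": "US"},
                    context={"hour": 14})
```

The point is that the decision depends on the whole tuple, not just identity or group membership. #pepr20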
There have been two different approaches to this. The first is to put attributes on objects. This is fast, it lets rules be evaluated inline. #pepr20
Implementations of this approach will often have the policy written directly in the code. This is an advantage! ... It is also a disadvantage. It makes things hard to refactor. #pepr20
Even if you centralize this in a library, changes to it still require rebuilding and pushing new binaries. #pepr20
Not everything you want to know about data remains static over its lifetime. #pepr20
The other, more general approach is a policy service. Bundle up all the data, send it off to a policy service, the service replies yes/no. #pepr20
The policy service approach allows updates to take place immediately. No consistency issues. #pepr20
The policy service is slower. It introduces a remote procedure call. This is now a distributed system, so you need to do capacity planning, have SREs, etc. #pepr20
As the attributes get more abstract, with policies at the computation level rather than the data level, this gets more complex. #pepr20
Access is a special case. The general cases are: Should this computation proceed? Should this computation include this data? #pepr20
Several of these abstract attributes... Purpose, jurisdiction, public policy, internal policy. #pepr20
Jurisdiction, like all of these, isn't a static attribute, it's a function. If you're a US citizen traveling in the EU, you're subject to EU traffic laws, but "can I bring this cheese home with me" is based on more than just where you physically are right now. #pepr20
Right to be forgotten is an example of this that affects tech companies. Users have expectations about information regarding them, even when it's nominally public information. #pepr20
Property records, tax records. Public information, but never designed to be queried horizontally across the whole population! #pepr20
Organizations might have internal policies. Administrative access is acceptable if and only if this user is part of the customer service team and they are currently assigned a ticket to work on this particular data. #pepr20
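That internal-policy example is easy to sketch as an attribute check. This is our own toy rendering of the rule stated above; the data shapes (team sets, ticket records) are hypothetical.

```python
# Sketch of the rule above: admin access is acceptable iff the user is on
# the customer service team AND currently has a ticket assigned for this
# particular data. All names and data shapes are illustrative.
def admin_access_allowed(user: str, data_id: str, teams: dict, open_tickets: list) -> bool:
    on_cs_team = user in teams.get("customer-service", set())
    has_ticket = any(t["assignee"] == user and t["data_id"] == data_id
                     for t in open_tickets)
    return on_cs_team and has_ticket

teams = {"customer-service": {"alice", "bob"}}
tickets = [{"assignee": "alice", "data_id": "account-42"}]
```

Note how both conditions are live lookups (team roster, current ticket queue), not static group flags. #pepr20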
The biggest lesson Amanda has on this topic: We need to think beyond access. Many privacy policies are not just about who can access what. Many security policies are not just about who's in what group. They're about purposes and jurisdiction. #pepr20
If you dig down to *why* these rules exist, you can express them in closed form... if you can evaluate these attributes directly. #pepr20
Legal agreements change over time. Regulations change. Even user expectations change. Ask *why* rules exist, to draw these out. #pepr20
This works best for policies that are executed infrequently (as computers perceive time). You want to apply it to broad, horizontal computation, not to stuff that happens in a tight loop. #pepr20
If you can't find the answer with what you know now... ask the user! Then record the answer, because it will be useful in future. #pepr20
Example: A break-glass flow, where somebody requests emergency authorization, and specifies a reason to be audited later, while they're in the midst of debugging an urgent reliability issue. #pepr20
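A minimal sketch of that break-glass flow, in our own words: grant emergency access right away, but refuse a request with no stated reason, and record every grant for later audit. Function and field names are made up for illustration.

```python
import datetime

# Toy break-glass flow: access is granted immediately during an emergency,
# but a reason is mandatory and every grant is recorded for later audit.
# All names here are hypothetical.
AUDIT_LOG: list = []

def break_glass(user: str, resource: str, reason: str, ttl_minutes: int = 60) -> bool:
    if not reason.strip():
        return False  # a reason to be audited later is mandatory
    AUDIT_LOG.append({
        "user": user,
        "resource": resource,
        "reason": reason,
        "granted_at": datetime.datetime.now(datetime.timezone.utc),
        "expires_after_minutes": ttl_minutes,
    })
    return True
```

The recorded reason is what makes the later audit possible. #pepr20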
And that's time! Amanda will be taking questions on Slack. #pepr20
Question: What tools are available now as opposed to writing your own policy server? #pepr20
Answer: There aren't a lot. Some of this stuff works okay in SELinux, as long as the policy you have fits into that framework. For the server-side question, large orgs roll their own. It's a lot of work. #pepr20
Open source work in this area: Open Policy Agent, Policy Server. OPA is the one Amanda is more familiar with. It has momentum and is flexible. It's cloud-centric, rather than on-premises. #pepr20
"Look into those first, before rolling your own, speaking of some of the scar tissue" from doing this... :) #pepr20
Two main types of work in rolling your own. First, classifying data - retroactively! Policies get applied based on the location of the container the data is stored in... things you knew about the data when you collected it get lost. #pepr20
Second, if you are using consent in an automated policy check, the consent needs to be captured somewhere that's programmatically accessible. Not, for example, a word processing document written by a lawyer. #pepr20
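To illustrate that point: consent needs to live as structured records a policy check can query, not as prose. This is our own toy schema, not anything from the talk.

```python
# Sketch of programmatically accessible consent: structured records a policy
# check can query, instead of a word processing document. Schema is hypothetical.
CONSENTS = {
    ("user-1", "marketing-email"): {"granted": True,  "source": "signup-form-v3"},
    ("user-2", "marketing-email"): {"granted": False, "source": "settings-page"},
}

def has_consent(user_id: str, purpose: str) -> bool:
    record = CONSENTS.get((user_id, purpose))
    return bool(record and record["granted"])
```

Recording the *source* of each consent also helps when you later need to show why an automated check said yes. #pepr20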
Next question (we missed what it is). Answer: Hard to have a policy language that describes how you will label the data. You need to do the labeling in advance. #pepr20
The labels themselves are generally not enough to deduce how the operation should go. Treat this as part of code review. #pepr20
Question: Can you use machine learning to help with these policies? Answer: No, in Amanda's experience. ML is useful, however, for identifying anomalies, unusual datasets or computation. #pepr20
ML is not useful for policy execution because policies are not phrased probabilistically. We don't say "We should protect most of our users". #pepr20
There is some research on going from an English-language policy description into a first draft of a policy implementation. #pepr20
Lorrie: The law tends to be written in natural language. Amanda: It's tempting as a computer scientist to try to translate the law into code, but that doesn't match how the law works. #pepr20
Question: How does ABAC change the attack surfaces? Complexity? #pepr20
Answer: Complexity is a big difference. Distributed access points are one of the trade-offs; a policy service is subject to DoS and compromise. #pepr20
So policy servers have a fan-out problem, as other distributed systems do. RBAC has some of these same issues... anything but attributes used right on the spot does. #pepr20
Moving on to session 2!
Privacy Architecture for Data-Driven Innovation. Derek Care and Nishant Bhajaria, both at Uber. #pepr20
It sounds like this is going to be a broad description of Uber's privacy program (or "privacy architecture"). #pepr20
This data creates risks and costs. Companies have trouble assessing the risk until the data is already collected. Privacy tends to get considered only much later. #pepr20
(Note: The prior tweet, like all these, is the speaker's position, not ours. ;) ) #pepr20
Privacy needs to be a cross-functional effort, across the whole enterprise.
Security and privacy are not the same thing, though security is necessary for privacy. #pepr20
Privacy deals with minimizing what you collect, how it's shared, internal misuse, external misuse... in addition to securing things. #pepr20
There are several steps Uber uses. Classify your data. Set governance standards - both of these are planning steps. Then you execute on them, inventorying your data and enforcing privacy rules. #pepr20
Data classification, step 1, answers questions like "what is this data" and "how sensitive is it". #pepr20
[Of course, it's only meant to be an example.] #pepr20
Next step: Identifying data handling requirements. "How can I protect this data?" #pepr20
[Okay, so here there is discussion of classifying data along multiple dimensions. Okay, good. :) ] #pepr20
This slide, with a graphic of an expanding funnel, is about the importance of classifying and inventorying data early in the process, because it gets harder the longer you wait. #pepr20
[We notice the session 2 Slack channel doesn't have much traffic yet, while the session 1 channel is still going. Maybe the schedule should have had breaks to wrap up the chat?] #pepr20
[Still probably better to have separate channels than not.] #pepr20
Why is data inventory vital? If you don't do it, it "frustrates the intent" of GDPR and similar laws, because you can't enforce rules. Another reason is cost! Storing data costs money, you need visibility into what data you must keep and what you can delete. #pepr20
Now some discussion of the system for keeping track of data inventory. At Uber they had to crawl databases and discover data sets they hadn't known existed. #pepr20
Once you know about the data sets, you need to give them human attention to add metadata and figure out the context in which they're used. #pepr20
Ooh, here's a slide with an architecture diagram! It shows among other components Uber's UMS, [something] management service. #pepr20
UMS is one of the data sources which feeds into scanners and classifiers. #pepr20
After the scanners have run, there's a database for data inventory. Then this feeds back into UMS. #pepr20
How to define metadata? Each individual term ... (talking too fast) There's a metadata registry and a metadata taxonomy. [Reminds us of ISO 11179.] #pepr20
Classification techniques - different techniques mix ML and human tagging in different ways. Depends on use-case. #pepr20
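Here's a toy sketch of the rule-based end of that spectrum. Real pipelines mix ML and human tagging; these regex patterns are deliberately simplified illustrations, not production-grade detectors.

```python
import re

# Toy rule-based classifier: scan a sample of column values and tag which
# kinds of sensitive data appear. Patterns are simplified illustrations.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def classify_column(values) -> set:
    """Return the set of sensitive-data labels found in a column sample."""
    labels = set()
    for v in values:
        for label, pattern in PATTERNS.items():
            if pattern.search(str(v)):
                labels.add(label)
    return labels
```

In a real inventory, labels like these would then go through human review before driving policy. #pepr20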
Learnings! (as a noun)
You really come to appreciate how much data you have. You also learn more about the types of data you have. #pepr20
What happens if you collect data faster than you can delete it? #pepr20
One benefit to having a deletion program is that you can improve data quality by deleting the stuff you don't need. Both cost and privacy benefit from this. #pepr20
As you look at data, you discover issues that limit how useful it is. For example, you might have analogous data under different names, in different tables. #pepr20
This is an iterative process that also winds up helping out the data scientists who use this data. #pepr20
Last third of the presentation will be recommendations on how to share data. #pepr20
Uber doesn't have a choice whether to share data. They are required to by law in many jurisdictions. Slides are going by very quickly... retention, privacy controls, making location coarser. K-anonymity helped them at one point. #pepr20
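Two of those techniques are easy to sketch. Rounding coordinates makes location coarser; a k-anonymity check verifies that every released value is shared by at least k records. This is our own minimal illustration, with parameters chosen arbitrarily.

```python
from collections import Counter

# Sketch of location coarsening: fewer decimal places = coarser location.
def coarsen(lat: float, lng: float, decimals: int = 2):
    return (round(lat, decimals), round(lng, decimals))

# Sketch of a k-anonymity check: every released value must be shared by
# at least k records, so no row is uniquely identifying on its own.
def is_k_anonymous(rows, k: int) -> bool:
    counts = Counter(rows)
    return all(c >= k for c in counts.values())
```

Coarsening first often makes the k-anonymity check pass, which is presumably why the two were mentioned together. #pepr20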
Lorrie comes on to apologize for the lag. They have someone working on addressing it - yay! #pepr20
Question: What if data that isn't sensitive becomes so over time?
Answer: The examples were oversimplified, the real process is iterative. Location data becomes more sensitive as it adds more decimal points. #pepr20
They "classify" data on an October/February cadence, but in the intervening six months they "inventory" data by adding tags. #pepr20
Oh! Lea commented in the Slack channel to say that this should all still be in the channel for the first session. Several talks go in one session. The next "session" will be after the break. #pepr20
[Obviously this is all experimental, so this observation is in a spirit of constructive feedback: It probably would have been better to have the channel names be shorter, it would be easier to make sense of.] #pepr20
Question: How do you manage incremental or progressive exposure of your company to new offerings?
Answer: Our entire data inventory offering was built in-house. #pepr20
Time was short so there wasn't a lot of discussion of external tooling. #pepr20
Next talk, the final one of the first session: "Fairness through experimentation: Inequality in A/B Testing", by Guillaume Saint-Jacques, lead of computational social science at LinkedIn. There's an arxiv link to the paper. #pepr20
Their vision: To create economic opportunity for every member of the global workforce. #pepr20
This includes people who weren't born with large professional networks (... what a scary thought that some people ARE), people who aren't good with technology, etc. #pepr20
This problem can't be solved by technology alone, but nonetheless, this talk is about R&D around inequality. #pepr20
A/B testing is important when trying to build an inclusive platform. #pepr20
Unfairness - unequal outcomes "without a mechanism that is deemed just". Suppose for example that you see a LinkedIn feature that a particular group of users barely use at all, while others use it a lot. #pepr20
When you see this kind of gap you need to identify its origin. That's how A/B testing can help - test every new feature for inequality. #pepr20
When you see a gap, it can come from a biased algorithm, but it can also come from people having different reactions to the same treatment. #pepr20
Even something as simple as a change to text can produce these outcomes. #pepr20
What are they testing? Two complementary things. First, a group-based approach, comparing engagement across groups. (It sounds like they're talking about demographics, but they're quite wisely avoiding saying that word.) #pepr20
Second, "inequality impact". This is similar to how economists think about income inequality. #pepr20
"Inequality" is a concept about the shape of the distribution. #pepr20
How do you measure inequality impact? Imagine that all your members, at the start, have one "useful conversation" per day. Now you add a new feature and each member has an average of two "useful conversations" per day... #pepr20
... but that top-level figure, the average, isn't enough. Some members might be having more "useful conversations", while others have fewer. If you only have average on your dashboard, you won't see this. #pepr20
They use the Atkinson inequality index, common in economics, to get past this. It's applicable to any metric. (The slide has the formula but we won't try to type it.) #pepr20
One of the nice properties of the Atkinson index is that, with some work, you can think of it as a "discount factor" on the utility of the change. #pepr20
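Since we didn't type the slide's formula, here's the Atkinson index as it's commonly defined in the economics literature - our own sketch, not code from the talk. For inequality-aversion parameter eps != 1, A = 1 - (mean of (y_i/mean)^(1-eps))^(1/(1-eps)); it's 0 for perfect equality and approaches 1 as inequality grows.

```python
import math

# Atkinson inequality index, as commonly defined in economics.
# eps is the inequality-aversion parameter; values must be positive.
def atkinson(values, eps: float = 0.5) -> float:
    n = len(values)
    mean = sum(values) / n
    if eps == 1:
        # Limiting case: 1 - geometric mean / arithmetic mean.
        geo_mean = math.exp(sum(math.log(v) for v in values) / n)
        return 1 - geo_mean / mean
    ee = 1 - eps
    return 1 - (sum(v ** ee for v in values) / n) ** (1 / ee) / mean
```

Applicable to any positive-valued metric, which matches the claim above. #pepr20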
Why is inequality defined this way? It lets them catch features that create unintended gaps, as well as features that close gaps. The group-based approach won't catch this unless you're already monitoring the groups affected by the change. #pepr20
They say this metric is scalable enough that it can be run on every experiment. #pepr20
They're open-sourcing this implementation. #pepr20
Because the formula is built from sums, it can be decomposed into partial sums per shard, which works well with map-reduce. #pepr20
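Our own illustration of that decomposition: for eps != 1, the Atkinson index only needs N, sum(y), and sum(y^(1-eps)), and each of those is a plain sum, so shards can be aggregated independently and combined at the end.

```python
# Map-reduce-style decomposition of the Atkinson index (eps != 1).
# Each shard emits three running sums; shards combine by addition.
EPS = 0.5

def map_shard(values):
    return (len(values), sum(values), sum(v ** (1 - EPS) for v in values))

def reduce_shards(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(agg):
    n, total, power_sum = agg
    mean = total / n
    ee = 1 - EPS
    return 1 - (power_sum / n) ** (1 / ee) / mean
```

This is why the metric can plausibly run on every experiment: the per-record work is trivial and the combine step is associative. #pepr20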
They have a review council that looks at these metrics once a month across all experiments. They feed that data back to PMs. (It's unclear whether this review blocks launches.) #pepr20
This is bottom-up fairness work, not top-down. #pepr20
Okay, this is informational only. What they do is, when they identify significant changes they invite the product owners to the review meetings. #pepr20
They pay special attention to "business-neutral" experiments. Many experiments are run in the hope that there's no difference. For example, with changes to the backend, you're mostly hoping there are no regressions. #pepr20
They found that site speed helps inclusivity (makes sense). #pepr20
So even some of those backend changes turn out to not be neutral. Business neutral isn't neutral in terms of impact to users. #pepr20
Targeted notifications can reduce inequality (call us skeptical, everyone wants to justify their ads, but okay). #pepr20
These effects, both positive and negative, are often unintended. Feature teams are often surprised by them. #pepr20
They're building a list of lessons they've learned from this process (legally brave!) #pepr20
Background material: Graph theory. How structurally diverse is a user's professional network? A more open, structurally diverse network, bridging two different communities, is better for you - well known by social scientists. #pepr20
Experiment on a recently launched LinkedIn feature: InstaJobs. People get push notifications about job openings. This made a positive impact on job interactions, and also a reduction in inequality. #pepr20
They looked at inequality evaluated across a social capital metric, and found that there was more benefit to users with closed networks than to users with open networks. #pepr20
There's a GitHub link but it went by too fast. #pepr20
Lots of the comments on Slack are not questions. ;) #pepr20
Question: If equality = engagement, being able to target a larger group is good for business. What if the opposite happens? #pepr20
Answer: You can have the average and the inequality change together, or separately. Each can either fall or rise, so it's a 2x2 matrix of possibilities. #pepr20
When an increase in inequality is expected, they don't consider it a problem. For example, a feature that's meant to help influencers. #pepr20
There needs to be a human reasoning component here. This is why they care about the "inequality + mechanism" part. #pepr20
[There are some important philosophical conversations that could happen around that point.] #pepr20
Question: Who decides? Is it an individual, is there a "data fairness board"?
Answer: This is in an informational phase. The council doesn't block things. People from all over the company come to it. The social impact team, engineering, feature owners. #pepr20
Question: What about having members of the community on the board?
Answer: The company's thinking about it but they are unable to share information about it right now. #pepr20
Okay! That's the end of the first session. There's now a break. This is lunch on the east coast. We have 19 minutes. Make sure to check the post in the #pepr channel about the video posting issue. #pepr20
After the break, we'll continue over on this new thread. #pepr20
We're live-tweeting PEPR20! After the break, this will be the thread head for the fourth block of talks ("session"), which will be the last one for the first day. #pepr20
"Product Privacy Journey: Towards a Product Centric Privacy Engineering Framework", by Igor Trindade Oliveira, is now starting.
Why a product-centric approach? Other possible focuses would be compliance, design, engineering, users... #pepr20
Okay! This will be the thread head for the third session of #pepr20, which will re-convene after the birds-of-a-feather breakout sessions, in about ten minutes.
We have a secret motive for tweeting this, it helps us pay attention. Our brain doesn't cling to things unless we're using *all* of our brain.
Okay, the theme of this next block of talks ("session") is design. So now we're on slack channel 3. #pepr20
Just to keep our tweets organized, this will be the thread topper for our live-tweet of session 2 of #pepr20, when the break is over.
Okay! We're back from break. The talk title went by very quickly, ... now there's a pause, hopefully the speaker will introduce themselves again. #pepr20
According to the schedule, this one should be "Building and Deploying a Privacy Preserving Data Analysis Platform", by Frederick Jansen. #pepr20