Irenes (many)

15 Oct, 146 tweets, 42 min read

Okay! We will be live-tweeting #PEPR20, the USENIX conference on Privacy Engineering Practice and Respect. Feel free to mute that hashtag if you don't want to drown in tweets.

@LeaKissner is now keynoting! #pepr20

Lea says the conference was gonna be in California back in May, but then 2020 happened so here we all are in October dialing in from home. #pepr20

That list of organizers is longer than last year's, we think. Nice. :) #pepr20

Lorrie Cranor is talking now about how the conference will go. There's a Slack. (We're there!) #pepr20

[CW food] There will be a virtual ice cream social tomorrow. #pepr20

In addition to a long-running "hallway track" channel on Slack, there will be two "birds of a feather" sessions, one at 11am US/Pacific today, one the same time tomorrow. #pepr20

It had already been announced by email that USENIX can't host PEPR (among other conferences) in 2021, due to their own funding issues. PEPR will continue, and there will be another venue. Details when they're ready, but not yet. #pepr20

pepr.tech will have the details for PEPR21 when they exist. They could use sponsors and volunteers! #pepr20

There will be a survey after the conference. Fill it out. #pepr20

There's one Slack channel per session, in addition to some general channels. Interesting choice. #pepr20

Amanda Walker will be giving the first session! #pepr20

Amanda Walker: "Beyond Access: Using ABAC frameworks to implement privacy and security frameworks." #pepr20

Amanda is unfortunately not at liberty to give code snippets due to corporate confidentiality, so this will be more of a lessons learned talk. #pepr20

A history of access control. Physical, identity, role, context. #pepr20

UNIX permissions are the classic example of identity-based access control. #pepr20

Role-based access control applies business rules to access. In addition to identity, there's now a concept of role. Group membership is one version of this. #pepr20

Even RBAC can be a problem. Group membership gets used as a proxy for other attributes of the user. For example, badging into an office in the US adds users to a "currently in the US" group. #pepr20

These tricks *work*, but they don't *scale*. Updating a large database takes time. #pepr20

"Around the turn of the century", heh... #pepr20

With attribute based access control, we look at other properties of a user beyond just their roles. OASIS XACML (2001) was an early language for describing these policies. Not a mechanism, just a language. #pepr20

NIST put out a standard in 2014. Microsoft has SDDL. There are others. #pepr20

These ABAC systems (we're gonna mistype that as ACAB, we just know it) ... these systems have a few common types of attributes. Subject, action, object, context. #pepr20

The subject is who's requesting access. The action is what they're trying to do. The object is what they're doing it to. The context is stuff like time of day. #pepr20
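
[Our own illustration, not from the talk: a minimal sketch of how those four attribute types might feed one access decision. The `Request` class, the `is_allowed` function, and the business-hours rule are all hypothetical names and policy we made up for the example.]

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Request:
    subject: dict  # who is asking, e.g. {"team": "support"}
    action: str    # what they want to do, e.g. "read"
    obj: dict      # what they want to do it to, e.g. {"kind": "ticket"}
    context: dict  # circumstances, e.g. {"time": time(14, 0)}


def is_allowed(req: Request) -> bool:
    """Hypothetical rule: support staff may read tickets from 09:00 to 17:00."""
    in_hours = time(9, 0) <= req.context["time"] < time(17, 0)
    return (
        req.subject.get("team") == "support"
        and req.action == "read"
        and req.obj.get("kind") == "ticket"
        and in_hours
    )
```

[Note the rule is written inline in code here, which is the first of the two approaches described next, with the refactoring cost that goes with it.]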

There have been two different approaches to this. The first is to put attributes on objects. This is fast, it lets rules be evaluated inline. #pepr20

Implementations of this approach will often have the policy written directly in the code. This is an advantage! ... It is also a disadvantage. It makes things hard to refactor. #pepr20

Even if you centralize this in a library, changes to it still require rebuilding and pushing new binaries. #pepr20

Not everything you want to know about data remains static over its lifetime. #pepr20

The other, more general approach is a policy service. Bundle up all the data, send it off to a policy service, the service replies yes/no. #pepr20

The policy service approach allows updates to take place immediately. No consistency issues. #pepr20

The policy service is slower. It introduces a remote procedure call. This is now a distributed system, so you need to do capacity planning, have SREs, etc. #pepr20

As the attributes get more abstract, putting policies at the computation level rather than the data level, this gets more complex. #pepr20

Access is a special case. The general cases are: Should this computation proceed? Should this computation include this data? #pepr20

Several of these abstract attributes... Purpose, jurisdiction, public policy, internal policy. #pepr20

Jurisdiction, like all of these, isn't a static attribute, it's a function. If you're a US citizen traveling in EU, you're subject to EU traffic laws, but "can I bring this cheese home with me" is based on more than just where you physically are right now. #pepr20

(Cool example!) #pepr20

Right to be forgotten is an example of this that affects tech companies. Users have expectations about information regarding them, even when it's nominally public information. #pepr20

Property records, tax records. Public information, but never designed to be queried horizontally across the whole population! #pepr20

Organizations might have internal policies. Administrative access is acceptable if and only if this user is part of the customer service team and they are currently assigned a ticket to work on this particular data. #pepr20
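
[Our own sketch of that rule, not Amanda's code (she couldn't share any). The point it tries to show is that the decision depends on live state, the current ticket assignments, rather than on a static group. The function name and the shape of the `open_tickets` dict are assumptions for the example.]

```python
def admin_access_allowed(user: dict, record_id: str, open_tickets: dict) -> bool:
    """Allow only customer service staff who hold a ticket for this exact record.

    open_tickets is assumed to map ticket id -> {"assignee": ..., "record_id": ...}.
    """
    if user.get("team") != "customer_service":
        return False
    return any(
        ticket["assignee"] == user["id"] and ticket["record_id"] == record_id
        for ticket in open_tickets.values()
    )
```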

The biggest lesson Amanda has on this topic: We need to think beyond access. Many privacy policies are not just about who can access what. Many security policies are not just about who's in what group. They're about purposes and jurisdiction. #pepr20

If you dig down to the *why* these rules exist, you can express them in closed form... if you can evaluate these attributes directly. #pepr20

Legal agreements change over time. Regulations change. Even user expectations change. Ask *why* rules exist, to draw these out. #pepr20

This works best for policies that are executed infrequently (as computers perceive time). You want to apply it to broad, horizontal computation, not to stuff that happens in a tight loop. #pepr20

If you can't find the answer with what you know now... ask the user! Then record the answer, because it will be useful in future. #pepr20

Example: A break-glass flow, where somebody requests emergency authorization, and specifies a reason to be audited later, while they're in the midst of debugging an urgent reliability issue. #pepr20
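
[Our own minimal sketch of a break-glass flow, assuming the only requirement is "no reason, no access, and every grant is recorded". `AUDIT_LOG` and `break_glass` are hypothetical names; a real system would write to a durable, tamper-resistant store rather than a list in memory.]

```python
from datetime import datetime, timezone

AUDIT_LOG = []  # stand-in for a durable audit store


def break_glass(user_id: str, resource: str, reason: str) -> bool:
    """Grant emergency access only when a reason is given, and record it."""
    if not reason.strip():
        return False  # refuse: nothing to audit later
    AUDIT_LOG.append(
        {
            "user": user_id,
            "resource": resource,
            "reason": reason,
            "at": datetime.now(timezone.utc).isoformat(),
        }
    )
    return True
```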

And that's time! Amanda will be taking questions on Slack. #pepr20

Question: What tools are available now as opposed to writing your own policy server? #pepr20

Answer: There aren't a lot. Some of this stuff works okay in SELinux, as long as the policy you have fits into that framework. For the server-side question, large orgs roll their own. It's a lot of work. #pepr20

Open source work in this area: Open Policy Agent, Policy Server. OPA is the one Amanda is more familiar with. It has momentum and is flexible. It's cloud-centric, rather than on-premises. #pepr20

"Look into those first, before rolling your own, speaking of some of the scar tissue" from doing this... :) #pepr20

Two main types of work in rolling your own. First, classifying data - retroactively! Apply policies based on the location of the container the data is stored in... things you knew about the data when you collected it get lost. #pepr20

Second, if you are using consent in an automated policy check, the consent needs to be captured somewhere that's programmatically accessible. Not, for example, a word processing document written by a lawyer. #pepr20

Next question (we missed what it is). Answer: Hard to have a policy language that describes how you will label the data. You need to do the labeling in advance. #pepr20

The labels themselves are generally not enough to deduce how the operation should go. Treat this as part of code review. #pepr20

Question: Can you use machine learning to help with these policies? Answer: No, in Amanda's experience. ML is useful, however, for identifying anomalies, unusual datasets or computation. #pepr20

ML is not useful for policy execution because policies are not phrased probabilistically. We don't say "We should protect most of our users". #pepr20

There is some research on going from an English-language policy description into a first draft of a policy implementation. #pepr20

Lorrie: The law tends to be written in natural language. Amanda: It's tempting as a computer scientist to try to translate the law into code, but that doesn't match how the law works. #pepr20

Question: How does APAC change the attack surfaces? Complexity?

Answer: Complexity is a big difference. Distributed access points is one of the trade-offs; having a policy service is subject to DoS and compromise. #pepr20

So policy servers have a fan-out problem, as other distributed systems do. RBAC has some of these same issues... anything but attributes used right on the spot does. #pepr20

Correction: That should have been ABAC in the prior tweet, not APAC. #pepr20

Moving on to session 2!

Privacy Architecture for Data-Driven Innovation. Derek Care and Nishant Bhajaria, both at Uber. #pepr20

It sounds like this is going to be a broad description of Uber's privacy program (or "privacy architecture"). #pepr20

Modern companies collect a lot of data! #pepr20

This data creates risks and costs. Companies have trouble assessing the risk until the data is already collected. Privacy tends to get considered only much later. #pepr20

(Note: The prior tweet, like all these, is the speaker's position, not ours. ;) ) #pepr20

Privacy needs to be a cross-functional effort, across the whole enterprise.

Security and privacy are not the same thing, though security is necessary for privacy. #pepr20

Privacy deals with minimizing what you collect, how it's shared, internal misuse, external misuse... in addition to securing things. #pepr20

There are several steps Uber uses. Classify your data. Set governance standards - both of these are planning steps. Then you execute on them, inventorying your data and enforcing privacy rules. #pepr20

Data classification, step 1, answers questions like "what is this data" and "how sensitive is it". #pepr20

Example classification, "very basic". Tier 1 highly restricted, tier 2 restricted, tier 3 confidential, tier 4 public.

[This seems overly simplistic, just in our personal opinion. There won't always be an ordering.]

#pepr20

[Of course, it's only meant to be an example.] #pepr20

Next step: Identifying data handling requirements. "How can I protect this data?" #pepr20

[Okay, so here there is discussion of classifying data along multiple dimensions. Okay, good. :) ] #pepr20

This slide, with a graphic of an expanding funnel, is about the importance of classifying and inventorying data early in the process, because it gets harder the longer you wait. #pepr20

[We notice the session 2 Slack channel doesn't have much traffic yet, while the session 1 channel is still going. Maybe the schedule should have had breaks to wrap up the chat?] #pepr20

[Still probably better to have separate channels than not.] #pepr20

Why is data inventory vital? If you don't do it, it "frustrates the intent" of GDPR and similar laws, because you can't enforce rules. Another reason is cost! Storing data costs money, you need visibility into what data you must keep and what you can delete. #pepr20

Now some discussion of the system for keeping track of data inventory. At Uber they had to crawl databases and discover data sets they hadn't known existed. #pepr20

Once you know about the data sets, you need to give them human attention to add metadata and figure out the context in which they're used. #pepr20

Ooh, here's a slide with an architecture diagram! It shows among other components Uber's UMS, [something] management service. #pepr20

Ah, UMS is the Uber Metadata Store. #pepr20

UMS is one of the data sources which feeds into scanners and classifiers. #pepr20

After the scanners have run, there's a database for data inventory. Then this feeds back into UMS. #pepr20

How to define metadata? Each individual term ... (talking too fast) There's a metadata registry and a metadata taxonomy. [Reminds us of ISO 11179.] #pepr20

Classification techniques - different techniques mix ML and human tagging in different ways. Depends on use-case. #pepr20

Learnings! (as a noun)

You really come to appreciate how much data you have. You also learn more about the types of data you have. #pepr20

What happens if you collect data faster than you can delete it? #pepr20

One benefit to having a deletion program is that you can improve data quality by deleting the stuff you don't need. Both cost and privacy benefit from this. #pepr20

As you look at data, you discover issues that limit how useful it is. For example, you might have analogous data under different names, in different tables. #pepr20

This is an iterative process that also winds up helping out the data scientists who use this data. #pepr20

Last third of the presentation will be recommendations on how to share data. #pepr20

Uber doesn't have a choice whether to share data. They are required to by law in many jurisdictions. Slides are going by very quickly... retention, privacy controls, making location coarser. K-anonymity helped them at one point. #pepr20

Lorrie comes on to apologize for the lag. Somehow they have someone working on addressing it - yay! #pepr20

Question: What if data that isn't sensitive becomes so over time?

Answer: The examples were oversimplified, the real process is iterative. Location data becomes more sensitive as it gains more decimal places. #pepr20
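
[Our own illustration of the decimal-places point, not Uber's code. One degree of latitude is about 111 km, so two decimal places is on the order of a kilometer and four is on the order of ten meters. `coarsen` is a hypothetical helper, and rounding alone is not an anonymity guarantee.]

```python
def coarsen(lat: float, lon: float, decimals: int = 2) -> tuple:
    """Reduce location precision by rounding to a fixed number of decimal places."""
    return (round(lat, decimals), round(lon, decimals))
```

[For example, `coarsen(37.774929, -122.419416)` drops the last four digits of each coordinate.]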

They use a repertoire of tags. #pepr20

They "classify" data on an October/February cadence, but in the intervening six months they "inventory" data by adding tags. #pepr20

Oh! Lea commented in the Slack channel to say that this should all still be in the channel for the first session. Several talks go in one session. The next "session" will be after the break. #pepr20

[Obviously this is all experimental, so this observation is in a spirit of constructive feedback: It probably would have been better to have the channel names be shorter, it would be easier to make sense of.] #pepr20

Question: How do you manage incremental or progressive exposure of your company to new offerings?

Answer: Our entire data inventory offering was built in-house. #pepr20

Time was short so there wasn't a lot of discussion of external tooling. #pepr20

Next talk, the final one of the first session: "Fairness through experimentation: Inequality in A/B Testing", by Guillaume Saint-Jacques, lead of computational social science at LinkedIn. There's an arxiv link to the paper. #pepr20

Their vision: To create economic opportunity for every member of the global workforce. #pepr20

This includes people who weren't born with large professional networks (... what a scary thought that some people ARE), people who aren't good with technology, etc. #pepr20

This problem can't be solved by technology alone, but nonetheless, this talk is about R&D around inequality. #pepr20

A/B testing is important when trying to build an inclusive platform. #pepr20

Unfairness - unequal outcomes "without a mechanism that is deemed just". Suppose for example that you see a LinkedIn feature that a particular group of users barely use at all, while others use it a lot. #pepr20

When you see this kind of gap you need to identify its origin. That's how A/B testing can help - test every new feature for inequality. #pepr20

When you see a gap, it can come from a biased algorithm, but it can also come from people having different reactions to the same treatment. #pepr20

Even something as simple as a change to text can produce these outcomes. #pepr20

What are they testing? Two complementary things. First, a group-based approach, comparing engagement across groups. (It sounds like they're talking about demographics, but they're quite wisely avoiding saying that word.) #pepr20

Second, "inequality impact". This is similar to how economists think about income inequality. #pepr20

"Inequality" is a concept about the shape of the distribution. #pepr20

How do you measure inequality impact? Imagine that all your members, at the start, have one "useful conversation" per day. Now you add a new feature and each member has an average of two "useful conversations" per day... #pepr20

... but that top-level figure, the average, isn't enough. Some members might be having more "useful conversations", while others have fewer. If you only have average on your dashboard, you won't see this. #pepr20

They use the Atkinson inequality index, common in economics, to get past this. It's applicable to any metric. (The slide has the formula but we won't try to type it.) #pepr20
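
[Since we didn't type the formula: here is the standard textbook Atkinson index as a short Python sketch. This is our own code, not LinkedIn's. The talk didn't say which inequality-aversion parameter they use, so `epsilon` and its default are our assumption, and all values are assumed to be strictly positive.]

```python
import math


def atkinson(values, epsilon=1.0):
    """Atkinson index of strictly positive values; 0 means perfectly equal."""
    n = len(values)
    mean = sum(values) / n
    if epsilon == 1.0:
        # Limit case: one minus the geometric mean over the arithmetic mean.
        geometric = math.exp(sum(math.log(v) for v in values) / n)
        return 1 - geometric / mean
    powered = sum(v ** (1 - epsilon) for v in values) / n
    return 1 - (powered ** (1 / (1 - epsilon))) / mean
```

[Tying it back to the "useful conversations" example: `[2, 2, 2, 2]` and `[3.5, 0.5, 3.5, 0.5]` both average two per day, but the first scores 0 (up to floating-point rounding) and the second scores roughly 0.34 with `epsilon=1`. Same average, very different distribution.]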

One of the nice properties of the Atkinson index is that, with some work, you can think of it as a "discount factor" on the utility of the change. #pepr20

Why is inequality defined this way? It lets them catch features that create unintended gaps, as well as features that close gaps. The group-based approach won't catch this unless you're already monitoring the groups affected by the change. #pepr20

They say this metric is scalable enough that it can be run on every experiment. #pepr20

They're open-sourcing this implementation. #pepr20

Because the formula is a sigma, it can be decomposed as an addition, which works well with map-reduce. #pepr20
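
[Our own sketch of why a sum helps, not their open-source implementation: each shard only needs to emit three partial sums, and partial sums from different shards can be added in any order. `EPSILON`, the shard layout, and the function names are assumptions for the example, and values are assumed strictly positive.]

```python
from functools import reduce

EPSILON = 0.5  # hypothetical inequality-aversion parameter (must not be 1 here)


def map_shard(values):
    """Per-shard partial sums: (count, sum, sum of v ** (1 - EPSILON))."""
    return (len(values), sum(values), sum(v ** (1 - EPSILON) for v in values))


def combine(a, b):
    """Reduce step: partial sums simply add, component by component."""
    return tuple(x + y for x, y in zip(a, b))


def finalize(partial):
    """Turn the combined sums into the Atkinson index."""
    n, total, powered = partial
    mean = total / n
    return 1 - ((powered / n) ** (1 / (1 - EPSILON))) / mean


shards = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
index = finalize(reduce(combine, map(map_shard, shards)))
```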

They have a review council that looks at these metrics once a month across all experiments. They feed that data back to PMs. (It's unclear whether this review blocks launches.) #pepr20

This is bottom-up fairness work, not top-down. #pepr20

Okay, this is informational only. What they do is, when they identify significant changes they invite the product owners to the review meetings. #pepr20

They pay special attention to "business-neutral" experiments. Many experiments are run in the hope that there's no difference. For example, with changes to the backend, you're mostly hoping that there are no regressions. #pepr20

They found that site speed helps inclusivity (makes sense). #pepr20

So even some of those backend changes turn out to not be neutral. Business neutral isn't neutral in terms of impact to users. #pepr20

Targeted notifications can reduce inequality (call us skeptical, everyone wants to justify their ads, but okay). #pepr20

These effects, both positive and negative, are often unintended. Feature teams are often surprised by them. #pepr20

They're building a list of lessons they've learned from this process (legally brave!) #pepr20

Background material: Graph theory. How structurally diverse is a user's professional network? A more open, structurally diverse network, bridging two different communities, is better for you - well known by social scientists. #pepr20

Experiment wrt a recently launched LinkedIn feature: InstaJobs. People get push notifications about job openings. This made a positive impact on job interactions, and also a reduction in inequality. #pepr20

They looked at inequality evaluated across a social capital metric, and found that there was more benefit to users with closed networks than to users with open networks. #pepr20

There's a GitHub link but it went by too fast. #pepr20

Lots of the comments on Slack are not questions. ;) #pepr20

Question: If equality = engagement, being able to target a larger group is good for business. What if the opposite happens? #pepr20

Answer: You can have the average and the inequality change together, or separately. Each can either fall or rise, so it's a 2x2 matrix of possibilities. #pepr20

When an increase in inequality is expected, they don't consider it a problem. For example, a feature that's meant to help influencers. #pepr20

There needs to be a human reasoning component here. This is why they care about the "inequality + mechanism" part. #pepr20

[There are some important philosophical conversations that could happen around that point.] #pepr20

Question: Who decides? Is it an individual, is there a "data fairness board"?

Answer: This is in an informational phase. The council doesn't block things. People from all over the company come to it. The social impact team, engineering, feature owners. #pepr20

Question: What about having members of the community on the board?

Answer: The company's thinking about it but they are unable to share information about it right now. #pepr20

Okay! That's the end of the first session. There's now a break. This is lunch on the east coast. We have 19 minutes. Make sure to check the post in the #pepr channel about the video posting issue. #pepr20

After the break, we'll continue over on this new thread. #pepr20

https://twitter.com/ireneista/status/1316778197636190209