Okay! Today and tomorrow we will once again be live-tweeting #PEPR22, the Usenix conference on Privacy Engineering Practice and Respect.
For timing reasons we didn't get to see the first session. The second one, "Expanding Differentially Private Solutions: A Python Case Study" is now ongoing.
The presenter is Vadym Doroshenko, affiliated with Google. They don't seem to have said explicitly, but this does sound like it's describing work that was done at Google (we're only basing that on the slides).
The presentation is starting out with an example data set, and an example SQL query to be run on it.
The presenter is now describing an architecture which executes queries in ways that enforce DP. They have a website, PipelineDP. Maybe this *isn't* a Google thing?
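(Aside from us, for anyone following along at home: the core idea behind enforcing DP on an aggregate query is to bound each user's contribution and add calibrated noise. Here's a tiny sketch of that idea in plain Python, with made-up column names; this is our own illustration of the concept, not PipelineDP's actual API.)

```python
import numpy as np

# Toy rows, one per visit; column names are made up for this sketch.
visits = [
    {"user_id": 1, "spend": 12.0},
    {"user_id": 2, "spend": 7.5},
    {"user_id": 1, "spend": 30.0},
    {"user_id": 3, "spend": 5.0},
]

def dp_count(rows, epsilon, max_rows_per_user=1):
    """Noisy count: cap each user's contribution, then add Laplace noise
    scaled to that cap (the sensitivity) divided by epsilon."""
    kept, seen = [], {}
    for r in rows:
        seen[r["user_id"]] = seen.get(r["user_id"], 0) + 1
        if seen[r["user_id"]] <= max_rows_per_user:
            kept.append(r)
    noise = np.random.laplace(scale=max_rows_per_user / epsilon)
    return len(kept) + noise

print(dp_count(visits, epsilon=1.0))
```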
(We're not managing to type out everything! Ah well. It's a best effort.)
Oh, these are lightning talks. :)
Now we have a talk about "Tumult Platform", a forthcoming open-source platform for safely releasing aggregate data.
They have four main goals with the platform. To be easy to use by people who aren't experts in privacy; to provide provable guarantees of differential privacy; ...
Implicit in the second goal, they explain, is that they aren't treating the data analyst as an adversary for this work.
That is, there will be an analyst who works to prepare the data, and that person is trusted.
Last two goals: It's extensible, and it scales to as much data as you need it to.
The platform is organized into layers, corresponding with the design goals. The top layer is analytics, intended for data scientists without DP expertise. It's an API in Python.
The API helps the analyst understand how they're spending their privacy budget.
They have features to help the analyst get as much accuracy as possible (i.e. add only as much noise to the data as needed).
They also have features to do complex transformations on the data, including "flatmaps" (turn a row into many rows), joins, and other relational operations.
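(For a sense of what that analyst-facing shape looks like, here's a toy sketch of our own: a session object that tracks the remaining privacy budget and supports a flatmap. All the names here are hypothetical, not Tumult's API.)

```python
import numpy as np

class ToySession:
    """Toy stand-in for an analyst-facing DP session (not Tumult's API):
    it tracks the remaining privacy budget and refuses queries past it."""
    def __init__(self, rows, total_epsilon):
        self.rows = rows
        self.remaining_epsilon = total_epsilon

    def flat_map(self, fn):
        # "flatmap": turn each row into zero or more rows.
        self.rows = [out for row in self.rows for out in fn(row)]
        return self

    def dp_count(self, epsilon):
        if epsilon > self.remaining_epsilon:
            raise ValueError("privacy budget exhausted")
        self.remaining_epsilon -= epsilon
        # Toy sensitivity of 1; a real system would bound per-user contributions.
        return len(self.rows) + np.random.laplace(scale=1.0 / epsilon)

orders = [{"user": "a", "items": ["x", "y"]}, {"user": "b", "items": ["z"]}]
session = ToySession(orders, total_epsilon=1.0)
session.flat_map(lambda r: [{"user": r["user"], "item": i} for i in r["items"]])
print(session.dp_count(epsilon=0.5), "budget left:", session.remaining_epsilon)
```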
Corresponding to their middle two goals, their architecture has a "core" layer.
Core is a collection of components that transform and measure data, and can be composed together.
When components are composed, the privacy properties of the composed thing are derived from the properties of the components that make it up.
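(The composition idea in miniature, as our own sketch rather than anything from Tumult Core: under basic sequential composition, the epsilons of the pieces simply add up, so the guarantee of the composed thing can be derived mechanically from its parts.)

```python
from dataclasses import dataclass

@dataclass
class Measurement:
    """A component with a pure-DP guarantee, tracked as its epsilon."""
    name: str
    epsilon: float

def compose(components):
    # Basic sequential composition: running several DP measurements on the
    # same data gives a mechanism whose epsilon is at most the sum.
    name = "composed(" + ", ".join(c.name for c in components) + ")"
    return Measurement(name, sum(c.epsilon for c in components))

print(compose([Measurement("noisy_count", 0.3), Measurement("noisy_sum", 0.7)]))
# -> Measurement(name='composed(noisy_count, noisy_sum)', epsilon=1.0)
```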
The authors have published some work elsewhere on mitigating floating point privacy vulnerabilities. That work is incorporated in this layer.
The functionality of the core layer is abstract and hard to use directly; the analytics layer is meant to make it usable by providing ready-made recipes.
Regarding scalability, they've tested up to a billion rows and runtime seems to scale linearly with that. (Does nobody actually analyze runtime, just measure it? Ah well :)
Their github username is tumult-labs and it's available now!
Excuse us. Their *gitlab* username. gitlab.com/tumult-labs/
One more lightning talk in this session.
"Compiling Python Programs into Differentially Private Ones", by Johan Leduc and Nicolas Grislain.
Their platform has an interface where users can import data as CSV or as an sqlite database.
There's additional UI where users can review (and edit?) the schema for the imported data, and generate synthetic data that matches the same rules.
The workflow they provide involves someone they call the "data partitioner". This is *not* a trusted role, this person sees synthetic data.
They don't want to reimplement Python's many data analysis libraries, so instead they implement stubs which make Pandas (and other APIs? it's unclear) work with their dataset.
This slide shows a short code snippet, and a directed acyclic graph which is the data flow that was extracted from the snippet.
They are now introducing the notion of "adjacent datasets" from differential privacy. These are two datasets which differ by only one entry.
A differentially private mechanism doesn't depend too much on the contribution of a single entity. They formalize this notion by defining a concept of a "protected entity" and identifying it during the setup steps.
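(For reference, the standard formal statement behind "doesn't depend too much": a mechanism M is (ε, δ)-differentially private if, for all adjacent datasets D and D′ and every set of outputs S, the following holds.)

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
```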
They say a dataset is PE preserving if each row is linked to at most one PE.
Transformations on the data are PE preserving if they don't mess that up. Selecting columns is PE preserving. Aggregate functions such as mean and standard deviation are *not*, since they mix many PEs together.
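(A toy pandas sketch of the PE-preservation idea, from us; the column names and the check are illustrative, not their implementation.)

```python
import pandas as pd

# Each row carries the protected entity (PE) it is linked to.
df = pd.DataFrame({
    "pe_id": ["alice", "alice", "bob"],
    "city":  ["Paris", "Paris", "Oslo"],
    "spend": [10.0, 25.0, 7.0],
})

def is_pe_preserving(frame):
    # Toy check: the output still has a PE column, so every row stays linked
    # to (at most) one protected entity.
    return "pe_id" in frame.columns

print(is_pe_preserving(df[["pe_id", "spend"]]))           # column selection: True
print(is_pe_preserving(df[["spend"]].mean().to_frame()))  # aggregation: False
```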
Looking back at the data flow graph, how does their algorithm use this to get a differentially private estimate of the result?
They recursively go up through the graph and make sure, at each node, that the inputs to it are PE preserving.
Now an example! This is a pretty abstract graph of a model they want to train with DP-SGD.
They show how their algorithm will transform standard deviation into a differentially private version of standard deviation, tracking epsilon and delta flowing from the node above that.
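(For flavor, here's the textbook way to get a DP standard deviation: a noisy count, a noisy sum, and a noisy sum of squares, each spending a share of epsilon. With Laplace noise this gives a pure-epsilon guarantee; a Gaussian-noise variant is where delta would enter. This is the standard construction, not necessarily the exact code their compiler emits.)

```python
import numpy as np

def dp_std(values, epsilon, lower, upper):
    """Textbook DP standard deviation from noisy count / sum / sum of squares.
    Assumes each protected entity contributes one value, clipped to
    [lower, upper]; sensitivities are for add/remove-one-record adjacency."""
    x = np.clip(np.asarray(values, dtype=float), lower, upper)
    eps_each = epsilon / 3.0  # split the budget across the three noisy queries
    n = len(x) + np.random.laplace(scale=1.0 / eps_each)
    s = x.sum() + np.random.laplace(scale=max(abs(lower), abs(upper)) / eps_each)
    s2 = (x ** 2).sum() + np.random.laplace(scale=max(lower ** 2, upper ** 2) / eps_each)
    n = max(n, 1.0)
    var = max(s2 / n - (s / n) ** 2, 0.0)
    return float(np.sqrt(var))

print(dp_std(np.random.normal(10, 2, size=1000), epsilon=1.0, lower=0, upper=20))
```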
That's the talk! Now a combined Q&A for the last three lightning talks.
Participants on Slack are also able to submit questions.
Here's a question about the middle talk, why are hostile analysts out of scope?
Answer: It would make the platform much harder to use. An adversarial analyst is very powerful and can, for example, exploit timing channels to learn stuff about individual rows.
Same question for the others, can you talk more about trust relationships in your platforms?
Re: the first talk, similar reasoning. Protecting against these threats would be very difficult. PipelineDP is about computations, and only offers a pretty limited set of stuff, none of which is really focused on data management.
Re: third talk, they do have a clear separation between the data owner and the analyst. The data never leaves the system and is accessed only through their UI.
Next question. As more analysts within an organization use these tools, how do you imagine they'll coordinate their use of the privacy budget? All three speakers will respond.
(We have prosopagnosia - face blindness - so we're unlikely to be able to tell you who's speaking in this situation. Sorry! Think of it as having run out of privacy budget. :))
This is a great question to which none of the frameworks have a solution yet, but note that organizations that have been doing this stuff already have to make these calls. For example, the IRS already has less formal ways to do it.
Next question. Usability - have they done user testing with data scientists? If not, how do they evaluate that?
First answer. The data quality is usually enough that they can get an estimate, then send it off for remote execution, where the privacy parameters are set by them based on their experience.
Second answer. They have not done formal user studies, they're a startup. They have done informal usability with trial users and have learned from it, including asking trial users to try a few things side by side.
Third answer. They have built on top of previous systems, and have done user studies on those. Some of that was taken into consideration in their design.
(Yes, we're even more disoriented than you are, not knowing which answer goes with which talk. Ah well :))
Long question: There's a tension in applying DP at the user level, because it requires identifying which records belong to which user. How do we think about the implications, not just of the model outputs, but of collecting that identifying information in the first place?
Vadym (first talk): They offer some options, how to bound sensitivities of queries. It's up to the user how to do that.
Johan (third talk): They're downstream of this question. The data comes and it's the role of the data owner to decide what is a protected entity.
Michael (second talk): A DP mechanism doesn't just provide a single guarantee, it provides a family of guarantees. Even if you don't know exactly how many records one user contributed, you can model the uncertainty.
In a newspaper example, the users you most want to protect are probably the editors who contributed many records.
Last question: These platforms all seem similar, why choose one? "You have 40 seconds each!"
Johan: The data partitioner will never see the real data and doesn't have access, it's an important feature.
Michael: Try them all, decide for yourself. Ours is extensible.
Vadym: Try them all, it's hard to decide just from an advertisement. Play with it and make a decision whether it's for you. Ours is for non-experts, which could help.
Applause.
Now there's a 30-minute break. We'll be back!
We're so jealous of all y'all who are attending in person this year. We decided not to - COVID - but we really miss this.
We joked about sending a telepresence robot to do hallway track for us. Maybe next year - it would have required some planning.
Okay! Next session starting now! The topic of these next few talks is threat modeling. #pepr22
Cara Bloom, from MITRE, will speak about "Privacy Threat Modeling".
"Privacy risks" include "privacy threats", "privacy vulnerabilities", and "privacy attacks".
A privacy attack is "actions or inactions that causes a perceived harm" and isn't "solely" about security.
cites Solove's Taxonomy of Privacy as a list of harms
Venn diagram: cybersecurity risks intersect with privacy risks. the scope of this work is only the privacy risks, not the cybersecurity part or the intersection.
(It should be clear, but we're quoting the speaker. Our friends know we're not fans of tacking "cyber" on as an adjective to things.)
Example: Lenovo Superfish. Was this a security attack? Their laptops came pre-bundled with software that intercepted user internet traffic. The harms from this are "insecurity" and "intrusion".
Example: Cambridge Analytica. Demonstrates the power of privacy attacks that aren't security violations.
"Privacy risk management is the exception, not the rule". Most organizations are focused on compliance rather than on risk management.
They're hoping that focusing on threats will change the narrative there.
(We have personal feelings on this! No time to get into them now though.)
Things without intent can still be privacy attacks, if they cause privacy harm.
They've built a database of privacy attacks. From this they've built a privacy threat taxonomy, a hierarchical ontology. They've also built privacy threat clusters.
They're taking these together and analyzing them into a "privacy kill chain". There will be an example in a moment.
(we *really* do not like this military language. it's dangerous. oh well - right now we're just trying to report on what's being discussed)
so they break attacks into individual "threat actions". they then group these actions into similar "activities". then they tweak things to minimize gaps and overlaps.
they're also facing challenges in how to name things. they will keep going until the taxonomy stabilizes.
top level of the taxonomy is 13 activities. notice, consent, collection, insecurity, identification, quality assurance, manageability, aggregation, processing, sharing, use, retention & destruction, deviations.
there was an example drilling down into one of these categories, but it was too much detail to transcribe.
okay, example: data de-identification threat pattern. "whoever put us after the DP group is pretty good." *laughter*
so here some stuff in the taxonomy is highlighted. what's essential for this pattern is that the data is not de-identified properly.
sequence DAG: collection -> identification -> (QA, processing) -> sharing. in QA "data not de-identified", in processing "insufficient de-identification"
aggregation is also implicated in this example, but not essential to the "kill chain"
now here's the threat clusters associated with Superfish. insecurity, preemption, behavioral advertising.
... and here's what it looks like on the taxonomy (slide is very complicated)
the hope is that this categorization makes it easier to understand threats and facilitates discussion about attacks
they've seen similar models for security (probably referring to CWEs and so on)
to use it for threat modeling you can do two things. either build a data set of attacks and map them to the taxonomy, or map the stuff users can do in your system onto it to ask "could this be a threat action"
either way, you then figure out which threat patterns are relevant to you, and disrupt them by creating privacy mitigations
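(as a thought experiment from us, the second option could look something like this in code: the activity names come from the taxonomy above, everything else is hypothetical)

```python
# Hypothetical sketch of mapping system capabilities onto the taxonomy and
# checking whether a known threat pattern is possible. Activity names are the
# talk's top-level categories; everything else is made up for illustration.
DEIDENTIFICATION_PATTERN = ["collection", "identification", "quality assurance",
                            "processing", "sharing"]

# Threat actions that are actually possible in our (hypothetical) system.
system_capabilities = {"collection", "identification", "quality assurance",
                       "processing", "sharing", "aggregation"}

def pattern_is_possible(pattern, capabilities):
    missing = [step for step in pattern if step not in capabilities]
    return len(missing) == 0, missing

possible, missing = pattern_is_possible(DEIDENTIFICATION_PATTERN, system_capabilities)
print("de-identification kill chain possible:", possible, "missing steps:", missing)
```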
they're building a web app to help you do this
they hope that this will become part of your risk management and help you figure out how to prioritize mitigations
they expect that privacy will follow the same path security did wrt getting more formalized
(we personally agree that privacy should get more formalized, we're just concerned because ethics is a *process* not a set of answers, so there needs to be room for human judgement, too)
talk's done! Q&A now
Q: differences between threats, harms?
A: that distinction is why they use the word "attack" for the combination of a flaw in a system and a consequence/outcome
the steps that are taken as part of the attack are the threat
a threat is not a flaw in a system, and it's not the outcome; it's the steps that lead to the outcome
Q: how does this relate to the NIST privacy framework and legal compliance?
A: not a compliance tool. this is for risk modeling.
what you're doing with the NIST framework is looking at vulnerabilities and flaws. this is looking at exploits.
privacy impact assessments are a consequence model. this is for privacy *threat* assessments.
next talk! "Privacy Audits 101", Lauren Reid, currently a consultant, formerly part of Google's Sidewalk Labs, the Toronto cameras-everywhere thing.
context: what do privacy audits mean? people use that word in many different ways. this is about auditing a system, process, or product based on stated goals, using evidence to assess if controls work. NOT an investigation of an outage or a penetration test.
why? maybe following privacy regulations. maybe you're a company subject to an FTC consent decree requiring you to do this.
(the US has no law requiring audits, but GDPR does in the EU, and Canada also has a law about it)
what is an audit? do we have evidence that this thing is doing what it said it would?
in an audit, even if the outcome is correct, it can't pass if there's no evidence
the scope has to be narrowly defined. a process, a system, an app. not the organization as a whole.
the auditor doesn't set the objectives because this wouldn't be independent. the audit checks against the stated objectives.
audits are backward-looking. they don't answer "will this work", only whether there is evidence that it DID work.
example: PWC audit of Facebook in 2017. rare example of a publicly available report.
good example of limitations. FB was already aware of the Cambridge Analytica leak when this was issued.
in this case, PWC is the auditor. as part of the FTC settlement FB had to hire them.
FB's privacy controls are mostly redacted in the public version, but the talk will look at the ones that pertain to 3rd-party devs.
the auditors said FB's controls were sufficiently effective
"based upon the Facebook Privacy Program set forth in management's assertion"
FB wrote a thing, the management assertion, which they provided to PWC as part of the audit process
what if, instead of doing this, they had used an existing privacy framework such as AICPA Trust Services Criteria (the speaker's favorite), the NIST Privacy Framework, or ISO 27701?
so here's a slide comparing what FB said in the assertion and what they would have had to say in these frameworks
very detail-heavy slides, can't possibly transcribe these, sorry!
are privacy audits effective? let's apply it to itself: do we have evidence that this thing was doing what it said it would do?
the audit held that those controls were operating sufficiently (not perfectly). the audit did what it said it would do; audits have a purpose, they can't do everything.
just one question before moving on.
Q: in security, there are specs for how to do these assessments. where are we going wrt standards for privacy audits?
Q elaborated: consent is an abstract concept. is it informed? does putting it in a privacy notice suffice?
A: sometimes we think auditing against the privacy principles will show that those principles were achieved. privacy is subjective. is it appropriate? is it necessary? is consent informed? the answer is not an audit. we are expecting too much from audits.
next talk: Eivind Arvesen, "Privacy Design Flaws".
he was involved in Norway's COVID app. he'll be speaking about common design problems.
his employer is Sector Alarm, a European home safety company
what are privacy flaws? flaws are in the *design* or architecture; bugs are in the *code*.
starting point for the talk: Gary McGraw's work on security flaws
this was a 2015 IEEE publication
motivations for the talk. privacy is an emergent property of the system. these flaws aren't easily found by looking at code.
there's a lack of explicit best practices around privacy, so it's hard for development teams without privacy expertise to do stuff. we need to turn the basics of privacy engineering into common knowledge.
we want to avoid "digitalizing ourselves to death" and keep society sustainable
flaws: false anonymity, data leakage, mistaking data protection for privacy, failing to consider contextual requirements, unclear or changing purposes... many others. interested in collaborating to develop this list.
mistaking data protection for privacy. that is, confusing secure data for being privacy friendly.
example: Smittestopp v1 (a COVID app). aggressive data collection, breaking with regulatory requirements and best practice.
the app's proponents argued that because the parties involved were trustworthy, there was no issue
the app continuously collected location and bluetooth data from every user at all times and fed this to a cloud server. this was well secured, but ...
the dev said "you have to trust the king" (Norway is a constitutional monarchy). "I don't know what to say to that."
in order to avoid these flaws, you want to build up your minimum privacy competence in your tech orgs
another flaw. false anonymity, example: the data in the Netflix Prize data set, early 2000s. researchers were able to deanonymize users.
the data set was movie ratings from subscribers; researchers correlated it with IMDB data.
researchers learned users' political preferences and other sensitive information. to avoid these flaws, use strong anonymization techniques!
another flaw: assumed trust. Cambridge Analytica. the Facebook system relied on assumptions of trust, which threat modeling could have identified.
CA built individual psychological profiles and used FB's graph API to get unusually rich information about users' friends
a limitation that isn't enforced isn't much of one, in practice. follow the principle of least privilege. earn trust or give trust but never assume trust.
flaw: data leakage. Android at one time logged contact tracing app info in system logs, which were readable by privileged apps including certain preinstalled ones
that could have been combined with other locally available information, such as advertising identifiers. no evidence this happened, but researchers found it could have.
need data classification schemes and policies, including logging policies.
we can see there are many classes of generalizable privacy defects. some of these flaws result in bad outcomes that we should be able to foresee.
the basics of privacy engineering are not widely disseminated in the ways software engineering and architecture are.
Privacy by design (Ann Cavoukian, 1995) identifies a lot of this
conclusion: as a field, we need to develop taxonomies, cheat sheets, standards, design patterns, and architectural references. then we can empower and enable non-privacy-experts.
last talk before lunch. "Privacy Shift Left: A Machine-Assisted Threat Modeling Approach" by Kristen Tan. the talk will be pre-recorded but there will be live questions.
respect to anyone who pre-records a talk. it's so much harder than speaking live.
what does it mean to shift privacy left?
traditional software development: requirements engineering -> architecture design -> software dev -> testing -> deployment -> maintenance
the goal is to move privacy and security considerations further "left" along this diagram, further back in the process
security and privacy traditionally fit somewhere between testing and deployment
that's not how it should be, they should be way back in requirements
how do we do that? threat modeling is one way we can introduce these topics earlier
one of the more effective frameworks in security threat modeling is STRIDE. spoofing, tampering, repudiation, information disclosure, denial of service, elevation of privilege.
LINDDUN, previously discussed, does the same for privacy. linkability, identifiability...
so threat modeling helps do this earlier, but what are the challenges?
time constraints. we are asking people to spend their time doing threat modeling.
we can use machine assistance. the computer can't take the place of the architect, but it can look at inputs and come up with a suggested list of threats.
but if we need a tool to help us, how do we pick one?
the speaker's team dug into this. they looked into open source tools that can be applied to any type of system, and a few other criteria
they then compared tools based on how complex their logic is, how much it can be customized, and a few other criteria that sound kinda synonymous with those
one tool they dug into was Threagile (pronounced like "threat agile")
"operator-sided data leakage" is one of the threats this tool can deal with, as defined in OWASP top 10 privacy threats
the tool has flowcharts for each of these.
the flowcharts do things like walking you through creating a list of all the protected components in your system, and identifying all the trust boundaries
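(we don't know Threagile's actual model format, so here's a hypothetical sketch of the kind of machine-checkable rule such a tool can run: declare components, trust boundaries, and data flows, then flag personal data that crosses from the user's boundary to the operator's - roughly the "operator-sided data leakage" pattern mentioned earlier)

```python
# Hypothetical model, not Threagile's format: components, trust boundaries,
# and data flows, plus one machine-checkable rule for operator-sided leakage.
components = {
    "mobile_app":   {"boundary": "user_device"},
    "backend":      {"boundary": "operator_cloud"},
    "analytics_db": {"boundary": "operator_cloud"},
}
flows = [
    {"src": "mobile_app", "dst": "backend",      "data": "location", "personal": True},
    {"src": "backend",    "dst": "analytics_db", "data": "metrics",  "personal": False},
]

def operator_sided_leakage(flows, components):
    """Flag personal data that crosses from the user's trust boundary to the operator's."""
    findings = []
    for f in flows:
        src_b = components[f["src"]]["boundary"]
        dst_b = components[f["dst"]]["boundary"]
        if f["personal"] and src_b == "user_device" and dst_b != "user_device":
            findings.append(f"{f['data']} flows from {f['src']} to {f['dst']}: review purpose & retention")
    return findings

print(operator_sided_leakage(flows, components))
```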
huh. we're failing to understand something about where the flowcharts come from; this is now talking about turning them into code.
there's a lot of discussion of this one in Slack
Q: what about threats that can't be discovered through this kind of process?
A: this is not the only solution we need, it's just one piece
out of time, lunch now!
#pepr22 if you are attending virtually, come hang out in the hallway channel on Slack! let's "eat lunch" together! this informal stuff is one of the best parts of any conference.
after lunch there will be birds-of-a-feather sessions; we'll be running a virtual one for queer people and gender minorities. hope to see you there!
