Even with the horrible news, we are still attending #pepr22. Next up, we'll be livetweeting a very timely talk, "How Developers (Don't) Think about Gender Privacy", by Elijah Bouma-Sims.
The speaker has given some background on gender markers in software. They're now going over an example of a workplace software tool from Germany which chose the user's profile picture based on their answer to a gender question, without warning that it would do so.
The "female" picture was also the "diverse" one, which is questionable at best. Furthermore, it outed people because it broke their expectations about how visible the gender field would be to others.
The speaker's team analyzed discussions and code snippets on Reddit, and interviewed several developers, to explore how people think about gender while writing software.
They identified 11 programming-related subreddits and 11 search terms related to gender. They identified 917 posts with over 11,000 comments, but many of these posts were irrelevant.
They manually sorted through the data to identify interesting parts.
They found very little discussion of how to use gender. When people did ask, the advice they got was inclusive.
The vast majority of posts that involved gender being used in practice, however, were not inclusive. The most common representation of gender in code snippets was binary.
The criteria for people to interview were intentionally broad: Anyone with either a year of professional experience, or a related degree.
Most interviewees were not familiar with the topic area, so the interviewers had to spend a lot of time explaining the questions.
One interviewee had previously worked on a social media platform specifically for trans people. That platform had no direct programmatic representation of gender, by design.
The researchers saw signs of positive developments, but conclude that deliberate standards need to be set.
They conclude (and we agree, as people who've spent years doing advocacy on this topic) that as much as possible, software shouldn't collect gender at all.
They further conclude (and, again, we agree) that when gender is collected, the purpose of the data collection should be made clear. For example, if it is to satisfy a legal requirement, that should be stated.
"If male and female are your only gender options, you're doing it wrong."
Q: Some would say that the best way to handle this is to ask for pronoun choices, but not collect gender explicitly. Can you comment on that?
A: Not asking is the best choice in many scenarios. You should ask for what you actually need. If you need pronouns, ask for them. If you need a profile picture, ask for that, don't try to extrapolate it from something else.
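(Not from the talk, but to make "ask for what you actually need" concrete, here's a minimal sketch of a profile model. The field names are our own illustration, nothing the speaker showed.)

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Profile:
    """Collect only what the product actually uses: no gender field at all,
    pronouns as optional free text, and an avatar chosen by the user rather
    than inferred from anything else."""
    display_name: str
    pronouns: Optional[str] = None    # e.g. "she/her", "they/them", or left blank
    avatar_url: Optional[str] = None  # user-chosen, never derived from gender
```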
Even when gender isn't collected explicitly, it can still be used; e.g. Twitter still targets advertising based on gender (the researcher expresses uncertainty about this, but we personally know it to be true).
Q: What about titles such as Mr., Mrs., etc?
A: We didn't consider it but it's worth looking at.
Q: How do you think culture plays a role? For example, in Japan pronoun collection is not the norm and users may not understand it?
A: Gender is a social construct, how it's expressed is governed by society. I didn't give a one size fits all solution because there isn't one.
Q: When is gender information really required?
A: In the slides, there's a great paper that looks particularly at this issue.
The conference is now on break for a little while, we'll be back.
Okay! #pepr22 is resuming. This next block of talks is about privacy at scale. The first one is "Data Mapping at a Billion Dollar Self-Driving Vehicle Startup", by Marc-Antoine Paré.
The talk is beginning without slides, due to technical issues.
"Why is privacy a challenge for autonomous vehicles"? They're covered with sensors! Cameras, microphones, inside and outside the vehicle. Short video clip of a car driving down the street, painting LIDAR lines on everything it passes.
One of the first challenges they noticed was a lack of visibility into the flow of data through their infrastructure. There's a lot of data, and a lot of ways that it gets copied around.
They identified four key lessons. Manual labeling isn't enough; mapping is context specific (road map vs topographic); aggressively trim scope; take use cases to the finish line.
The road map thing is a metaphor here, not literal. It's a way of saying that the way you map your data depends on the context. Some open source tools for data mapping are not suitable for privacy engineers.
The scope is enormous; it's too much to do all at once. "I see a lot of heads nodding in the audience."
"You've labeled data, so what?" You have to make sure to actually do something with the labels.
They've built automated detectors for sensitive data. They made their data map for privacy engineering only. They focused first on two high-value projects, to learn from them before scaling.
They identified a use-case up front: Identify data that has been abandoned by its owners and is sensitive, and archive it.
Oooh, screenshot of their tool, "Indiana Explore"!
They've got a hierarchical namespace for datasets, and a list of fields for each one.
The tool lets you search and browse all the schemas.
They also let you filter by creation date, which they find useful to keep track of ongoing changes in the data warehouse.
The tool has a chart for verified labels, such as "phone number" (i.e. somebody has verified that the field really contains phone numbers, as it appears to). You can search on those.
This tool is useful in responding to data requests from feature teams, because it lets them figure out what's sensitive and ask for reductions in scope.
They made a chart that is "quite powerful for communicating to senior leadership", showing "sensitive data findings by project" - aggregate counts broken down by data type.
(Very useful that they shared that, in our view. It's often a challenge to get corporate leadership to care.)
Where do verified labels come from? There's a whole data pipeline of scanners that run over both metadata and content. Column names are an input...
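(The talk didn't show code, but to illustrate the column-names-as-input idea, here's a toy sketch of a metadata scanner. The patterns and labels are our own; the real pipeline layers content scanning and human verification on top of this.)

```python
import re

# Toy metadata scanner: column names are one signal feeding the label
# pipeline; content scanning and the human "verified label" step come later.
NAME_PATTERNS = {
    "phone_number": re.compile(r"phone|msisdn|mobile", re.I),
    "email":        re.compile(r"e?mail", re.I),
    "location":     re.compile(r"lat(itude)?|lng|long(itude)?|geo", re.I),
}

def candidate_labels(column_name: str) -> list[str]:
    """Return possible sensitive-data labels for a column based on its name."""
    return [label for label, pattern in NAME_PATTERNS.items()
            if pattern.search(column_name)]

# candidate_labels("rider_phone_num") -> ["phone_number"]
```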
How do you scan data that privacy engineers don't have access to?
One approach: Give them access to everything. De facto this is what used to happen.
Most complex option: In-situ scanners - send a container with the scanner code to the team that does have access.
That's a significant security challenge, if you don't do it well it decreases security rather than increases it.
They have a secure data sampling service, which they call "trowel" because it's archaeology.
Wrapping that operation in a service gives them a bunch of mitigation options pretty quickly.
The last step of the label pipeline is a human in the loop.
They were able to scale this to 100% of their BigQuery projects, about 14 PiB of data.
They studied usage logs for everything that was tagged as sensitive. 55% of that was abandoned by its original owners!
However you vary the definition of "abandoned", that number doesn't change much. "The half life of data is pretty short."
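(Again not their code, but the "abandoned" test they describe is roughly this shape; the 180-day threshold is our own placeholder.)

```python
from datetime import datetime, timedelta, timezone

def is_abandoned(last_read: datetime, last_write: datetime,
                 threshold_days: int = 180) -> bool:
    """A dataset counts as abandoned if nobody has read or written it within
    the threshold window, per its usage logs. Timestamps are assumed to be
    timezone-aware."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=threshold_days)
    return last_read < cutoff and last_write < cutoff
```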
The speaker's computer was low on battery and the audience was stressed by it ;) so now it is plugged in.
It is often difficult to fully delete data. You'd love to, but if it was critical to a business workflow, your job gets much harder...
They had a lot of spin-off benefits for other teams, along the way. They were interested by this. It suggests that well executed data mapping isn't just for privacy or compliance, it's a foundational capability.
The next talk at #pepr22 is "Bringing Content Blocking to the Masses", by Shivan Kaul Sahib and Anton Lazarev.
They're speaking on behalf of their employer, Brave (the browser).
When they say content blocking, they mean what other people call ad blocking or tracker blocking.
It's difficult to maintain filter lists. It's also challenging to deploy content blocking to millions of users.
37% of web users use ad blockers, per a 2016 study.
A lot of people, including the NSA and CISA, have acknowledged that ad blockers are important security tools to prevent data exfiltration.
What does it look like in practice? You go to a website and see an ad, you'd like to not see it, it turns out there was a script that inserted the ad... so you add a filter rule that blocks scripts by that name.
In Brave's developer tools, it shows which scripts the ad blocker blocked.
EasyList and EasyPrivacy are the two most actively maintained filter lists.
All the major ad blockers use these same lists.
The lists are hard to maintain for several reasons. Filter rules become obsolete, for example when a script's name changes.
This isn't necessarily malice, but it can be.
When rules go obsolete, this leads to a false sense of security.
How do they identify adblock evasion? It's an open problem. They try not to look for names, but for the behavior of scripts.
They create signatures of scripts as they execute, and use that to catch obfuscated, renamed, and inlined scripts.
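(We don't know the details of Brave's signature scheme; this is just a sketch of the general idea, fingerprinting what a script *does* rather than what it's called. Everything here is our own simplification.)

```python
import hashlib

def behavior_signature(api_calls: list[str]) -> str:
    """Fingerprint a script by the sequence of web-API calls it made while
    executing (as recorded by an instrumented JS engine), so that renaming,
    inlining, or re-obfuscating the script doesn't change the match."""
    return hashlib.sha256("|".join(api_calls).encode()).hexdigest()

# A filename-based rule breaks as soon as tracker.js is renamed; the
# behavioral signature stays stable because the script still makes the
# same calls in the same order.
KNOWN_TRACKER_SIGS = {
    behavior_signature(["document.cookie", "navigator.userAgent",
                        "XMLHttpRequest.send"]),
}

def looks_like_tracker(observed_calls: list[str]) -> bool:
    return behavior_signature(observed_calls) in KNOWN_TRACKER_SIGS
```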
Next problem: Filter rules can break websites.
They've brought up a code snippet.
The example snippet loads a remote script, then uses an inline call which passes a necessary, benign setup function as a callback.
In this example, blocking the ad script also breaks the entire website.
All these privacy protecting mechanisms need to work out of the box, users shouldn't have to debug it.
They have lots of users who don't have the latest hardware and it isn't viable to just forget about them.
The solution was to create privacy-preserving script replacements. They rewrite scripts!
They shipped this, but it did run into some issues.
It wound up needing an entire gig of scripts... they rolled it out slowly, and on devices with low RAM, loading it would crash.
This work relies on crowdsourced filter data, so how do they support the community?
Ryan Brown is the most important contributor to these lists. They hired him.
They keep a close watch on feedback channels - GitHub, Twitter, Reddit, their own forums.
They have nightly and beta channels, where they ship new features.
They've also filed issues against uBlock Origin and other ad blockers, to get minor compatibility issues fixed.
This helps the ecosystem as a whole, it avoids fragmentation of the tooling.
Their entire ad block engine is open source.
Next talk at #pepr22! "Lyft and the California Consumer Privacy Act". This is pre-recorded.
(Apparently "at scale" means "large companies getting their talking points out". We're not going to transcribe this corporate propaganda that the talk is opening with, we'll wait to see if it says anything real.)
(We stand in solidarity with rideshare drivers, who we know for a fact have an entirely different perspective on the loans and other stuff that these talking points are bragging about.)
They worked on export and deletion for CCPA. (Is this even an engineer talking? The level of enthusiasm in the pre-recorded talk is... very high.)
Okay, *now* it's an engineer talking, they handed off. Still pre-recorded. Wow. Don't let your marketing department write your talks for technical conferences, people.
Lyft has presented at PEPR in previous years and it was much more sincere and detailed than this. What happened?
They can't do one-size-fits-all solutions, because there are a lot of teams at the company that do different things. This is the only thing of substance in the talk so far.
They use a shared responsibility model, federated across each product team's services. The central infrastructure has a state machine to track what the product team pipelines are doing.
Here's an architecture diagram, showing that. The request from the central infrastructure can be to extract data, or to delete it.
Users have a web front end to initiate these requests.
They designed the system to be robust against bugs on the product team's side.
Product teams have metadata annotations.
They have privacy checks in one of their pipelines (it's not clear which one) to make sure the annotations aren't skipped.
They build libraries that product teams can use to make this work easier.
There's a ton of marketing copy on these slides. Other than that one architecture diagram, everything is generalities, no technical details.
(Of course these topics are legally sensitive and corporations are cautious about discussing them, but that's the whole concept of the conference! Everyone else managed to get their employers to understand that.)
(Don't just talk about the high level goals you want to achieve! We all know those goals, we all have the same legal requirements. Talk about how you did it!)
(We strongly suspect this was out of the hands of the engineers; this complaint is directed at whichever people in management made the call.)
Because data may be in the middle of being copied from one place to another, they need to do at-least-once deletion guarantees, not exactly-once.
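(Our reconstruction, not Lyft's code: a sketch of what a central tracker with at-least-once, idempotent deletion across federated product pipelines might look like. The pipeline interface here is hypothetical.)

```python
from enum import Enum, auto
import time

class RequestState(Enum):
    RECEIVED = auto()
    DISPATCHED = auto()   # fanned out to every product team's pipeline
    COMPLETE = auto()

def process_deletion(request_id: str, pipelines: list) -> RequestState:
    """At-least-once semantics: every pipeline is retried until it confirms,
    so each pipeline's delete must be idempotent (data that was mid-copy may
    end up being deleted more than once, and that has to be safe)."""
    pending = {p.name for p in pipelines}
    while pending:
        for p in pipelines:
            # hypothetical interface: delete() returns True once confirmed
            if p.name in pending and p.delete(request_id):
                pending.discard(p.name)
        if pending:
            time.sleep(60)  # back off, then retry the stragglers
    return RequestState.COMPLETE
```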
Now it cuts back to the extremely enthusiastic person, who has a self-congratulatory slide about how they launched it and it was on time.
"Shift left".
Sorry for all that editorializing, but we're here for technical content or academic research, we're not here to repeat marketing crap.
We have spent a lot of time getting in touch with our innermost feelings, and one of our innermost feelings is that we have no interest in being a mouthpiece for corporate bullshit.
Q: Does this deletion system apply to both driver and passenger data?
A: We have a great many constructs around users because we have so many lines of business. This is a good question.
They did end up using the same infrastructure for all this.
Q: How do you deal with data that wasn't structured to be deleted?
A: We not only have deletion on request, we also have finite retention, so there are other ways to delete.
(That was a good question, it would have gotten at some of the real challenges here. A shame to get such a short answer.)
Okay! Now for the last talk before lunch at #pepr22, "Automating Product Deprecation" by Will Shackleton on behalf of Facebook. Pre-recorded.
In 2015 they launched a photo sharing app, Moments. In 2019 they shut it down. What did that entail?
Notify users, let them download their data, then delete it.
(Will *is* an engineer, by the way, and is talking like a real person.)
User data is created ... by users. There are many cases when users might expect it to be deleted. When they delete it themselves, when they remove their accounts... and also when the product is turned off.
Product deprecation is difficult, it happens at the end! You may not have the developers who built things four years ago on hand to help with it.
Without privacy guidance, product teams might be inclined to leave data in place in case they ever turn it back on.
It's also possible, during deprecation, to accidentally miss a table and forget to delete it, or to delete something that's still being used.
Deleting something that's being used in prod could break the site.
They have internal process and tools for how to do this.
Removing code and data is an engineering problem with an engineering solution, which will be the focus today.
In many cases they can do it automatically. They have a system that cleans up unused, "dead" data for many of their data stores.
For each type of data, it determines whether the data is in use and removes it if not. This happens without engineer intervention or authorization, but does allow engineers to interact with it.
They measure both static usage (in code) and runtime usage (traffic).
When it detects that something is unused, it notifies the engineering team via an internal ticket.
It doesn't wait for acknowledgement. After a month it blocks all reads and writes, even if the engineering team has ignored it.
After a second month, it deletes everything. Then it verifies that it was done and keeps an audit log.
Engineers can also initiate it manually, and skip the waiting periods. If they think the removal is incorrect, they can escalate.
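(Paraphrasing the lifecycle as described, not Facebook's actual code; a sketch of the states the dead-data cleanup moves through.)

```python
from enum import Enum, auto

class DeadDataState(Enum):
    """Lifecycle of the automated cleanup described above."""
    IN_USE = auto()
    UNUSED_DETECTED = auto()       # no static references, no runtime traffic
    TEAM_NOTIFIED = auto()         # internal ticket filed; no ack required
    READS_WRITES_BLOCKED = auto()  # after ~1 month, even if the ticket was ignored
    DELETED = auto()               # after a second month
    VERIFIED_AND_AUDITED = auto()  # deletion verified, audit log entry kept
```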
They have a second system that cleans up dead code - otherwise dead code could keep data alive.
The static analysis works in multiple languages. There are some unique challenges in dynamic languages, such as Python, where the type checker might not be able to know what the code is doing.
The analysis also records a human-readable explanation of how it knows the code is dead.
It then submits a code change for engineers to review, with this explanation.
They have committed hundreds of thousands of changes this way. When false positives happen, they triage these and improve the analysis.
In cases where the conclusion is highly confident, they skip human review and commit directly.
Product features are interconnected - for example, Moments photos could be linked to Facebook profiles. So even with this stuff, there's challenges automation can't solve.
Problems: Dependencies across products; 100% coverage is hard; stuff gets hit by researchers or broken apps even when it's not in real use; there are false negatives.
They might have unrelated code with the same name.
They've built a deprecation tool that helps engineers understand these steps and leverage the automation.
The tool gathers data from the cleanup systems.
It helps engineers understand what's blocking automation from proceeding, and how to get it moving.
Engineers can assess things like whether requests to an endpoint represent real user activity, and instruct the automation to proceed.
Q&A time, but we missed the question. A: we found the usage thing was so accurate that when engineers thought it was wrong, they were usually mistaken.
Q: when you use tooling rather than automated scanning, what are the incentives for developers to do that work?
A: it's something we're working on at the moment. we've found people tend to switch to another team... there's a carrot and stick philosophy: they don't want to force it organizationally, they want to help people understand the impact. it also saves compute capacity.
Q: any instances where it went wrong?
A: they are extremely careful with dead data. the period where it's just blocking reads/writes has helped them catch bugs.
it's never resulted in inappropriate deletion that they couldn't recover from.
okay! now for a lunch break until 1:30 Pacific. we'll be back!
Okay! Back to #pepr22, with a block of talks on designing for privacy. This next one is "Identifying Personal Data by Fusing Elasticsearch with Neural Networks", by Rakshit Wadhwa and Ryan Turner, from Twitter. Another pre-recorded talk.
Rakshit's a software engineer and Ryan's an ML researcher.
They have a microservice architecture, which means data is spread out across many teams and services.
They tag their data columns with annotations from a standard taxonomy they've designed.
Data storage systems can be heterogeneous.
They want to understand what data they have, how sensitive it is, and what it's used for. They also want to optimize storage and improve security.
It would be ideal to use a taxonomy right in the schema, but lots of their data sets already exist...
Instead, they add optional annotations, as an ongoing project, to the existing schemas.
They built a probabilistic model which recommends annotations.
They have some example data columns. One is named "id" and the system suggests "UserId" as an annotation.
Annotation recommendations speed up development, and help data quality.
Their taxonomy has over 500 "personal data types". They have over two million fields in need of annotating.
The recommendation system creates suggested annotations; the final annotation comes from humans.
They have linters and other integrations to make this easier to use.
Now Ryan's talking about the ML parts of this.
The training corpus is manual annotations. They have about 70,000 of these.
They picked only the ones they trust to be accurate.
They augment the descriptions with metadata, which the model sees as a feature vector of unstructured text.
Some of these fields are drawn from a closed set, the taxonomy, while others are freeform.
They wanted to leverage existing systems, so they chose to reduce it to a full-text search problem.
Elasticsearch is a widely known system for retrieving documents.
Every annotation needs to become a "document", and then they do queries... so they concatenate training examples into documents.
Then they needed to train a model that converts these confidence scores into a classification.
The scores that Elasticsearch outputs are uncalibrated, which is why an extra step is needed to turn them into probabilities.
To train the calibration model, they held out 20% of the training corpus.
Otherwise they'd be recycling the same data used to build the index, which would over-fit.
Because they do multiple Elasticsearch queries, the calibration model also has to combine multiple scores.
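(A rough sketch of the two-stage idea as we understood it: full-text retrieval gives raw scores, then a calibration model trained on the held-out 20% turns them into probabilities. The logistic-regression choice and all the names here are our assumptions, not Twitter's implementation.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibrator(holdout_scores: np.ndarray, holdout_labels: np.ndarray):
    """holdout_scores: (n_examples, n_queries) raw Elasticsearch relevance
    scores for a candidate annotation; holdout_labels: 1 if that candidate
    matched the human-chosen annotation. Trained on the 20% held out from
    the index so the calibration isn't fit on recycled data."""
    model = LogisticRegression()
    model.fit(holdout_scores, holdout_labels)
    return model

def recommend(model, candidate_scores: np.ndarray, top_k: int = 10):
    """Combine the per-query scores for each candidate annotation into a
    calibrated probability and return the top-k suggestions for a human."""
    probs = model.predict_proba(candidate_scores)[:, 1]
    order = np.argsort(probs)[::-1][:top_k]
    return order, probs[order]
```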
They did some statistics comparing different types of model.
They looked at how often the true answer is in the top 10 recommendations. They also looked at how accurate the confidence score is.
They found that when they deployed it internally, 73% of their data sets wound up using the system's recommendations, at least in part.
Although some engineers reported back that they didn't feel the need for recommendations, the majority did.
Q: We've had other talks about regular expressions for this task. How does this approach compare?
A: The goal here was to minimize maintenance cost over time, especially as they keep adding new annotations.
Q: How do you decide the appropriate granularity for annotations?
A: They weren't just looking at privacy, also at data discovery. They are working on a hierarchical model.
Q: It seems you're training based on annotations within the schema. Is it useful to also use the field contents as training data?
A: It would be, but we didn't. There are complications; the system then needs to have access to the data...
Q: How do you measure the quality of the ground truth data?
A: We did some manual checking to validate. Eventually the users confirm all the annotations.
Next talk at #pepr22! "Differentially Private Algorithms for 2020 Decennial Census Detailed DHC Race and Ethnicity", by Samuel Haney and Rachel Marks.
This has been a topic with a great deal of public attention; we're looking forward to the talk.
Rachel is part of the U.S. Census Bureau, and Samuel is part of Tumult Labs which has been advising them.
They're trying to balance confidentiality requirements against the things people need the data for.
They are planning to provide more detail than ever before, but at fewer geographical levels of granularity.
This will provide data on specific ethnic groups. They are also looking at "sex by age" (oh, do we have thoughts as a trans person).
They propose to do this at the nation, state, and county levels, and for geographic areas defined by First Nations (our term not hers, it was an acronym we didn't catch).
They've had public feedback that population counts, sex by age, and race and ethnic data are important, so they're prioritizing these types of data.
They are continuing to incorporate feedback. There's a roadmap slide which says they will have a prototype demonstration data product, based on 2010 census data, at some point in 2023.
Now Sam is speaking about the differential privacy algorithm which they developed for this.
Disclaimer: The census hasn't officially chosen to use this algorithm yet, the numbers here are only for illustration.
Privacy-utility negotiation takes time and effort.
They started by figuring out their fitness for use requirements. They designed a rough algorithm, then identified parameters that would need tuning.
They went back and forth with stakeholders and iterated on all this.
Here's some example data! "North Carolina, Egyptian Alone", "Orange County, Japanese Alone or in Combo", "California, Aleut Alone or in Combo". One table per population group, geography crossed with detailed race or ethnicity.
In each of these tables they have "age by sex" breakouts, among others.
The most basic algorithm would add discrete Laplace noise to each statistic, scaled by sensitivity, defined as (max geographies per record) * (max ethnic groups per record) * (max stats each record contributes to).
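(A toy illustration of that base mechanism, not the Census Bureau's code; the discrete Laplace sampler below uses the standard difference-of-geometrics construction.)

```python
import numpy as np

rng = np.random.default_rng()

def discrete_laplace(scale: float) -> int:
    """Sample two-sided geometric ("discrete Laplace") noise: the difference
    of two i.i.d. geometric variables has exactly this distribution."""
    p = 1.0 - np.exp(-1.0 / scale)
    return int(rng.geometric(p) - rng.geometric(p))

def noisy_statistic(true_count: int, epsilon: float,
                    max_geos: int, max_groups: int, max_stats: int) -> int:
    """Noise scaled to the per-record sensitivity described above: one
    respondent can contribute to at most geos x groups x stats cells."""
    sensitivity = max_geos * max_groups * max_stats
    return true_count + discrete_laplace(scale=sensitivity / epsilon)
```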
First optimization to this: Adjust the privacy budget separately for separate race/ethnicity and geography levels.
This by itself doesn't reduce error, but allows them to tune more. We care about *relative* error, so we can give more privacy budget to groups we expect to be smaller, and less to groups we expect to be larger.
With stakeholder input, they can also identify groups that need more privacy budget, such as the nation and state counts.
Second optimization: Adaptively choose the stats level, using a fraction of the privacy budget. So they define four thresholds - total, sex*age (4 buckets), sex*age (9 buckets), sex*age (23 buckets).
They spend some of the privacy loss budget to get a noisy size of the group (sample group: Europeans in California). They use the noisy size to choose which bucket to put the group in.
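(Continuing the toy version: the adaptive step might look roughly like this. The thresholds are made up, and continuous Laplace noise stands in for the discrete mechanism to keep the sketch self-contained.)

```python
import numpy as np

rng = np.random.default_rng()

# Illustrative cutoffs only; the real thresholds are a tuned design parameter.
LEVELS = [(50, "total only"),
          (500, "sex x age, 4 buckets"),
          (5000, "sex x age, 9 buckets")]

def choose_detail_level(true_group_size: int, eps_for_choice: float) -> str:
    """Spend a slice of the privacy budget on a noisy group size, then use it
    to decide how much sex-by-age detail to publish for that group."""
    noisy_size = true_group_size + rng.laplace(scale=1.0 / eps_for_choice)
    for cutoff, level in LEVELS:
        if noisy_size < cutoff:
            return level
    return "sex x age, 23 buckets"
```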
Third optimization: Use the discrete Gaussian mechanism and zCDP privacy accounting. They will give a brief overview but it's very detail heavy.
Gaussian is similar to Laplace but with noise from a different distribution. This does not satisfy pure differential privacy for any epsilon, but it does satisfy "zero-concentrated differential privacy".
This is an alternate definition of differential privacy which performs well when we're composing many queries.
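(For readers who want the definitions, these are the standard ones from the DP literature, not from the slides: ρ-zCDP bounds the Rényi divergence between outputs on neighboring datasets, and the Gaussian mechanism's ρ depends on sensitivity and noise scale.)

```latex
\forall \alpha \in (1, \infty):\quad
  D_\alpha\!\bigl(M(D) \,\|\, M(D')\bigr) \le \rho\,\alpha
  \qquad \text{($\rho$-zCDP)}

\text{Gaussian mechanism with sensitivity } \Delta \text{ and noise std.\ } \sigma:
  \qquad \rho = \frac{\Delta^2}{2\sigma^2}
```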
If we fix the margin of error, the privacy loss under zCDP with discrete Gaussian is moderately better compared to pure DP with Laplacian.
They chose a very small delta, so they think this is comparable.
Q: Did you need to worry about the consistency of stats across levels?

A: We haven't but there's been ongoing discussion about whether we should. The algorithm's current version doesn't but the census may post-process to produce it.
Q: Why discrete statistics instead of releasing confidence intervals? Did the consumer not want that?
A: Since it's a full enumeration we try not to release confidence intervals. We want the data to be as accurate as possible, and our data users aren't familiar with that.
Q: You say demand for the data was high. How is demand measured?

A: This is the only place people can get data on many of these groups. We can see from previous censuses how many people access it. We've also asked via the Federal Register.
People told us they really need the population counts for this disaggregated data. It's not enough to know that in LA County there are a certain number of Asian people, they need groups within that.
Q: Did you say Congress doesn't understand confidence intervals? Then how were they convinced to use differential privacy?

A: I can't answer that! Maybe Rachel?
The Census Bureau has an obligation under Title 13 to protect people's confidentiality. We've been learning about differential privacy and are moving to it to make sure we are protecting the data we collect from respondents. The law requires this.
Last presentation of the block at #pepr22 is by Dr. Rebecca Bilbro, "Data Structures for Data Privacy: Lessons Learned in Production".
A live, remote talk.
(re: the previous talk, a friend shared some references via the Slack that we're pasting: desfontain.es/privacy/gaussi… )
To contextualize what Dr. Bilbro's research team is doing, she's providing some background.
This is for a specific client project; the code is open source and linked from the slides.
The "travel rule". This is a rule originally published in 1996 by the Financial Action Task Force, an inter-governmental body fighting money laundering and terrorism.
In response to cryptocurrency, in 2019 the FATF created new requirements for record keeping about the participants in those transactions.
There have been several information transfer protocols developed in response to this shift.
The solution they built is called TRISA. It sits at the intersection of financial crime regulation, distributed ledger technology, and ... user privacy.
"In a lot of ways, those three are mutually exclusive."
The protocol is a peer to peer network, with the peers not being individuals, but "virtual asset service providers" (VASPs). We think this roughly means Coinbase and its ilk.
These requirements apply to monetary transactions above some financial threshold.
The unique thing about TRISA is that it's built on top of gRPC and protocol buffers (Google's RPC technology).
In addition to being more compact than JSON, "there is no way to parse out the fields that are part of a protocol buffer" without the schema. (We have background in this and would mildly disagree, but that's fine.)
All the connections are protected via mutual TLS. That is, both the sender and the recipient have certificates.
To do this, you need public key infrastructure.
There's a global TRISA directory service which acts as a certificate authority and list of endpoints and public keys.
Both institutions, sending and receiving, look the other up in that service before proceeding.
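(A sketch of the mutual-TLS requirement using Python's standard library, just to show both sides presenting certificates; TRISA itself runs over gRPC, and the cert paths and the directory lookup that would precede this are placeholders.)

```python
import socket
import ssl

def connect_to_counterparty(host: str, port: int,
                            my_cert: str, my_key: str,
                            trisa_ca: str) -> ssl.SSLSocket:
    """Open a mutually authenticated TLS connection: we verify the other
    VASP's certificate against the TRISA CA, and present our own certificate
    so they can verify us the same way."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=trisa_ca)
    ctx.load_cert_chain(certfile=my_cert, keyfile=my_key)  # our identity
    raw = socket.create_connection((host, port))
    return ctx.wrap_socket(raw, server_hostname=host)
```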
Within the TLS, they transfer the "secure envelope", which is intended to secure "public individual data" such as legal names, government ID numbers, home addresses, date of birth, account numbers... It's a lot of potentially harmful information.
They're using IVMS 101 to describe these identities - that's an international standard to make the names match up and stuff.
There's a diagram here showing which parts of it are encrypted with which keys.
Discussion on the Slack agrees with our caveats that protocol buffers are not hard to reverse engineer.
Also, people are noting that the schema may not even be secret here, since this is an open protocol...
That doesn't take away from the rest of this architecture, it's just something to note.
Lessons learned: Adoption of the travel rule has been uneven across jurisdictions, which makes it hard to implement.
Canada, Germany, Singapore, Taiwan, the Philippines, the Czech Republic, and the United States have moved the fastest on making the travel rule law.
The local laws, however, may differ in details. Different monetary thresholds, different minimum criteria about the information to include.
The speaker notes that "There is weird tension in the web3 community between decentralization and centralization."
TRISA is peer to peer *but* the certificate authority and directory service are centralized.
There are competing protocols, but they don't have a way to verify people's claims about who they are...
The directory service doesn't consist of a lot of data, but it must be geographically distributed to avoid unfairly privileging VASPs located near to it.
The travel rule may also conflict with other rules, such as GDPR.
Not everyone agrees on import and export of cryptography, either...
They're happy about mTLS and gRPC, but they've had to help client-side developers work with it since it's not widely used yet.
Overall: Be flexible.
Q: How does the system prevent the encrypted messages being cracked later? The stability of the data in the containers means it could be brute forced years from now...
A: That's the more complicated version of how do we trust the VASPs are storing data appropriately. We can't. We can just educate people about it.
Q: Given you can't stop the data from being decrypted, this isn't secure under (couldn't hear) assumptions. How does that match up to the trust model?
A: This is going to sound strange, but cryptography isn't very well understood in the crypto community.

Huge applause and laughter from the audience.
(She's way more generous towards that community than we are, just to say that.)
The conference is now on break.
Okay! The conference is back from break. Next up is a panel, "Fueling the Emerging Privacy Tech Industry", with Lourdes M. Turrecha from The Rise of Privacy Tech, and Nishant Bhajaria, from Uber. #pepr22
Lourdes is talking about a cross-functional working group she was part of putting together.
How to automate and scale privacy practices?
Unless you think about privacy as something that benefits customers, it's hard to get institutional buy-in (in the for-profit world at least...)
How do we break down the silos and bridge the gaps between privacy engineers, lawyers, founders, investors?
The Rise of Privacy Tech wrote a white paper last year, theriseofprivacytech.com aimed at creating a space to bridge the gap.
People have historically talked past each other about this stuff.
TROPT is an industry hub, a place to make cross-functional connections.
They define privacy tech as either "technological solutions to privacy problems" or as companies that build those solutions.
They accounted for business model. Most startups they see are B2B but they also see B2B2C and B2C.
They look at the data lifecycle, the development lifecycle, and adjacent industries.
They categorize products. For example, within the data lifecycle, they have several subcategories.
They also have subcategories under development lifecycle.
There's a lot to be done to update this and make it more mature.
The challenge begins even before collecting data. If you are a founder or VC, you look through the lens of growth and quick engagement.
When people say "shift left" for security, they mean catch threats early. With privacy, "shift left" means even earlier. Security implies an adversarial mindset, but for privacy you have to look at the data itself.
You need this tooling and training in place well before data collection.
With regard to training stakeholders and execs, they send spoof emails...

(Don't do this!!! This is a terrible practice!!!! That's our personal opinion.)
Apparently the benefit they see for the spoof emails is to start conversations and build awareness. Great, but it's also hostile to workers and often indistinguishable from real emails, so there's nothing actionable about it.
Sorry, that criticism is ours, not the panelists'
The panelist continues: Privacy engineers who have expertise in all of these different areas can have a lot of influence on a company because they can speak to more people than other types of software engineer can.
Nobody would buy a car without seat belts and airbags. Similarly, understanding privacy tech is both the right thing to do and something valuable.
(We're personally somewhat allergic to framing our work in terms of value to capitalism, but we acknowledge that that value exists.)
Make a compliance-based argument, but make it your second argument not your first. People will tune it out if it's the only argument.
Compliance is a stick, not a carrot. "Let's just face it", laws will always lag behind tech.
(We would personally challenge that, we believe that policymakers need to be futurists, and start skating to where the puck will be, not where it was.)
(We very much think that's possible. We do understand why people from an industry perspective don't expect it.)
How do you build the right solution, how do you explain that? Nishant wrote a book to answer that.
(The later in the day it gets the more of our opinions you get to hear, apparently :))
You don't have to be a privacy engineer to do the training, you can be somebody who helps, adjacent to that.
Be frank about what's promising and what's ineffective, when evaluating products.
They've also started doing angel investing in some of these startups.
Don't spend twenty minutes explaining GDPR or CCPA in a pitch meeting, get to the value proposition quickly.
"would this be a world that we want to raise our kids in?" behind all the dashboards and alerts is a human being, and they don't know you exist. remember them.
lots of discussion on Slack about values
now #pepr22 moves on to the final block of three talks, on the subject of privacy case studies.
next talk at #pepr22 is Tatiana Ringenberg and Lorraine Kisselburgh, "A Way Forward: What We Know (or Not) about CSAM & Privacy"
these are academics, researching the social implications of emerging technologies
[CW discussion of child sexual abuse for the remainder of this talk]
Over the past year, Apple has introduced three forms of protections for children.
First: parental controls which give warnings for images with nudity.

Second: guidance for Siri, Spotlight, and Safari search.
Third, and most controversial: the NeuralHash function which scanned content on users' devices for known CSAM.
this algorithm was delayed after public criticism, and has been removed from their site
researchers expressed concern about NeuralHash from political, technical, and social perspectives. it sets a precedent for law enforcement and runs counter to Apple's privacy stance; ...
it creates the potential for adversarial collisions. it relies on an external database.
it focuses on end users rather than on the criminals producing this material. it has a negative impact on victims.
what can we do to address these implications? there's significant focus on identifying and prosecuting the perpetrators of these crimes, but this technology triggers privacy concerns for survivors
the researchers call for design practices that consider privacy ethics, and include stakeholders such as survivors
the ACM code of ethics can be leveraged to analyze this risk
the code of ethics includes two overarching principles. "avoid harm", which mentions "well-intended actions" that "may lead to harm" and that those responsible are obliged to undo or mitigate it.
... and "respect privacy", which states that computing professionals should only use personal information in ways that do not violate the rights of individuals.
in 2018 the ACM US Technology Policy Committee published ten principles
fairness, accountability, transparency etc. these have been endorsed by international governments.
transparency: give clear information about how personal data is collected, to whom it may be disclosed, etc
in CSAM, such disclosures may occur in an environment where sharing the content itself is harmful - for example, teens in abusive environments could face harm as a result of the disclosure
individual control: consent, limited collection, etc.

in CSAM they proposed to notify parents, which creates violations of privacy
data security: protect personal data against loss and misuse.

CSAM: illustrates how authoritarian regimes can take advantage of the mere act of creating and storing these databases, depending on who is given authorized access
accountability and risk management: independent audits, assess privacy risks.

CSAM: oversight must be part of the design process, impact assessments are important tools as long as they include the expertise and participation of survivors and victim advocates.
recommendations: balance privacy of survivors with surveillance concerns. create appropriate oversight practices which include survivors, advocates, social scientists, etc as stakeholders. recenter individual autonomy.
move away from an enforcement model to a risk-based model.
Q: how can companies implement participatory design?
A: companies are approached by law enforcement, who seek to identify specific acts, with a very narrow focus, and prosecute those people.
participatory design means reaching out not only to enforcement, but to affected communities. enforcement is not victim advocacy, and is not focused on minimization of harm or on prevention.
if you work only with law enforcement, you can end up with a system which does not mitigate risks or harms.
Q: how to approach the dangers to children in a safe way?

A: include scholars, sociologists, people with expertise in child abuse. there is a large community around this beyond just law enforcement.
do holistic risk analysis. not just "what will happen if the system works" but also "what will happen if somebody breaks into the system", "what happens if somebody abuses the system" (theocratic regimes), etc.
lots of discussion on Slack!
[end of CW, we're moving on to the next talk now]
next talk at #pepr22: "Privacy Firefighting: Incident Management Lessons from (Literal) Fires", Katie Hufker from Facebook
Katie, in addition to being a software engineer, is also a volunteer firefighter
What is a privacy incident? Something is broken, we are violating either an internal policy or an external requirement
We treat vulnerabilities the same as active incidents, since we don't always know whether they're being exploited.
In a fire, even young children are taught to dial 911 (in the US). They reach a dispatcher who gathers information and follows a pre-plan to send out resources.
They send tones over the radio to do dispatch, loud enough to wake people who are sleeping
in privacy... external and internal need ways to flag issues. then there need to be calibrated people who can triage an incident and pull in help. they have lists of current oncalls, and they have robocall tooling
they prefer false positives over false negatives
they had a 911 call about a bird stuck on a telephone pole - the bird flew away before responders arrived
once you have an incident, figure out the extent of it
in a fire, when you get on scene, do a "sizeup" to assess what you have. single-story building, how much smoke, that sort of thing
also look for other risks such as downed power lines and nearby buildings
this is all communicated over the radio so incoming units know what's going on
in privacy, make sure the initial report makes it clear what's going on. do a virtual walkaround to assess the impact
consider turning off the feature until you can mitigate the issue
make sure to post on a centralized incident thread to keep people up to date. on large incidents, consider splitting into workstreams.
in fire service, every twenty minutes on scene they do status checks to make sure everyone's alive. these are mandatory - if you don't respond they go looking for you
use plain language, there's no point if not everybody understands it
in privacy, make sure to post regular updates. avoid abbreviations.
if you are too busy to do this, pull in more people.
for larger incidents you may need to work with non-eng teams, such as comms
after the incident is over, you need to learn from it. the fire service could do better here: they file reports, but there isn't much formal review.
privacy incidents also involve reports. look over it for a few key things: detection, escalation, remediation, prevention.
in summary: have a plan beforehand. communicate during the incident. afterward, reflect.
the speaker was wearing a fire helmet throughout the talk, the prop really cheered people up
Q: what if the incident is emblematic of a fundamental gap? how do you update building codes?

A: that's where I really appreciate getting everyone in a room and talking it through. you won't be able to fix it overnight.
final talk of #pepr22!!! "Privacy and Respectful Discourse in AI Chatbots", by Jayati Dev
she's doing research on privacy-centered design for internet of things devices
what do we mean by chat bots? bots which people engage with socially. we've seen a 35% hike in usage during the pandemic, since people were socially isolated
this can include voice assistants
an org called Delta was breached, exposing financial information about their customers
bots can apply manipulative tactics to encourage violent behavior, especially among young users
they focused on data types. did users share sensitive information about themselves?
they sampled 2019 Cleverbot data, reading 37 chat logs. users used words and short phrases more than complete sentences.
they used GDPR data categories. people shared political affiliation, belief, religion, sexual orientation, age, gender, location, health data, ...
they didn't find anything around racial or ethnic origin, or trade union membership
they also didn't find biometric data
in the analysis of interactions, people may be joking, so it's hard to analyze the psychology; they can't give findings about emotion in the data
they saw instances of fostering manipulative relationships, polarizing discussions and assertions, comfort-driven sharing....
users tested the bot's boundaries. for example, a user asked if their conversation was being monitored by the FBI. (as if the bot knows...)
users also tried to get meaning by repeating themselves
there is a need to effectively flag sensitive data and interact with it in appropriate ways. for this "we need a repository of mental health markers and abuse indicators" (we're personally against this, unintended consequences, but we see the point)
[TW abuse and suicide] one user spoke about a partner being violent towards them.

(we're not quoting it here because our Twitter account would be suspended...)
bots need to be more transparent about how user input is handled, rather than providing confusing responses
for third party sharing there should be ways to opt out
the bot does state up front that users should not provide personal information, but this puts the onus on the user
conclusion: users disclose highly sensitive information to chat bots, and this should be considered in the design.
they want to do a systematic analysis of existing chatbots against these recommendations
they also want to look at harms that affect some demographics more than others
Q: what's the business model of these chat bots?
A: Cleverbot doesn't have a business purpose, it's "a way to talk to a friend, even if that's a bot"
Q: I'd love to have a test suite for chat bots for these scenarios. Should there be a certification or ISO standard?
A: privacy by design can help, as a starting point.
Q: do users see a privacy notice or know the implications of sensitive data sharing?

A: yes, they see a notice up front
there is more discussion in Slack
Q: how do we make sure detection tools themselves aren't misused within the company, for example to find mentally vulnerable people?

A: yeah, lots of discussion. the tools are not very efficient because the company is not transparent.
and that's it for all the talks!!! now Divya Sharma and Blase Ur will give #pepr22 closing remarks!
thanks to Lea Kissner and Laurie for starting the conference
thanks to volunteers for keeping track of all the questions people asked via Slack
thanks to Usenix staff for audiovisual stuff and logistics, and ice cream!
thanks to sponsors (listed on the website if you want them)
very active Slack discussions, been great to see
the Slack will remain up and everyone is still welcome there
Divya: thanks for all the great energy and excitement. this was a great first hybrid conference experience!
we'll send out a call for volunteers for next year
"As you see from today's news, our work in privacy and respect matters now more than ever."
Lea Kissner: it's been wonderful!
thanks for coming, this event is based on sharing what we're doing but also discussing it and learning from each other.
thank the co-chairs, they did a huge amount of work
this is the first time the program committee wasn't just Lea and Laurie, so thanks to them!
(and... that's it! see you all next year!!!!!!!!)
