It's time to kick off an entire session about data deletion at #PEPR21 (It's hard!) with "Deletion Framework: How Facebook Upholds its Commitments Towards Data Deletion" from Benoît Reitz, Facebook

That's right, come one, come all, this is @Facebook's data deletion framework.
We can't expect people to write their own data deletion logic.
* They often don't know how to do it well, and they write bugs
* The deletion logic and data definition may drift apart over time

So instead, people put annotations on their storage.
Annotations example: there are several types of edges. A "deep" edge says that when a post is deleted, its comments should also be deleted.

If it's a "shallow" edge, it should delete the association but not the object (e.g. a post is deleted but not the whole user)
We take the annotations and generate code
* Deletion code written by storage team
* Other deletion logic is entirely around following edges and telling storage to delete
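A minimal sketch of how deep and shallow edge annotations could drive generated deletion logic. The `EdgeKind`, `Edge`, `SCHEMA`, and `deletion_plan` names are hypothetical and only illustrate the idea from the talk; they are not Facebook's actual annotation syntax or generated code.

```python
from dataclasses import dataclass
from enum import Enum


class EdgeKind(Enum):
    DEEP = "deep"        # deleting the source also deletes the target object
    SHALLOW = "shallow"  # deleting the source removes only the association


@dataclass(frozen=True)
class Edge:
    name: str
    target_type: str
    kind: EdgeKind


# Hypothetical schema: each type declares its outgoing edges next to the
# data definition, so deletion logic and data definition stay in step.
SCHEMA = {
    "Post": [
        Edge("comments", "Comment", EdgeKind.DEEP),
        Edge("author", "User", EdgeKind.SHALLOW),
    ],
    "Comment": [Edge("author", "User", EdgeKind.SHALLOW)],
    "User": [],
}


def deletion_plan(object_type: str) -> list[str]:
    """Derive the deletion steps for one object type from its annotations."""
    steps = []
    for edge in SCHEMA[object_type]:
        if edge.kind is EdgeKind.DEEP:
            steps.append(f"delete each {edge.target_type} reached via '{edge.name}'")
        steps.append(f"delete association '{edge.name}'")
    steps.append(f"delete {object_type} object")
    return steps
```

Here the plan for `Post` deletes its comments but only drops the association to the author, never the `User` object itself.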
How do we walk the graph to delete?
* DFS
* we need to be re-entrant (can restart if something crashes so we don't lose data)
* consistent
[ There are many diagrams here showing what happens in the persistent stack and the logs as the deletion actions happen; see the video. ]
[ It's very fancy ]
[ But I also spilled tea on my keyboard ]
Scheduling deletions in the future
* enable ephemerality -- e.g. "stories should go away in N hours" or "when you delete your account it should be hard-deleted in 30 days" or "financial data should go away in N years"
* supports custom TTL logic
* processes 160B events/day
[ Diagram showing how these move through the system, including retries]
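A toy version of TTL-based scheduling with retries, assuming a simple in-memory priority queue. `ScheduledDeletion`, `DeletionScheduler`, and the fixed retry delay are hypothetical; the talk did not describe the real system at this level of detail.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScheduledDeletion:
    due_at: float                                  # only field used for ordering
    object_id: str = field(compare=False)
    attempts: int = field(default=0, compare=False)


class DeletionScheduler:
    def __init__(self, retry_delay: float = 60.0):
        self._queue: list[ScheduledDeletion] = []
        self._retry_delay = retry_delay

    def schedule(self, object_id: str, created_at: float, ttl_seconds: float) -> None:
        """Schedule deletion `ttl_seconds` after the object was created."""
        heapq.heappush(self._queue,
                       ScheduledDeletion(created_at + ttl_seconds, object_id))

    def run_due(self, now: float, delete) -> list[str]:
        """Run every deletion due at `now`; requeue failures for a later retry."""
        done, retries = [], []
        while self._queue and self._queue[0].due_at <= now:
            job = heapq.heappop(self._queue)
            try:
                delete(job.object_id)
                done.append(job.object_id)
            except Exception:
                # Never drop a failed deletion: put it back with a later due time.
                retries.append(ScheduledDeletion(now + self._retry_delay,
                                                 job.object_id, job.attempts + 1))
        for job in retries:
            heapq.heappush(self._queue, job)
        return done
```

Custom TTL logic would slot in where `ttl_seconds` is chosen, for example 24 hours for a story or 30 days for an account.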
Scheduling
* deletions run in two phases: sync (synchronously in the web request) and async (offline)
* sync is used to hide the rest of the subgraph quickly. Async does the full cleanup more slowly
For example, delete the post synchronously. The comments' privacy rule says "you can read the comments if you can read the post". The post is gone, so no one can read the comments.

Can't delete everything synchronously because it's too much work and completion can't be guaranteed. [ retries are a thing! And backups! ]
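The two phases might look like this in miniature, assuming an in-memory store where a comment is readable only if its parent post is. `Store` and its methods are hypothetical names used for illustration.

```python
from collections import deque


class Store:
    def __init__(self):
        self.posts = {}             # post_id -> text
        self.comments = {}          # comment_id -> (post_id, text)
        self.async_queue = deque()  # post_ids whose subgraph still needs cleanup

    def can_read_comment(self, comment_id: str) -> bool:
        # Privacy rule: a comment is readable only if its post is readable.
        entry = self.comments.get(comment_id)
        return entry is not None and entry[0] in self.posts

    def delete_post_sync(self, post_id: str) -> None:
        """Sync phase: remove the root inside the request, which hides the
        whole subgraph, then hand the rest to the async phase."""
        self.posts.pop(post_id, None)
        self.async_queue.append(post_id)

    def run_async_cleanup(self) -> None:
        """Async phase: physically delete everything under each queued root."""
        while self.async_queue:
            post_id = self.async_queue.popleft()
            doomed = [cid for cid, (pid, _) in self.comments.items()
                      if pid == post_id]
            for cid in doomed:
                del self.comments[cid]
```

Between the two phases the comments still exist in storage but are already unreadable, which is the point of doing the root delete synchronously.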
Eventual completion
* deletions encounter infra issues
* also bugs
* might get dropped by dependencies
* or halfway through scheduling
* etc.
So completion of deletions is monitored [I left a question about monitoring of data store directly]
If there's an issue you can get orphaned data, need to clean that up retroactively. How: object re-deletion. Look for edges that come from objects that don't exist any more and try deleting again.
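Object re-deletion can be sketched as a sweep over edges whose source object no longer exists. The set-based representation and the `delete_subgraph` and `sweep_orphans` helpers are hypothetical, and every edge is treated as deep for simplicity.

```python
def delete_subgraph(root: str, objects: set, edges: set) -> None:
    """Delete `root`, everything reachable from it, and the edges walked.
    Safe to call on an object that is already gone."""
    stack = [root]
    while stack:
        node = stack.pop()
        children = {target for source, target in edges if source == node}
        edges.difference_update({(node, target) for target in children})
        objects.discard(node)
        stack.extend(children)


def sweep_orphans(objects: set, edges: set) -> list[str]:
    """Find edges whose source no longer exists and retry those deletions."""
    orphan_sources = sorted({source for source, _ in edges
                             if source not in objects})
    for source in orphan_sources:
        delete_subgraph(source, objects, edges)
    return orphan_sources
```

For instance, if `post1` was deleted but its comment and the comment's like were left behind, the dangling `post1 -> comment1` edge is what lets the sweep find them and finish the job.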
Restoration: we write restoration logs before any delete we issue. Just in case there's a bug somewhere in the deletion system, you can undo the damage.

The restoration logs are graph-indexed, unlike the data store backups, so you can restore just a subgraph. Encrypt the backups with a key, then throw away the key.
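A rough sketch of log-before-delete with per-deletion restore. `RestorationLog` and `logged_delete` are hypothetical, and encryption is left out: `expire` simply drops the entry, standing in for throwing away the key.

```python
import copy


class RestorationLog:
    def __init__(self):
        # deletion_id -> list of (object_id, snapshot), i.e. one subgraph
        self._entries = {}

    def record(self, deletion_id: str, object_id: str, data) -> None:
        snapshot = copy.deepcopy(data)
        self._entries.setdefault(deletion_id, []).append((object_id, snapshot))

    def restore(self, deletion_id: str, store: dict) -> list[str]:
        """Put back only the objects removed by one deletion."""
        restored = []
        for object_id, data in self._entries.get(deletion_id, []):
            store[object_id] = copy.deepcopy(data)
            restored.append(object_id)
        return restored

    def expire(self, deletion_id: str) -> None:
        # Stand-in for discarding the encryption key: the data is now unrecoverable.
        self._entries.pop(deletion_id, None)


def logged_delete(deletion_id: str, object_ids: list, store: dict,
                  log: RestorationLog) -> None:
    for object_id in object_ids:
        if object_id in store:
            log.record(deletion_id, object_id, store[object_id])  # log first
            del store[object_id]                                  # then delete
```

Keying the log by deletion rather than by storage shard is what makes it possible to restore one subgraph without touching anything else.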
Preventing data loss
* static analysis on the deletion graph
* dynamic checks based on deletion constraints
* predictions on the edges' deletion behaviours
Happy path: measure that reliability
Measure how much goes through the gaps
Then measure the reliability of the safety net

Bonus points for an orthogonal way to measure success and a second safety net
Conclusion
* Deletion is a hard problem
* The happy path isn't enough. You need safety nets for the failures.
* We need to make it easier for developers to do the right thing than the wrong thing.
Q: This relies on a graph structure to work. What happens when FB acquires a company whose data isn't a graph?
A: Slowly push them to move towards our system and build the integration so we can ship deletion to them.
In the meantime we have a verification framework for everything which isn't covered by our deletion framework.
Q: How long does this deletion system take to run?
A: As of last week, 60% within one day of the scheduled deletion, four 9s [99.99%] by 14 days (but we have 90 days)
Q: How about data to train ML models? What if it's in a slightly different shape somewhere else?
A: Train them often so you don't have to keep copies of data around, but I'm not the expert on this.
