Troy Hunt
Apr 3, 2021 · 27 tweets · 7 min read
I’ve had a heap of queries about this. I’m looking into it and yes, if it’s legit and suitable for @haveibeenpwned it’ll be searchable there shortly.
On first review, it's an extensive data set with one file per country and a header row as follows:

phone,uid,email,first_name,last_name,gender,date_registered,birthday,location,hometown,relationship_status,education_last_year,work,groups,pages,last_update,creation_time
I actually couldn't find any of my own or my family's data in the Australia file which has 7.3M rows. Having said that, I'm hearing from other trustworthy sources that the data is legit and that seems a reasonable assumption to work on for now.
Email addresses are *very* scarce though; in that 7.3M record Aussie file, there are only 47k occurrences of "@". The Italian file is the largest with nearly 36M records and there are 440k "@" chars in there. On that basis, there will be millions of addresses in the data set.
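As a rough illustration of that estimate, here is a minimal sketch of counting data rows and "@" characters in one country file. It assumes the header row quoted above; the sample rows, the phone numbers, the `make_row` helper and the function names are all made up for the example, and this is not necessarily how the counts in the thread were produced.

```python
import csv
import io

# Header row as quoted in the thread.
HEADER = (
    "phone,uid,email,first_name,last_name,gender,date_registered,birthday,"
    "location,hometown,relationship_status,education_last_year,work,groups,"
    "pages,last_update,creation_time"
)


def make_row(phone, uid, email):
    """Build a synthetic 17-field row (hypothetical helper, fake data only)."""
    return ",".join([phone, uid, email] + [""] * 14)


# Entirely fictional sample standing in for a country file.
SAMPLE = "\n".join([
    HEADER,
    make_row("+61491570156", "1001", ""),
    make_row("+61491570157", "1002", "someone@example.com"),
    make_row("+61491570158", "1003", ""),
])


def count_rows_and_at_signs(csv_text):
    """Return (data_rows, at_sign_count) for CSV text with one header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    rows = 0
    at_signs = 0
    for record in reader:
        rows += 1
        at_signs += sum(field.count("@") for field in record)
    return rows, at_signs


def at_sign_percentage(at_signs, rows):
    """Share of rows that could hold an address, as a percentage."""
    return 100 * at_signs / rows


rows, at_signs = count_rows_and_at_signs(SAMPLE)
print(rows, at_signs)

# Rounded figures from the thread: 47k "@" in 7.3M rows, 440k in 36M rows.
print(round(at_sign_percentage(47_000, 7_300_000), 2))
print(round(at_sign_percentage(440_000, 36_000_000), 2))
```

Counting "@" characters only gives an upper bound on addresses, since a field can contain "@" without being a valid email, which is why a proper extraction and verification step still has to follow.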
So, I'll extract those addresses, do some further verification then load the data. It won't be hundreds of millions of records, I suspect it'll be less than 10M, but obviously that's still a substantial number.
And no, I have no intention of adding phone number search in the foreseeable future. There's a User Voice suggestion for that and a comment from me which boils down to "much higher work and much lower value": haveibeenpwned.uservoice.com/forums/275398-…
I like the comment in this tweet. If we look at the data, email is rare and DoB is rare, so the greatest impact here is the phone numbers. Even though it’s “only” 20% of FB users, the number is obviously substantial, and so is the impact.
Another interesting data point on this: there are only 108 files, each representing a country, so many countries are missing, including Norway, Sweden, Denmark and Iceland, but Finland is in there. It's not clear why.
Here's the complete list of files in the corpus of data I was sent. If anyone has a different set, I'd be interested in hearing about it: gist.github.com/troyhunt/00b9a…
On closer inspection, all the file names are in Italian. So Norway ("norvegia") is there, as is Sweden ("svezia") and Denmark ("danimarca"). Sorry folks, tweeting as I go here.
Now that's clear, I'm finding a lot of friends from various places who've confirmed their exposed data. I haven't seen anything yet to suggest this breach isn't legit.
So what's the impact? For a targeted attack where you know someone's name and country, it's great for mobile phone lookup. Much harder to do en masse as there's no reliable key; I couldn't take a big list of emails and resolve them to phone numbers as email is rare in the data.
But for spam based on phone number alone, it's gold. Not just SMS, there are heaps of services that just require a phone number these days and now there are hundreds of millions of them conveniently categorised by country with nice mail merge fields like name and gender.
Should the FB phone numbers be searchable in @haveibeenpwned? I’m thinking through the pros and cons in terms of the value it adds to impacted people versus the risk presented if it’s used to help resolve numbers to identities (you’d still need the source data to do that).
Factors influencing my consideration of this: only about 1% of the records have email addresses, the phone numbers are easily parsed (they’re in a CSV) and they’re formatted complete with country code. It’s a very clean data set and is 100x more useful than email in this case.
Another general observation on this incident: I'm seeing *extensive* sharing of the data, both the entire corpus of countries and individual country files. Not just in hacking circles, but very broadly on social media too. This data is everywhere already.
Email parsing now done, found 2,529,621 unique addresses across the 108 files. Call it about 0.5% of all records having an email address.
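For readers wondering what "email parsing" across many files can look like, here is a minimal sketch: scan every line for email-shaped tokens, lowercase them and deduplicate in a set. The regular expression is a deliberate simplification and not a full RFC 5322 validator, and `EMAIL_RE`, `extract_unique_emails` and the sample lines are hypothetical names and data invented for the example, not the code used for the thread.

```python
import re

# Simplified pattern for email-shaped tokens (assumption: good enough for a
# first pass; real verification would need to be stricter).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_unique_emails(line_sources):
    """Collect unique, lowercased addresses from several iterables of lines.

    Each element of line_sources stands in for one country file; in practice
    it could be an open file object, which also yields lines.
    """
    unique = set()
    for lines in line_sources:
        for line in lines:
            for match in EMAIL_RE.findall(line):
                unique.add(match.lower())
    return unique


# Fictional stand-ins for two country files.
file_a = [
    "+61491570156,1001,,Alex,Example",
    "+61491570157,1002,Someone@Example.com,Sam,Example",
]
file_b = [
    "+61491570158:1003:someone@example.com:Sam:Example",
    "+61491570159:1004:other@example.org:Kim:Example",
]

emails = extract_unique_emails([file_a, file_b])
print(sorted(emails))
```

Lowercasing before adding to the set is what makes the count "unique" across files where the same address appears with different casing or in a different file format.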
That’s the email addresses loaded, I’m still considering what to do with the phone numbers.
I’m seeing a lot of anecdotal reports that people have received a marked uptick in spam calls and SMSs aligning with this incident. It’s very hard to attribute though; I get a heap of those too and my number isn’t in the data. That said, I expect the data will be abused.
I see a lot of questions like this one from @Blessf11 and it’s always the same answer: the service that suffered the breach should provide the data that is circulating publicly to the rightful owner of it. FB, of all companies, has the resources to do this.
After doing rather a lot of processing and discovering 370M rows in the data set I was given some weeks ago, then wondering why the headlines read 533M... I've been sent a separate set of files. This set aligns with more recent reporting: gist.github.com/troyhunt/9a081…
Which means that now I need to figure out the gaps and if it impacts the email addresses already loaded into @haveibeenpwned. It'll *definitely* impact the phone numbers, if I decide to load them.
Much of the data is same same but different; Albania, for example, begins with the same phone numbers and FB IDs but the original data was CSV whilst this lot is a colon delimited text file with a different field order.
This is really kludgy; 2nd data set has nowhere near the consistency of the 1st with colon delimiters, comma delimiters, headers, no headers, quote encapsulation, no quote encapsulation, different field orders, + before num, no + before num. Hackers have no attention to detail!
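To make the clean-up problem concrete, here is a minimal sketch of normalising records that differ in delimiter, quoting, field order and the leading "+". It assumes the caller already knows each file's delimiter and field order and has skipped any header row; `parse_record`, `normalise_phone` and the sample values are hypothetical and only illustrate the idea.

```python
import csv


def normalise_phone(raw):
    """Keep digits only and prefix a single '+', e.g. '355 69...' -> '+35569...'."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return "+" + digits if digits else ""


def parse_record(line, delimiter, field_order):
    """Parse one line into a dict keyed by field name.

    The csv module handles both quoted and unquoted values. The field order
    cannot be guessed reliably, so it is supplied per file (assumption).
    """
    values = next(csv.reader([line], delimiter=delimiter, quotechar='"'))
    record = dict(zip(field_order, (value.strip() for value in values)))
    record["phone"] = normalise_phone(record.get("phone", ""))
    return record


# The same fictional person in three different layouts.
quoted_csv = parse_record(
    '"+355690000001","1001","Ana"', ",", ["phone", "uid", "first_name"]
)
colon_text = parse_record(
    "355690000001:1001:Ana", ":", ["phone", "uid", "first_name"]
)
reordered = parse_record(
    "1001:355690000001:Ana", ":", ["uid", "phone", "first_name"]
)

print(quoted_csv == colon_text == reordered)
```

One caveat with this approach: a colon-delimited file whose unquoted values themselves contain colons (timestamps, for example) will split into too many fields, so those files would need extra handling.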
The problem with this whole situation is that in a vacuum of information, people speculate. Facebook needs to make a clear statement on the data that’s in broad circulation: when it happened, where it came from and what’s in it. Without that, confusion and speculation reign.
The Facebook phone numbers are now being loaded into @haveibeenpwned and will be searchable later today. Stay tuned, I'll push out a short blog once it's good to go (will be queryable via the existing API too 😎).
Statement from Facebook on this incident: “Scraping data using features meant to help people violates our terms”. Well that fixes that! about.fb.com/news/2021/04/f…

