Troy Hunt
Apr 3, 2021 · 27 tweets · 7 min read
I’ve had a heap of queries about this. I’m looking into it and yes, if it’s legit and suitable for @haveibeenpwned it’ll be searchable there shortly.
On first review, it's an extensive data set with one file per country and a header row as follows:

phone,uid,email,first_name,last_name,gender,date_registered,birthday,location,hometown,relationship_status,education_last_year,work,groups,pages,last_update,creation_time
I actually couldn't find any of my own or my family's data in the Australia file which has 7.3M rows. Having said that, I'm hearing from other trustworthy sources that the data is legit and that seems a reasonable assumption to work on for now.
Email addresses are *very* scarce though; in that 7.3M record Aussie file, there are only 47k occurrences of "@". The Italian file is the largest with nearly 36M records and there are 440k "@" chars in there. On that basis, there will be millions of addresses in the data set.
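As a rough illustration of that "@" count, here is a minimal Python sketch. It assumes the files are plain text and treats the number of "@" characters as an upper-bound proxy for the number of addresses. The function name `count_char` and the sample rows are made up for illustration; the ratios at the end use only the figures quoted above.

```python
import io

def count_char(stream, char="@", chunk_size=1 << 20):
    """Count occurrences of `char` in a text stream, reading it in chunks."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += chunk.count(char)
    return total

# Made-up sample rows (not real data): two of the three contain an address.
sample = io.StringIO(
    "+61400000001,1,a@example.com,Ann\n"
    "+61400000002,2,,Bob\n"
    "+61400000003,3,c@example.com,Cat\n"
)
at_signs = count_char(sample)

# Ratios implied by the figures quoted in the thread.
aus_ratio = 47_000 / 7_300_000     # roughly 0.64% of the Australian rows
ita_ratio = 440_000 / 36_000_000   # roughly 1.2% of the Italian rows
```

For a real file, the same function could be called on a handle from `open(path, encoding="utf-8", errors="replace")`; the encoding is an assumption, as the thread doesn't state one.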
So, I'll extract those addresses, do some further verification then load the data. It won't be hundreds of millions of records, I suspect it'll be less than 10M, but obviously that's still a substantial number.
And no, I have no intention of adding phone number search in the foreseeable future. There's a User Voice suggestion for that and a comment from me which boils down to "much higher work and much lower value": haveibeenpwned.uservoice.com/forums/275398-…
I like the comment in this tweet. If we look at the data, email is rare and DoB is rare, so the greatest impact here is the phone numbers. Even though it’s “only” 20% of FB users, the number is obviously substantial, and thus so is the impact.
Another interesting data point on this: there are only 108 files with each representing a country therefore many countries are missing including Norway, Sweden, Denmark and Iceland, but Finland is in there. It's not clear why.
Here's the complete list of files in the corpus of data I was sent. If anyone has a different set, I'd be interested in hearing about it: gist.github.com/troyhunt/00b9a…
On closer inspection, all the file names are in Italian. So Norway ("norvegia") is there, as are Sweden ("svezia") and Denmark ("danimarca"). Sorry folks, tweeting as I go here.
Now that's clear, I'm finding a lot of friends from various places who've confirmed their exposed data. I haven't seen anything yet to suggest this breach isn't legit.
So what's the impact? For a targeted attack where you know someone's name and country, it's great for mobile phone lookup. Much harder to do en masse as there's no reliable key; I couldn't take a big list of emails and resolve them to phone numbers as email is rare in the data.
But for spam based on using phone number alone, it's gold. Not just SMS, there are heaps of services that just require a phone number these days and now there's hundreds of millions of them conveniently categorised by country with nice mail merge fields like name and gender.
Should the FB phone numbers be searchable in @haveibeenpwned? I’m thinking through the pros and cons in terms of the value it adds to impacted people versus the risk presented if it’s used to help resolve numbers to identities (you’d still need the source data to do that).
Factors influencing my consideration of this: only about 1% of the records have email addresses, the phone numbers are easily parsed (they’re in a CSV) and they’re formatted complete with country code. It’s a very clean data set and is 100x more useful than email in this case.
Another general observation on this incident: I'm seeing *extensive* sharing of the data, both the entire corpus of countries and individual country files. Not just in hacking circles, but very broadly on social media too. This data is everywhere already.
Email parsing now done, found 2,529,621 unique addresses across the 108 files. Call it about 0.5% of all records having an email address.
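A sketch of how a unique-address count like that could be produced, under stated assumptions: the regular expression is a deliberately simple pattern of my own choosing (not a full validator, and not necessarily what was used here), and addresses are de-duplicated case-insensitively. The sample lines are made up. The final line checks the "about 0.5%" figure against the 533M record count reported in the headlines; that denominator is an assumption.

```python
import re

# Deliberately simple pattern; an assumption, not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def unique_emails(lines):
    """Return the set of lower-cased addresses found in an iterable of lines."""
    found = set()
    for line in lines:
        for match in EMAIL_RE.findall(line):
            found.add(match.lower())
    return found

# Made-up sample rows (not real data); the same address appears twice
# with different casing and should be counted once.
sample_lines = [
    "+61400000001,1,Ann@Example.com,Ann",
    "+61400000002,2,,Bob",
    "+61400000003,3,ann@example.com,Ann",
]
emails = unique_emails(sample_lines)

# 2,529,621 addresses against an assumed 533M records.
share = 2_529_621 / 533_000_000   # a little under 0.5%
```

Because the set holds every address in memory, a few million entries is comfortable on a laptop; a much larger corpus might call for sorting and de-duplicating on disk instead.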
That’s the email addresses loaded, I’m still considering what to do with the phone numbers
I’m seeing a lot of anecdotal reports that people have received a marked uptick in spam calls and SMSs aligning with this incident. It’s very hard to attribute though; I get a heap of those too and my number isn’t in the data. That said, I expect the data will be abused.
I see a lot of questions like this one from @Blessf11 and it’s always the same answer: the service that suffered the breach should provide the data that is circulating publicly to the rightful owner of it. FB, of all companies, has the resources to do this
After doing rather a large lot of processing and discovering 370M rows in the data set I was given some weeks ago then wondering why the headlines read 533M... I've been sent a separate set of files. This set aligns with more recent reporting: gist.github.com/troyhunt/9a081…
Which means that now I need to figure out the gaps and if it impacts the email addresses already loaded into @haveibeenpwned. It'll *definitely* impact the phone numbers, if I decide to load them.
Much of the data is same same but different; Albania, for example, begins with the same phone numbers and FB IDs but the original data was CSV whilst this lot is a colon delimited text file with a different field order.
This is really kludgy; 2nd data set has nowhere near the consistency of the 1st with colon delimiters, comma delimiters, headers, no headers, quote encapsulation, no quote encapsulation, different field orders, + before num, no + before num. Hackers have no attention to detail!
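To make the clean-up problem concrete, here is a minimal Python sketch that handles a few of those inconsistencies: colon versus comma delimiters, optional header rows, optional quote encapsulation and an optional leading "+". Everything here is hypothetical: the function names, the delimiter and header heuristics, and the assumption that the phone number is the first field. It does not attempt to reconcile the different field orders, and the colon heuristic would misfire on rows that contain timestamps. The sample rows are made up.

```python
import csv

def detect_delimiter(line):
    """Guess ':' or ',' by whichever appears more often in the line."""
    return ":" if line.count(":") > line.count(",") else ","

def split_line(line):
    """Split one line on its guessed delimiter, removing any quote encapsulation."""
    return next(csv.reader([line], delimiter=detect_delimiter(line)))

def is_header(fields):
    """Treat a row as a header if its first field contains no digits."""
    return not any(ch.isdigit() for ch in fields[0])

def normalise_phone(raw):
    """Strip whitespace and ensure exactly one leading '+'."""
    return "+" + raw.strip().lstrip("+")

def normalise(lines, phone_index=0):
    """Return data rows with headers dropped and the phone field normalised."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = split_line(line)
        if is_header(fields):
            continue
        fields[phone_index] = normalise_phone(fields[phone_index])
        rows.append(fields)
    return rows

# Made-up rows mixing the styles described above (not real data).
mixed = [
    "phone,uid,first_name",
    '"+355600000001","1","Ana"',
    "355600000002:2:Ben",
]
rows = normalise(mixed)
```

Once both data sets are in one shape like this, comparing them on a key such as phone number plus FB ID becomes a simple set operation.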
The problem with this whole situation is that in a vacuum of information, people speculate. Facebook needs to make a clear statement on the data that’s in broad circulation: when it happened, where it came from and what’s in it. Without that, confusion and speculation reign.
The Facebook phone numbers are now being loaded into @haveibeenpwned and will be searchable later today. Stay tuned, I'll push out a short blog once it's good to go (will be queryable via the existing API too 😎).
Statement from Facebook on this incident: “Scraping data using features meant to help people violates our terms”. Well that fixes that! about.fb.com/news/2021/04/f…
