CyberChick
Jul 31 · 16 tweets · 6 min read
A major AI training data set contains millions of examples of personal data
Personally identifiable information has been found in DataComp CommonPool, one of the largest open-source data sets used to train image generation models.
Millions of images of passports, credit cards, birth certificates, and other documents containing personally identifiable information are likely included in one of the biggest open-source AI training sets, new research has found.
Thousands of images—including identifiable faces—were found in a small subset of DataComp CommonPool, a major AI training set for image generation scraped from the web. Because the researchers audited just 0.1% of CommonPool’s data, they estimate that the real number of images containing personally identifiable information, including faces and identity documents, is in the hundreds of millions. The study detailing the findings was published on arXiv earlier this month.
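For intuition, the scale-up from a 0.1% audit to a pool-wide estimate is simple proportional arithmetic. A minimal sketch follows; the per-sample hit count is a hypothetical placeholder, not a figure from the paper (only the 12.8 billion total comes from the article):

```python
# Proportional scale-up from an audited subsample to the full pool.
# TOTAL_SAMPLES comes from the article (12.8 billion image-text pairs);
# HITS_IN_SAMPLE is a hypothetical placeholder, not a number from the paper.

TOTAL_SAMPLES = 12_800_000_000
AUDIT_FRACTION = 0.001                      # the 0.1% subset the researchers audited
HITS_IN_SAMPLE = 250_000                    # hypothetical PII-containing images found

audited_samples = TOTAL_SAMPLES * AUDIT_FRACTION       # 12.8 million samples examined
estimated_pool_wide = HITS_IN_SAMPLE / AUDIT_FRACTION  # assumes the sample is representative

print(f"audited: {audited_samples:,.0f} samples")
print(f"estimated pool-wide: {estimated_pool_wide:,.0f} images with PII")  # 250 million
```

The naive scale-up assumes the audited sample is representative of the whole pool, which is the same assumption any estimate of this kind rests on.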

Source below
The bottom line, says William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University and one of the coauthors, is that “anything you put online can [be] and probably has been scraped.”
The researchers found thousands of instances of validated identity documents—including images of credit cards, driver’s licenses, passports, and birth certificates—as well as over 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people. (In many more cases, the researchers did not have time to validate the documents or were unable to because of issues like image clarity.)
A number of the résumés disclosed sensitive information including disability status, the results of background checks, birth dates and birthplaces of dependents, and race. When résumés were linked to people with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and the contact information of other people (like references).
When it was released in 2023, DataComp CommonPool, with its 12.8 billion data samples, was the largest existing data set of publicly available image-text pairs, which are often used to train generative text-to-image models. While its curators said that CommonPool was intended for academic research, its license does not prohibit commercial use.

CommonPool was created as a follow-up to the LAION-5B data set, which was used to train models including Stable Diffusion and Midjourney. It draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022.
While commercial models often do not disclose what data sets they are trained on, the shared data sources of DataComp CommonPool and LAION-5B mean that the data sets are similar, and that the same personally identifiable information likely appears in LAION-5B, as well as in other downstream models trained on CommonPool data. CommonPool researchers did not respond to emailed questions.
And since DataComp CommonPool has been downloaded more than 2 million times over the past two years, it is likely that “there [are] many downstream models that are all trained on this exact data set,” says Rachel Hong, a PhD student in computer science at the University of Washington and the paper’s lead author. Those models would carry the same privacy risks.
Good intentions are not enough!

“You can assume that any large-scale web-scraped data always contains content that shouldn’t be there,” says Abeba Birhane, a cognitive scientist and tech ethicist who leads Trinity College Dublin’s AI Accountability Lab—whether it’s personally identifiable information (PII), child sexual abuse imagery, or hate speech (which Birhane’s own research into LAION-5B has found).
Indeed, the curators of DataComp CommonPool were themselves aware that PII was likely to appear in the data set, and they did take some measures to preserve privacy, including automatically detecting and blurring faces. But in the small subset they audited, Hong’s team found and validated over 800 faces that the algorithm had missed, and they estimated that overall, the algorithm had missed 102 million faces in the entire data set. The curators also did not apply filters that could have recognized known PII character strings, like email addresses or Social Security numbers.
“Filtering is extremely hard to do well,” says Agnew. “They would have had to make very significant advancements in PII detection and removal that they haven’t made public to be able to effectively filter this.”
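To make that concrete, here is a minimal sketch of the kind of string-level PII filter the curators did not apply: two naive regexes, for email addresses and US-style Social Security numbers. The patterns are illustrative assumptions, and their obvious gaps (formatting variants, non-US identifiers, text embedded in images) are exactly why filtering at this scale is hard.

```python
import re

# Naive string-level PII detectors; illustrative only. Real PII detection
# needs far more than two regexes (names, addresses, non-US identifiers,
# OCR of text inside images, etc.), which is Agnew's point.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # misses unhyphenated variants

def flag_pii_strings(text: str) -> list[str]:
    """Return PII-like substrings found in a caption or metadata field."""
    return EMAIL_RE.findall(text) + SSN_RE.findall(text)

# Example: a caption of the sort the researchers found alongside images.
print(flag_pii_strings("Resume of J. Doe, contact jdoe@example.com, SSN 123-45-6789"))
# -> ['jdoe@example.com', '123-45-6789']
```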
There are other privacy issues that the face blurring doesn’t address. While the blurring filter is automatically applied, it is optional and can be removed. Additionally, the captions that often accompany the photos, as well as the photos’ metadata, often contain even more personal information, such as names and exact locations.
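As an illustration of the metadata point, a short sketch that reads EXIF tags from an image with Pillow; the file name is a placeholder. Tags like GPSInfo (exact coordinates) and Artist (a name) ride along with the pixels and survive face blurring.

```python
from PIL import Image
from PIL.ExifTags import TAGS

# Read EXIF metadata from a local image file; "photo.jpg" is a placeholder.
# Fields such as GPSInfo and Artist can carry exact locations and names
# even when the face in the image has been blurred.
img = Image.open("photo.jpg")
exif = img.getexif()

for tag_id, value in exif.items():
    tag_name = TAGS.get(tag_id, tag_id)   # map numeric tag IDs to readable names
    print(f"{tag_name}: {value}")
```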
Another privacy mitigation measure comes from Hugging Face, the platform that hosts CommonPool and distributes other training data sets; it integrates a tool that theoretically allows people to search for and remove their own information from a data set. But as the researchers note in their paper, this would require people to know that their data is there to start with. When asked for comment, Florent Daudens of Hugging Face said that “maximizing the privacy of data subjects across the AI ecosystem takes a multilayered approach, which includes but is not limited to the widget mentioned,” and that the platform is “working with our community of users to move the needle in a more privacy-grounded direction.”
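Absent the widget, self-auditing amounts to scanning a data set's text fields for your own identifiers. A hedged sketch using the Hugging Face datasets library follows; the dataset name and column are placeholders, and a real scan of CommonPool's metadata shards would be far more involved.

```python
from datasets import load_dataset

MY_EMAIL = "me@example.com"   # the identifier you want to search for

# Stream a web-scale image-text dataset rather than downloading it outright;
# the dataset name and "text" column here are placeholders, not CommonPool's
# actual layout.
ds = load_dataset("some-org/web-scale-image-text", split="train", streaming=True)

for i, record in enumerate(ds):
    if MY_EMAIL in str(record.get("text", "")):
        print(f"hit at record {i}: {record}")
    if i >= 1_000_000:        # cap the scan for this sketch
        break
```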
In any case, just getting your data removed from one data set probably isn’t enough. “Even if someone finds out their data was used in a training data set and … exercises their right to deletion, technically the law is unclear about what that means,” says Tiffany Li, an associate professor of law at the University of San Francisco School of Law. “If the organization only deletes data from the training data set—but does not delete or retrain the already trained model—then the harm will nonetheless be done.”
The bottom line, says Agnew, is that “if you web-scrape, you’re going to have private data in there. Even if you filter, you’re still going to have private data in there, just because of the scale of this. And that’s something that we [machine-learning researchers], as a field, really need to grapple with.”
Reconsidering consent!

CommonPool was built on web data scraped between 2014 and 2022, meaning the images predate the release of ChatGPT in late 2022. So even if it’s theoretically possible that some people consented to having their information publicly available to anyone on the web, they could not have consented to having their data used to train large AI models that did not yet exist.
And with web scrapers often scraping data from each other, an image that was originally uploaded by the owner to one specific location would often find its way into other image repositories. “I might upload something onto the internet, and then … a year or so later, [I] want to take it down, but then that [removal] doesn’t necessarily do anything anymore,” says Agnew.

The researchers also found numerous examples of children’s personal information, including depictions of birth certificates, passports, and health status, but in contexts suggesting that they had been shared for limited purposes.
“It really illuminates the original sin of AI systems built off public data—it’s extractive, misleading, and dangerous to people who have been using the internet with one framework of risk, never assuming it would all be hoovered up by a group trying to create an image generator,” says Ben Winters, the director of AI and privacy at the Consumer Federation of America.

Finding a policy that fits!

Ultimately, the paper calls for the machine-learning community to rethink the common practice of indiscriminate web scraping and also lays out the possible violations of current privacy laws represented by the existence of PII in massive machine-learning data sets, as well as the limitations of those laws’ ability to protect privacy.
“We have the GDPR in Europe, we have the CCPA in California, but there’s still no federal data protection law in America, which also means that different Americans have different rights protections,” says Marietje Schaake, a Dutch lawmaker turned tech policy expert who currently serves as a fellow at Stanford’s Cyber Policy Center. 

Besides, these privacy laws apply to companies that meet certain criteria for size and other characteristics. They do not necessarily apply to researchers like those who were responsible for creating and curating DataComp CommonPool.
And even state laws that do address privacy, like the California Consumer Privacy Act, have carve-outs for “publicly available” information. Machine-learning researchers have long operated on the principle that if it’s available on the internet, then it is public and no longer private information, but Hong, Agnew, and their colleagues hope that their research challenges this assumption.
“What we found is that ‘publicly available’ includes a lot of stuff that a lot of people might consider private—résumés, photos, credit card numbers, various IDs, news stories from when you were a child, your family blog. These are probably not things people want to just be used anywhere, for anything,” says Hong.  
Hopefully, Schaake says, this research “will raise alarm bells and create change.” 
(This article previously misstated Tiffany Li's affiliation. This has been fixed.)