Abeba Birhane
Jun 30, 2023 · 42 tweets
New paper!📢
On Hate Scaling Laws for Data-Swamps with @vinayprabhu, Sang Han & @VishnuBoddeti
Paper: arxiv.org/abs/2306.13141
Code: github.com/vinayprabhu/ha…

WARNING: Contains examples of hateful text & NSFW images that might be disturbing, distressing, &/or offensive

Long 🧵

1/
What is the cost of scale? To find out, we audit the LAION-400M and LAION-2B-en datasets (and models trained on them), the datasets behind Stable Diffusion & other SoTA models.

2/
Fundamental to the multimodal model boom (Stable Diffusion, Midjourney, Dall-E, Imagen, Parti, and BASIC) are large-scale visio-linguistic datasets of image-text pairs, such as LAION.

3/
While models like Stable Diffusion and its variants have been trained on the open datasets from the LAION family, little is known about the datasets used to train models such as Dall-E (OpenAI), Parti (Google), and Imagen (Google).

4/
These datasets come in 2 types: open, “freely available” ones, mainly scraped from CommonCrawl (e.g. LAION-400M & LAION-5B), & closed datasets curated internally by Big Tech corp labs (such as Google’s ALIGN 1.7B/ALIGN 6.6B & JFT-5B, & OpenAI’s WebImageText-WIT).

5/
The open-source variants of these datasets are getting bigger, now breaching the billion-samples mark, due to 1) “scale is all you need” thinking and 2) the capital-intensive nature of dataset curation.

6/
The “scrape-first-ask-questions-later” data filtering culture, which generates gargantuan (and plausibly illegal) datasets & models, has elicited a slew of copyright lawsuits, en masse fetishization of women’s bodies, outright bans of model outputs from forums, & a marquee of poor-quality datasets.
7/
In this paper, we examine: 1) the impact of scale on hate-speech through audits of textual descriptions in two datasets: LAION-400M and LAION-2B-en, and 2) the downstream negative impact on models trained on these two datasets through audits of such models.

8/
The race to scale is a fixation driving not only research in ML but also the larger tech “innovation” discourse. Entrepreneurs are warned that “if you don’t know how to scale, don’t innovate”. Large scale is thought to correlate with better model performance in ML.

9/
Scale is presented as a shortcut that can circumvent dataset-curation problems such as problematic content, resource-intensive dataset filtering & costly annotation processes, where larger scale has become a substitute for quality data.

10/
Scale thinking, according to critical scholars, stands in stark opposition to values such as equity & effective systemic change. Unwavering commitment to scalability is instrumental to the realisation of Big Tech’s objectives: profit maximisation, market monopoly & power centralisation.
11/
We audit 2 versions of the LAION dataset: LAION-400M & LAION-2B-en, the English-language subset of the larger LAION-5B dataset. To evaluate the impact of scaling from 400 million to 2 billion samples on hateful content, we analyzed the alt-text descriptions associated w images
12/
First, we define our metric, Hate Content Rate (HCR), and pass the text through Pysentimiento, an open-source tool that assigns probability scores for 'hateful', 'targeted', & 'aggressive' speech.
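The HCR computation can be sketched in a few lines. The probability scores below are stand-ins for Pysentimiento's 'hateful'/'targeted'/'aggressive' outputs, invented for illustration, not drawn from the paper:

```python
def hcr(scores, category, p_threshold=0.5):
    """Hate Content Rate: fraction of samples whose probability for
    `category` exceeds p_threshold."""
    flagged = sum(1 for s in scores if s[category] > p_threshold)
    return flagged / len(scores)

# Hypothetical Pysentimiento-style per-caption probabilities (illustrative only).
scores = [
    {"hateful": 0.91, "targeted": 0.12, "aggressive": 0.05},
    {"hateful": 0.03, "targeted": 0.01, "aggressive": 0.02},
    {"hateful": 0.64, "targeted": 0.70, "aggressive": 0.10},
]

print(hcr(scores, "hateful"))  # 2 of the 3 captions exceed the 0.5 threshold
```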

13/

We then compare the statistics associated with the HCR matrices to understand the nature of the text that was scooped in when the dataset expanded from 400 million to 2 billion samples.

14/
We use both HCR and, more specifically, ‘Any-of-the-three’-HCR, ψ̄(Pthreshold = 0.5), as the default metric of comparison to characterize the amount of problematic content in both the LAION-400M and LAION-2B-en datasets.

15/
As Pthreshold increases, the HCR curves monotonically decrease, indicating that fewer textual samples meet the more stringent constraint placed by a higher Pthreshold value.
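This monotone behaviour follows directly from the definition: raising Pthreshold can only shrink the flagged set. A self-contained check, on synthetic probability triples rather than the paper's data:

```python
def any_of_three_hcr(samples, p_threshold):
    """'Any-of-the-three' HCR: fraction of samples where any of the three
    scores (hateful, targeted, aggressive) exceeds p_threshold."""
    hit = sum(1 for s in samples if max(s) > p_threshold)
    return hit / len(samples)

# Synthetic (hateful, targeted, aggressive) probability triples.
samples = [(0.9, 0.1, 0.0), (0.4, 0.6, 0.2), (0.2, 0.1, 0.1), (0.55, 0.3, 0.7)]

curve = [any_of_three_hcr(samples, t) for t in (0.3, 0.5, 0.7, 0.9)]
assert all(a >= b for a, b in zip(curve, curve[1:]))  # monotonically non-increasing
print(curve)
```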

16/
For all sentiment types (hate, targeted & aggressive), the HCR curves for 2B-en lie strictly above the 400M curves. Meaning: irrespective of what Pthreshold is chosen, the HCR, signifying prevalence of hateful content, is higher for 2B-en than for 400M.

17/
Amongst the 3 sentiment types, the 'hateful' type emerged as the most prevalent in both datasets, with 2B-en having an HCR of up to 0.7% vs 0.6% for 400M, followed by the 'targeted' type (0.25% vs 0.2%) & the 'aggressive' type (0.04% vs 0.03%).

18/
To investigate the increased presence of hateful, targeted & aggressive content with scale more deeply, we perform a binomial proportion confidence interval analysis to establish lower & upper confidence bounds of 'Any-of-the-three'-HCR for both datasets at a Pthreshold of 0.5.

19/
Even under this benevolent setting where we compute the difference between the lower-bound estimate of HCR for the 2B-en dataset and the upper-bound estimate of HCR for the 400M dataset, we still see a 12.26% normalized increase in HCR.
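This comparison can be sketched as follows. The interval here is a normal-approximation (Wald) CI; the paper's exact interval method may differ, and the flagged/total counts below are invented for illustration (the paper's actual counts are what yield the 12.26% figure):

```python
import math

def binomial_ci(k, n, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval for a proportion."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Illustrative counts only (not the paper's numbers): flagged vs total samples.
lo_2b, _ = binomial_ci(k=14_000_000, n=2_000_000_000)  # lower bound, LAION-2B-en
_, hi_400m = binomial_ci(k=2_400_000, n=400_000_000)   # upper bound, LAION-400M

# "Benevolent" comparison: most favourable bound for each dataset.
normalized_increase = (lo_2b - hi_400m) / hi_400m
print(f"{normalized_increase:.2%}")
```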

20/
We also carried out a file-wise comparison of specific shards of both datasets. We found the HCRs for LAION-2B-en are statistically higher than their 400M counterparts. E.g., the 'hateful' HCR for LAION-400M has a mean value of 0.298, which increased to 0.344 for LAION-2B-en.
21/

For all 3 types (hateful, targeted & aggressive), the strong t-values combined with high Cohen's d & low p-values support the hypothesis that the file-wise HCR for the 2B-en dataset is higher than for 400M, further evidence of dataset degradation upon scaling.
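A sketch of the shard-wise test: Welch's t-statistic and Cohen's d over matched file-wise HCRs. The HCR values below are hypothetical, chosen only to illustrate the computation; the real analysis runs over the datasets' actual shards:

```python
import math
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d effect size, with pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                       / (na + nb - 2))
    return (mean(b) - mean(a)) / pooled

def welch_t(a, b):
    """Welch's t-statistic (unequal variances)."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    return (mean(b) - mean(a)) / math.sqrt(va + vb)

# Hypothetical file-wise 'hateful' HCRs for matched shards (illustrative only).
hcr_400m = [0.29, 0.31, 0.30, 0.28, 0.31]
hcr_2b   = [0.34, 0.35, 0.33, 0.36, 0.34]

print(f"t = {welch_t(hcr_400m, hcr_2b):.2f}, d = {cohens_d(hcr_400m, hcr_2b):.2f}")
```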

22/
Model audit:

To quantitatively evaluate the downstream consequences of dataset scale on models, we explored model variants where the architecture was held constant & 2 model checkpoints were provided: one trained on LAION-400M & the second trained on LAION-2B-en.

23/
We used the Chicago Face Database (CFD) as a probe dataset. We replicated OpenAI's Zero-Shot CLIP experiment, extracting 7 classes from their CLIP paper: 'animal', 'gorilla', 'chimpanzee', 'orangutan', 'thief', 'criminal' & 'suspicious person', & added our own class, 'human being'.

24/

We then passed the 597 CFD images through the three model variants: (ViT-L-14, openai), (ViT-L-14, laion400m_e32) & (ViT-L-14, laion2b_s32b_b82k), & computed the probability of the top-predicted class for each image (the class with the highest cosine-similarity/softmax value).
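The zero-shot step itself can be sketched as below: cosine similarity between an image embedding and each class's text-prompt embedding, followed by a temperature-scaled softmax (CLIP uses a learned logit scale around 100). The embeddings here are toy 3-d vectors, not real CLIP outputs; in practice the checkpoints above are loaded via open_clip with pretrained tags like 'laion400m_e32'.

```python
import math

def zero_shot_probs(image_emb, text_embs):
    """CLIP-style zero-shot classification: cosine similarity of one image
    embedding against each class's text embedding, then softmax over classes."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))
    sims = [cos(image_emb, t) for t in text_embs]
    exps = [math.exp(s * 100) for s in sims]  # logit scale ~100, as in CLIP
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-d embeddings for illustration; real CLIP embeddings are 768-d.
classes = ["human being", "criminal", "suspicious person"]
image = [0.2, 0.9, 0.4]
texts = [[0.1, 0.8, 0.5], [0.9, 0.1, 0.2], [0.3, 0.3, 0.3]]

probs = zero_shot_probs(image, texts)
top = classes[probs.index(max(probs))]
print(top, [round(p, 3) for p in probs])
```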

25/
We found that none of the model variants assigned human images from CFD a high (close to 1) Phuman score. Instead, these models yielded Phuman scores closer to 0.2.

26/
Both the models trained on LAION-400M and on OpenAI-WIT label images of humans from CFD as one of the racist and dehumanizing classes (as opposed to 'human being'), with a Phuman rate of just 0.186 for LAION-400M.

27/
This further decreased to 0.134 for OpenAI-WIT. In other words, OpenAI-CLIP associates nearly 87% of the CFD human-face images with the 7 offensive classes rather than the human-being class, with a particular stress towards the suspicious person class.

28/
When the dataset was scaled from 400M samples (LAION-400M) to 2 billion samples (LAION-2B-en), Phuman fell by nearly half, from 0.186 to 0.094, with most of the softmax mass allocated to the criminal and suspicious person classes.

29/
The mean softmax score the model allocates to the criminal class for Black-female faces more than doubled, from 0.22 ➡️ 0.45, when the dataset was scaled from 400M to 2B. Similarly, the mean softmax score for the criminal class nearly tripled, from 0.22 ➡️ 0.65, for Black-male faces.

30/
While 21.2% of the Black-female faces had a top-predicted class of criminal for the 400M model, this number almost doubled to 41.3% for the 2B-en model. Notably, the misclassification rates for the Black-male category (Pbm→criminal) increased more than five-fold, from 14% ➡️ 77.4%.

31/
We provide a qualitative analysis of the historical roots of the dehumanisation and criminalisation of Black bodies. Current models and datasets encode and exacerbate this historical dehumanisation.

32/
The dehumanization of Black bodies through comparison & classification of Black people as animals, specifically apes, monkeys & orangutans, goes back to the 13th century. European voyagers referred to West Africans as violent savages, uncivilized & beast-like, & even displayed them in zoos.

33/
Dehumanizing depictions of Black people can still be found in how soccer players of African descent are portrayed in Europe; in caricatures of Barack Obama as a chimpanzee; in the racist name-calling of Michelle Obama as an “Ape in heels”; & in comparisons of U.S. Rep. Maxine Waters to an orangutan.

34/
Here’s a collage of images from the LAION datasets that had the term gorilla in the alt-text description and were flagged by the Pysentimiento model as hateful. (Note: we’ve hand-blurred and pixelated sub-figures (b) and (f).)

35/
A major bottleneck to this work has been compute constraints. Open source without access to compute only serves big corps & elite institutions.

E.g., merely downloading LAION-2B-en requires 6.2TB of storage, plus additional compute to carry out analyses such as running Pysentimiento.

36/
We point out various inconsistencies and haphazard, ad-hoc practices in the data filtering, creation and curation space, and indicate what could be done about them.

37/
Despite the wealth of resources big corps & AI orgs have, ethics & audit work (when it is done at all) is done haphazardly. We recommend that audit, evaluation, & critical & ethics work in general be carried out to the highest possible standards & scientific rigour. Otherwise, risk ethics washing.

38/
Today’s state-of-the-art visio-linguistic multimodal models are trained with massive carbon footprints, massive data infrastructure and massive funding.

39/
These models are deployed in the real world, including in recommendation systems, information-retrieval systems, semantic-search systems & image-captioning systems, although, as we have illustrated, they can fail at associating photos of humans with the description “A photo of a human being”.
40/
Given that such failures have dire consequences for real people, often the marginalised, we implore the research community & those developing/deploying these systems to carry out due diligence & take necessary action, including refraining from use in high-stakes scenarios.
End
Special thanks @DocDre, Ellen Rushe, @GaryMarcus, @SashaMTL, and @ThomasPML for helpful feedback and comments on an earlier version of the paper.
