Abeba Birhane
Jun 30, 2023 · 42 tweets
New paper!📢
On Hate Scaling Laws for Data-Swamps with @vinayprabhu, Sang Han & @VishnuBoddeti
Paper: arxiv.org/abs/2306.13141
Code: github.com/vinayprabhu/ha…

WARNING: Contains examples of hateful text & NSFW images that might be disturbing, distressing, &/or offensive

Long 🧵

1/
What is the cost of scale? To find out, we audit the LAION-400M and LAION-2B-en datasets (and models trained on them), the datasets behind Stable Diffusion & other SoTA models.

2/
Fundamental to the multimodal model boom (Stable Diffusion, Midjourney, Dall-E, Imagen, Parti, and BASIC) are large-scale visio-linguistic datasets of image-text pairs, such as LAION.

3/
While models like Stable Diffusion and its variants have been trained on the open datasets from the LAION family, little is known about the datasets used to train models such as Dall-E (OpenAI), Parti (Google), and Imagen (Google).

4/
These datasets come in 2 types: open, “freely available” ones, mainly scraped from CommonCrawl (e.g. LAION-400M & LAION-5B), & closed datasets curated internally by Big Tech corp labs (such as Google’s ALIGN 1.7B/ALIGN 6.6B & JFT-5B, & OpenAI’s WebImageText-WIT).

5/
The open-source variants of these datasets are getting bigger, now breaching the billion-samples mark, due to 1) “scale is all you need” thinking and 2) the capital-intensive nature of dataset curation.

6/
The “scrape-first-ask-questions-later” data filtering culture, which generates gargantuan (and plausibly illegal) datasets & models, has elicited a slew of copyright lawsuits, en masse fetishization of women’s bodies, outright bans of model outputs from forums, & a marquee of poor-quality datasets.
7/
In this paper, we examine: 1) the impact of scale on hate-speech through audits of textual descriptions in two datasets: LAION-400M and LAION-2B-en, and 2) the downstream negative impact on models trained on these two datasets through audits of such models.

8/
The race to scale is a fixation driving not only research in ML but also the larger tech “innovation” discourse. Entrepreneurs are warned that “if you don’t know how to scale, don’t innovate”. Large scale is thought to correlate with better model performance in ML.

9/
Scale is presented as a shortcut that can circumvent dataset-curation problems such as problematic content, resource-intensive dataset filtering & costly annotation processes, where larger scale has become a substitute for quality data.

10/
Scale thinking, according to critical scholars, stands in stark opposition to values such as equity & effective systemic change. Unwavering commitment to scalability is instrumental to the realisation of Big Tech’s objectives: profit maximisation, market monopoly & power centralisation.
11/
We audit 2 versions of the LAION dataset: LAION-400M & LAION-2B-en, the English-language subset of the larger LAION-5B dataset. To evaluate the impact of scaling from 400 million to 2 billion samples on hateful content, we analyzed the alt-text descriptions associated w images
12/
First, we define our metric, Hate Content Rate (HCR), and pass the text through Pysentimiento, an open-source tool that assigns probability scores for 'hateful', 'targeted', & 'aggressive' speech.
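The HCR computation can be sketched in a few lines. The probability scores below are stand-ins for Pysentimiento's 'hateful'/'targeted'/'aggressive' outputs, invented for illustration, not drawn from the paper:

```python
def hcr(scores, category, p_threshold=0.5):
    """Hate Content Rate: fraction of samples whose probability for
    `category` exceeds p_threshold."""
    flagged = sum(1 for s in scores if s[category] > p_threshold)
    return flagged / len(scores)

# Hypothetical Pysentimiento-style per-caption probabilities (illustrative only).
scores = [
    {"hateful": 0.91, "targeted": 0.12, "aggressive": 0.05},
    {"hateful": 0.03, "targeted": 0.01, "aggressive": 0.02},
    {"hateful": 0.64, "targeted": 0.70, "aggressive": 0.10},
]

print(hcr(scores, "hateful"))  # 2 of the 3 captions exceed the 0.5 threshold
```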

13/

We then compare the statistics associated with the HCR matrices to understand the nature of the text that was scooped in when the dataset expanded from 400 million to 2 billion samples.

14/
We use both HCR and, more specifically, ‘Any-of-the-three’-HCR, ψ̄(Pthreshold = 0.5), as the default metric of comparison to characterize the amount of problematic content in both the LAION-400M and LAION-2B-en datasets.

15/
As Pthreshold increases, the HCR curves monotonically decrease, indicating that fewer textual samples meet the more stringent constraint placed by a higher Pthreshold value.
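This monotone behaviour follows directly from the definition: raising Pthreshold can only shrink the flagged set. A self-contained check, on synthetic probability triples rather than the paper's data:

```python
def any_of_three_hcr(samples, p_threshold):
    """'Any-of-the-three' HCR: fraction of samples where any of the three
    scores (hateful, targeted, aggressive) exceeds p_threshold."""
    hit = sum(1 for s in samples if max(s) > p_threshold)
    return hit / len(samples)

# Synthetic (hateful, targeted, aggressive) probability triples.
samples = [(0.9, 0.1, 0.0), (0.4, 0.6, 0.2), (0.2, 0.1, 0.1), (0.55, 0.3, 0.7)]

curve = [any_of_three_hcr(samples, t) for t in (0.3, 0.5, 0.7, 0.9)]
assert all(a >= b for a, b in zip(curve, curve[1:]))  # monotonically non-increasing
print(curve)
```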

16/
For all sentiment types (hate, targeted & aggressive), the HCR curves for 2B-en lie strictly above the 400M curves. Meaning: irrespective of what Pthreshold is chosen, the HCR, signifying prevalence of hateful content, is higher for 2B-en than for 400M.

17/
Amongst the 3 sentiment types, the 'hateful' type emerged as the most prevalent in both datasets, with 2B-en having an HCR of up to 0.7% vs 0.6% for 400M, followed by the 'targeted' type (0.25% vs 0.2%) & the 'aggressive' type (0.04% vs 0.03%).

18/
To investigate the increased presence of hateful, targeted & aggressive content with scale more deeply, we perform a binomial proportion confidence interval analysis to establish lower & upper confidence bounds of 'Any-of-the-three'-HCR for both datasets at a Pthreshold of 0.5.

19/
Even under this benevolent setting where we compute the difference between the lower-bound estimate of HCR for the 2B-en dataset and the upper-bound estimate of HCR for the 400M dataset, we still see a 12.26% normalized increase in HCR.
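This comparison can be sketched as follows. The interval here is a normal-approximation (Wald) CI; the paper's exact interval method may differ, and the flagged/total counts below are invented for illustration (the paper's actual counts are what yield the 12.26% figure):

```python
import math

def binomial_ci(k, n, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval for a proportion."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Illustrative counts only (not the paper's numbers): flagged vs total samples.
lo_2b, _ = binomial_ci(k=14_000_000, n=2_000_000_000)  # lower bound, LAION-2B-en
_, hi_400m = binomial_ci(k=2_400_000, n=400_000_000)   # upper bound, LAION-400M

# "Benevolent" comparison: most favourable bound for each dataset.
normalized_increase = (lo_2b - hi_400m) / hi_400m
print(f"{normalized_increase:.2%}")
```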

20/
We also carried out a file-wise comparison of specific shards of both datasets. We found the HCRs for LAION-2B-en are statistically higher than their 400M counterparts. E.g., the 'hateful' HCR for LAION-400M has a mean value of 0.298, which increased to 0.344 for LAION-2B-en.
21/

For all 3 types (hateful, targeted & aggressive), the strong t-values combined with high Cohen's d & low p-values support the hypothesis that the file-wise HCR for the 2B-en dataset is higher than for 400M, further evidence of dataset degradation upon scaling.
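A sketch of the shard-wise test: Welch's t-statistic and Cohen's d over matched file-wise HCRs. The HCR values below are hypothetical, chosen only to illustrate the computation; the real analysis runs over the datasets' actual shards:

```python
import math
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d effect size, with pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                       / (na + nb - 2))
    return (mean(b) - mean(a)) / pooled

def welch_t(a, b):
    """Welch's t-statistic (unequal variances)."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    return (mean(b) - mean(a)) / math.sqrt(va + vb)

# Hypothetical file-wise 'hateful' HCRs for matched shards (illustrative only).
hcr_400m = [0.29, 0.31, 0.30, 0.28, 0.31]
hcr_2b   = [0.34, 0.35, 0.33, 0.36, 0.34]

print(f"t = {welch_t(hcr_400m, hcr_2b):.2f}, d = {cohens_d(hcr_400m, hcr_2b):.2f}")
```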

22/
Model audit:

To quantitatively evaluate the downstream consequences of dataset scale on models, we explored model variants where the architecture was held constant & 2 model checkpoints were provided: one trained on LAION-400M & the second trained on LAION-2B-en.

23/
We used the Chicago Face Database (CFD) as a probe dataset. We replicated OpenAI's Zero-Shot CLIP experiment, extracting 7 classes from their CLIP paper: 'animal', 'gorilla', 'chimpanzee', 'orangutan', 'thief', 'criminal' & 'suspicious person', & added our own class, 'human being'.

24/

We then passed the 597 CFD images through the three model variants: (ViT-L-14, openai), (ViT-L-14, laion400m_e32) & (ViT-L-14, laion2b_s32b_b82k), & computed the probability of the top-predicted class for each image (the class with the highest cosine-similarity/softmax value).
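The zero-shot step itself can be sketched as below: cosine similarity between an image embedding and each class's text-prompt embedding, followed by a temperature-scaled softmax (CLIP uses a learned logit scale around 100). The embeddings here are toy 3-d vectors, not real CLIP outputs; in practice the checkpoints above are loaded via open_clip with pretrained tags like 'laion400m_e32'.

```python
import math

def zero_shot_probs(image_emb, text_embs):
    """CLIP-style zero-shot classification: cosine similarity of one image
    embedding against each class's text embedding, then softmax over classes."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))
    sims = [cos(image_emb, t) for t in text_embs]
    exps = [math.exp(s * 100) for s in sims]  # logit scale ~100, as in CLIP
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-d embeddings for illustration; real CLIP embeddings are 768-d.
classes = ["human being", "criminal", "suspicious person"]
image = [0.2, 0.9, 0.4]
texts = [[0.1, 0.8, 0.5], [0.9, 0.1, 0.2], [0.3, 0.3, 0.3]]

probs = zero_shot_probs(image, texts)
top = classes[probs.index(max(probs))]
print(top, [round(p, 3) for p in probs])
```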

25/
We found that none of the model variants assigned human images from CFD a high (close to 1) Phuman score. Instead, these models yielded Phuman scores closer to 0.2.

26/
Both the models trained on LAION-400M and on OpenAI-WIT label images of humans from CFD as one of the racist and dehumanizing classes (as opposed to 'human being'), with a Phuman rate of just 0.186 for LAION-400M.

27/
This further decreased to 0.134 for OpenAI-WIT. In other words, OpenAI-CLIP associates nearly 87% of the CFD human-face images with the 7 offensive classes rather than the human-being class, with a particular stress towards the suspicious person class.

28/
When the dataset was scaled from 400M samples (LAION-400M) to 2 billion samples (LAION-2B-en), Phuman fell by nearly half, from 0.186 to 0.094, with most of the softmax mass allocated to the criminal and suspicious person classes.

29/
The mean softmax score the model allocates to the criminal class for Black-female faces more than doubled, from 0.22 ➡️ 0.45, when the dataset was scaled from 400M to 2B. Similarly, the mean softmax score for the criminal class nearly tripled, from 0.22 ➡️ 0.65, for Black-male faces.

30/
While 21.2% of the Black-female faces had a top-predicted class of criminal for the 400M model, this number almost doubled to 41.3% for the 2B-en model. Notably, the misclassification rates for the Black-male category (Pbm→criminal) increased more than five-fold, from 14% ➡️ 77.4%.

31/
We provide a qualitative analysis of the historical roots of the dehumanisation and criminalisation of Black bodies. Current models and datasets encode and exacerbate this historical dehumanisation.

32/
The dehumanization of Black bodies through comparison & classification of Black people as animals, specifically apes, monkeys & orangutans, goes back to the 13th century. European voyagers referred to West Africans as violent savages, uncivilized & beast-like, & even displayed them in zoos.

33/
Dehumanizing depictions of Black people can still be found in how soccer players of African descent are portrayed in Europe; in caricatures of Barack Obama as a chimpanzee; in the racist name-calling of Michelle Obama as an “Ape in heels”; & in comparisons of U.S. Rep. Maxine Waters to an orangutan.

34/
Here’s a collage of images from the LAION datasets that had the term gorilla in the alt-text description and were flagged by the Pysentimiento model as hateful. (Note: we’ve hand-blurred and pixelated sub-figures (b) and (f).)

35/
A major bottleneck to this work has been compute constraints. Open source without access to compute only serves big corps & elite institutions.

E.g., merely downloading LAION-2B-en requires 6.2TB of storage, plus additional compute to carry out analyses such as running Pysentimiento.

36/
We point out various inconsistencies and haphazard, ad-hoc practices in the data filtering, creation and curation space, and indicate what could be done about them.

37/
Despite the wealth of resources big corps & AI orgs have, ethics & audit work (when it is done at all) is done haphazardly. We recommend that audit, evaluation, & critical & ethics work in general be carried out to the highest possible standards & scientific rigour. Otherwise, risk ethics washing.

38/
Today’s state-of-the-art visio-linguistic multimodal models are trained with massive carbon footprints, massive data infrastructure and massive funding.

39/
These models are deployed in the real world, including in recommendation systems, information-retrieval systems, semantic-search systems & image-captioning systems, although, as we have illustrated, they can fail at associating photos of humans with the description “A photo of a human being”.
40/
Given that such failures have dire consequences for real people, often the marginalised, we implore the research community & those developing/deploying these systems to carry out due diligence & take necessary action, including refraining from use in high-stakes scenarios.
End
Special thanks @DocDre, Ellen Rushe, @GaryMarcus, @SashaMTL, and @ThomasPML for helpful feedback and comments on an earlier version of the paper.
