I think we can call it shut on 'Open' AI: the 98-page paper introducing GPT-4 proudly declares that they're disclosing *nothing* about the contents of their training set.
Why should you care? Every piece of academic work on ML datasets has found consistent and problematic ways that training data conditions what the models output. (@safiyanoble, @merbroussard, @emilymbender, etc.) Indeed, that's the whole point! That's what training data is!
Choices of training data reflect historic biases and can inflict all sorts of harms. To ameliorate those harms, and to make informed decisions about where a model should *not* be used, we need to know what kinds of biases are built in. OpenAI's choices make this impossible.
Neural networks like GPT-4 are notoriously black boxes; the fact that their operations are unpredictable and inscrutable is one of *the* most important questions about whether and where they should be used. And now OpenAI is planting a standard that extends that mystery even further.
Their argument is basically a combination of 'trust us' and 'fine-tuning will fix it all.' But the way they've built corpora in the past shouldn't inspire trust. When OpenAI launched GPT-2, their brilliant idea was to find 'high quality' pages by using Reddit upvotes.
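The WebText heuristic can be sketched in a few lines: keep a page only if some Reddit submission linking to it cleared a karma threshold (the original WebText reportedly used 3). Field names below are illustrative, not OpenAI's actual pipeline.

```python
# Sketch of the WebText-style heuristic: treat Reddit karma as a quality
# proxy and keep only pages linked from sufficiently upvoted submissions.
# The record fields here are hypothetical, not OpenAI's actual schema.

KARMA_THRESHOLD = 3  # threshold reported for the original WebText

def filter_by_karma(submissions, threshold=KARMA_THRESHOLD):
    """Return the set of outbound URLs from submissions at/above threshold."""
    kept = set()
    for sub in submissions:
        if sub["karma"] >= threshold and sub["url"]:
            kept.add(sub["url"])
    return kept

subs = [
    {"url": "https://example.com/good-article", "karma": 57},
    {"url": "https://example.com/ignored", "karma": 1},
]
print(sorted(filter_by_karma(subs)))  # only the upvoted link survives
```

Notice what this proxy actually measures: not quality, but what Reddit's voters liked.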
That probably beats the morass of regular web text, but the idea of Reddit upvotes as the gold standard for quality is--dystopian? Last week we made a map of the open recreation of this corpus, OpenWebText--it's crazy easy to find awful stuff. Try it! atlas.nomic.ai/map/owt
For GPT-3 that set served as a standard for filtering sites out of the Common Crawl. We made a map of the Pile reproduction of that. I have no idea if OpenAI filtered stuff like the below out, or if r/the_donald gave it upvotes back in the day. Neither do you. atlas.nomic.ai/map/cc8m
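The GPT-3 filtering step worked by scoring Common Crawl documents against the WebText-style reference corpus and keeping high scorers (the real filter was a trained classifier over hashed features; the token-overlap score below is a toy stand-in, and the vocabulary is made up).

```python
# Toy stand-in for GPT-3-style quality filtering: score Common Crawl
# documents against a reference corpus (WebText played this role) and
# keep those above a threshold. The actual filter was a trained
# classifier; this overlap score is purely illustrative.

def quality_score(doc, reference_vocab):
    """Fraction of a document's tokens that appear in the reference vocab."""
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in reference_vocab)
    return hits / len(tokens)

reference_vocab = {"the", "of", "history", "analysis", "evidence"}
docs = ["Analysis of the historical evidence", "BUY CHEAP xxx NOW!!!"]
kept = [d for d in docs if quality_score(d, reference_vocab) > 0.5]
print(kept)
```

The key point stands regardless of the scoring function: whatever biases the reference corpus had, the filter propagates them into the final training set.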
Here's a link to the paper. The whole thing is a fascinating artifact--it looks like an arxiv paper using the neurips latex template (@andriy_mulyar pointed this out), but it's posted on their own web site and is authored by a company, not people. cdn.openai.com/papers/gpt-4.p…
One last point from the comments: it's hard to believe that 'competition' and 'safety' are the only reasons for OpenAI's secrecy, when hiding training data makes it harder to follow the anti-Stability playbook and sue them for appropriating others' work. reuters.com/legal/transact…
We've just released from @nomic_ai a new map for exploring 6+ million AI-generated images and the prompts used to create them, collected by @krea_ai. atlas.nomic.ai/map/809ef16a-5… The most exciting thing for me here is how this changes full-text search: here's why (thread).
In my life as a researcher, search engines have gotten worse and worse on some axes. UX studies targeted at users push institutions toward single search-box interfaces whose ordered result lists show barely anything: NYU's search engine only shows 3 results for 'maps'!
This is fine for needle-in-haystack operations, but it also leaves people with no more knowledge of what they're searching over after a search than when they started. Facets help, but only when there's consistent metadata: with federated search, that's rarer and rarer.
There's a zombie idea that humanities majors somehow remain 'only for elites' as they fall nationwide. It's not true. Here are four majors at Yale over the last 35 years. Yale history made a great hullabaloo a few years ago about reclaiming the 'largest major' title. But look:
I hate to talk about this because it contributes to the insane conflation of "the Ivies" with "higher education" that the NYT lives on. But inside the historical profession I *still* periodically hear people trying to draw lessons from the Yale comeback. historians.org/publications-a…
Or here's Princeton. (The CS series starts in 2019 because they reported under an engineering code before then.) Same pattern: collapse in history and English. I don't doubt that the Yale history department is good at creating a compelling narrative from a stray data point...
@AaronRHanlon@Ted_Underwood But I see this as the core of my disagreement. I see a lot of people trying to yoke the humanities to the non-applied sciences as shared practitioners of 'pure research,' and that's what I see as 'liberal arts-ism'. The belief that the humanities-vs-STEM framing is incorrect, or
@AaronRHanlon@Ted_Underwood that there's a new ordering of things that could bring others around... I see these not as new ideas but as doomed attempts to bring back the 1990s/2000s, when the humanities were ok. (I can't access your old Chronicle article which I think is where you spelled your thoughts out)
@AaronRHanlon@Ted_Underwood In the eyes of most people in universities, what the humanities do and teach *isn't* research at all. (At NYU, there is literally a definition of research that excludes humanities work). It's an embarrassment of posturing and empty rhetoric that fails at *science*. That's novel.
@ipeds_nces just released new data on degree completions for the 2021 class (the first class with a full semester during the pandemic). History and Religion have both joined English in being down to half their 2000s peak; philosophy's rebound persists, while area studies falls.
Here's the raw size of all the fields (just BAs). The downtick in cultural, ethnic, and gender studies is notable--those had been the only fields *not* to get pulled down by the collapse of humanities majors. Also sharper-than-normal drops in English, Comp Lit, languages...
Overall changes from 2019-2021. Computer Science just keeps growing--I'm not sure what's up with ROTC, but maybe a category shift. Surprising growth for psychology ("the worst major.")
In 1980 the median age of history authors published by Yale University Press was 40; by 2013 it was nearly 60. It's striking how *different* the age profiles are across different presses.
Here's the same chart for *all* university press books.
Methodology notes: 1. This is from the raw data dumps of the Library of Congress. I should probably be doing more with this set; it's insanely rich. 2. This includes reissues. Eyeballing it, I didn't see all that many, but they're there.
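The median-age computation itself is simple: take publication year minus the author's birth year, group by publication year, and take the median. A minimal sketch, assuming records with a birth year already parsed out of the author field (the field names are illustrative, not the LoC dump's actual schema):

```python
# Sketch of the median-author-age computation over catalog records.
# Record fields ("birth_year", "pub_year") are hypothetical stand-ins
# for values parsed from the raw Library of Congress dumps.
from collections import defaultdict
from statistics import median

def median_age_by_year(records):
    """Median (pub_year - birth_year) for each publication year."""
    ages = defaultdict(list)
    for rec in records:
        if rec["birth_year"] and rec["pub_year"]:
            ages[rec["pub_year"]].append(rec["pub_year"] - rec["birth_year"])
    return {year: median(v) for year, v in sorted(ages.items())}

records = [
    {"birth_year": 1940, "pub_year": 1980},
    {"birth_year": 1938, "pub_year": 1980},
    {"birth_year": 1955, "pub_year": 2013},
]
print(median_age_by_year(records))
```

One caveat the reissue note above implies: a reissue carries the original author's birth year but a new publication year, which inflates the apparent age for that year.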
@TheHigherFriar nails the color dimension (actual codes in image). And % paying a mortgage is a good guess because unlike--say--homeownership it captures the falloff. But it's not real estate.