@MSFTResearch and @NVIDIAAI announce a 530B parameter large language model, 3x larger than GPT-3, achieving superior results on a variety of tasks. Trained on the Pile and evaluated with the Eval Harness, two of #EleutherAI’s biggest projects.
@MSFTResearch@NVIDIAAI The Pile is a curated dataset of high-quality data for training language models. The project was led by @nabla_theta and myself, with contributions from many others. Released on Jan 1st 2021, it was the first public massive language model training dataset.
@MSFTResearch@NVIDIAAI@nabla_theta The 530B model is trained predominantly on the Pile, with a couple newer CC scrapes mixed in. The "newer" facet is quite important, as the data in the Pile was collected prior to July 31st, 2020. Any events that happened since that date (most notably the COVID pandemic)
@MSFTResearch@NVIDIAAI@nabla_theta are completely absent. Adding in newer data is a very good idea. Figuring out how to keep datasets (and more importantly models) updated is a hard and important question, both as a foundational question about how these models work and as a practical question about how to keep
these very expensive and very slow to train models relevant to people's interests months or years later.
I hope that @NVIDIAAI and @MSFTResearch will consider releasing their remix of the Pile, as well as information on how they chose how to go about updating it.
The other major change they made to the Pile is that they seem to have (partially) deduplicated it. When we produced the Pile, the mainstream attitude was that you should train extra on "more important" data, a practice followed by @OpenAI for GPT-3.
However, subsequent research led by @katherine1ee and @daphneipp has cast doubt on this. Megatron-Turing NLG 530B takes a hybrid approach, deduplicating the data but then training for extra epochs on some subsets.
This seems a bit weird to me: if you find @katherine1ee and @daphneipp's work compelling you should deduplicate. If you don't, you should upsample important data. Doing both makes little sense to me, unless the goal is to have more control over the upsampling by removing
unintended duplication of data.
Personally I find @katherine1ee and @daphneipp's paper quite compelling, and #EleutherAI is currently working on a deduplicated version of the Pile which we will use to replicate the paper and, presumably, in future training runs.
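For the curious, a minimal sketch of what document-level exact deduplication can look like is below. This is purely illustrative and my own construction; it is not the method from @katherine1ee and @daphneipp's paper (which also catches near-duplicates with suffix arrays and MinHash), and the normalization choices are assumptions.

```python
import hashlib

def normalize(doc: str) -> str:
    # Crude normalization before hashing: lowercase and collapse whitespace.
    # Real pipelines make far more careful choices here.
    return " ".join(doc.lower().split())

def exact_dedup(docs):
    """Yield each document the first time its (normalized) text is seen."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

corpus = [
    "The Pile is a curated dataset.",
    "The Pile is a   curated dataset.",  # whitespace-only duplicate, gets dropped
    "Something else entirely.",
]
print(list(exact_dedup(corpus)))
```

Exact dedup like this misses near-duplicates entirely, which is where the heavier suffix-array / MinHash machinery from the paper comes in.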
@katherine1ee@daphneipp MT-NLG 530B comes with a host of interesting information about how it was trained with regard to parallelism and compute. It's good to see that the Megatron-DS codebase is performant at these scales, as #EleutherAI's current codebase is based on it.
@katherine1ee@daphneipp As is work by @huggingface and @BigscienceW. On first pass nothing in the details seems particularly notable, except for the omission of information about instabilities. Training models beyond the 100B scale is extremely hard and careful work is required to navigate instabilities
@katherine1ee@daphneipp@huggingface@BigscienceW In terms of the evaluation, the blog post shows results competitive with, and arguably better than, GPT-3. Direct comparison between the models is misleading, as this model was trained for fewer tokens. On Lambada the MT-NLG 530B model slightly edges out GPT-3; compare to this plot from the GPT-3 paper:
In all likelihood, one of the reasons that they chose to train for as long as they did is that it was necessary to get results that beat GPT-3. Hopefully they will release a paper with fuller data and a more complete picture of how their performance evolves.
User CRG in the EleutherAI Discord points out that WinoGrande seems far less saturated than Lambada, but that MT-NLG 530B fails to beat GPT-3 on this task. Again, plot from the GPT-3 paper. MT-NLG 530B reports 0.730 (0-shot), 0.737 (1-shot), 0.789 (few-shot).
This may be due to the data they are using. Some preliminary results in @BigscienceW indicate that downstream performance is heavily dependent on the dataset and that different datasets are better for different tasks (paper coming soon!).
@BigscienceW The evaluation used another framework developed by #EleutherAI and led by @nabla_theta (seriously, he's a phenomenal research engineer) that we call the LM Evaluation Harness. It provides a unified interface for evaluating models on a variety of tasks: github.com/EleutherAI/lm-…
@BigscienceW@nabla_theta With over 100 tasks and counting across a wide variety of domains, I believe that the Eval Harness is the premier framework for evaluating language models in a systematic and uniform way. It's quite gratifying to see @MSFTResearch and @NVIDIAAI promote it.
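For a sense of what using it looks like, here's a rough sketch of driving the harness from Python. Treat the entry point and argument names as assumptions rather than gospel: they have shifted across versions of the repo, so check the README for the current interface.

```python
# Rough sketch only: entry point and argument names are assumptions and have
# changed across versions of lm-evaluation-harness; consult the repo's README.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="gpt2",                     # a small HuggingFace causal LM, for illustration
    tasks=["lambada", "winogrande"],  # task names registered in the harness
    num_fewshot=0,                    # 0-shot, matching most of the reported numbers
)
print(results["results"])             # per-task metrics, e.g. accuracy and stderr
```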
One thing that intrigues me is that they say "[t]o encourage reproducibility, we based our evaluation setting on the open-source project lm-evaluation-harness and made task-specific changes as appropriate to align our setting more closely with prior work."
@nabla_theta has put a lot of work into the design of the tests, and I'm curious what modifications were made. I would guess that the answer is that they used the normalization from the GPT-3 paper. For the evaluation harness we opted to use a different normalization because
we felt it was important to make the evaluation harness fully tokenizer-agnostic. Tokenizers are a mess and most groups build their own. However, the GPT-3 paper chose to (mostly) report scores normalized by the number of tokens, which makes comparisons for people who don't use the same
tokenizer pretty wonky. Specifically, GPT-3 uses the same tokenizer as GPT-2. The GPT-2 tokenizer is probably the most common one used for autoregressive language modeling and is available on HuggingFace: huggingface.co/transformers/m…
@nabla_theta Using the language in the above blog post, I believe the modification that the MT-NLG 530B blog post mentions is using token-length-normalized and unconditional-likelihood-normalized scores instead of the unnormalized and byte-length-normalized scores
@nabla_theta that the evaluation harness reports. This is a good thing to do if you want to be directly compared to GPT-3, but probably a bad thing to do in the long run because tokenizer-dependent normalization is a terrible design choice that accrues significant technical debt.
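To make the normalization point concrete, here's a sketch of the difference using the GPT-2 tokenizer mentioned above. This is my own illustration, not code from the harness or the MT-NLG blog post (and it skips the unconditional-likelihood variant): score a continuation by its log-likelihood, then divide either by its token count (tokenizer-dependent) or by its byte count (tokenizer-agnostic).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustration only: scores one (context, continuation) pair with GPT-2 and
# contrasts per-token vs per-byte normalization. Not the harness's actual code.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def continuation_logprob(context: str, continuation: str) -> float:
    # Tokenize context and continuation separately, then concatenate.
    ctx_ids = tok(context, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs of each token given everything before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    n = cont_ids.shape[1]
    return log_probs[-n:].gather(1, targets[-n:, None]).sum().item()

context, continuation = "The capital of France is", " Paris"
ll = continuation_logprob(context, continuation)
print("raw log-likelihood:   ", ll)
print("per-token normalized: ", ll / len(tok(continuation).input_ids))   # changes if you swap tokenizers
print("per-byte normalized:  ", ll / len(continuation.encode("utf-8")))  # does not
```

Swap in a different tokenizer and the per-token number moves while the per-byte number stays put, which is the whole argument for byte-length normalization.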
@nabla_theta A very interesting quote: "We observed that the model can infer basic mathematical operations from context (sample 1), even when the symbols are badly obfuscated (sample 2). While far from claiming numeracy, the model seems to go beyond only memorization for arithmetic."
@nabla_theta The GPT-3 paper, as well as some other papers, looks at the ability of models to learn new symbology. This continues in that tradition, and I hope that future blog posts and the paper that I assume is on the way investigate this systematically.
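To illustrate the kind of thing they mean, here's a made-up prompt in that spirit; it is emphatically not one of the blog post's samples, and the "%%" symbol is my own invention. The model has to infer from context that the obfuscated symbol stands for addition.

```python
# A hypothetical few-shot prompt with an obfuscated arithmetic symbol ("%%"
# standing in for addition). Not a sample from the MT-NLG blog post.
prompt = """\
3 %% 4 = 7
10 %% 2 = 12
6 %% 5 = 11
8 %% 3 ="""
print(prompt)  # a model inferring the operation from context should continue with " 11"
```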
@nabla_theta The blog post also looks at the sensitivity of the model to rephrasing questions. This is something language models have been repeatedly shown to be very sensitive to, though I am not aware of any rigorous analyses.
That's all for now. I will probably do a deeper dive if and when further details become available, but this is about as much as I can say based on the limited information currently available. Congrats to everyone involved in this project, and I look forward to the paper.
Looks like I beat the official tweet announcing the model, lol.
Okay, time to live tweet my thoughts on @stanfordnlp@StanfordAILab's "Workshop on Foundation Models." A long thread.
First and foremost: please never use the phrase "foundational models" ever again. It's a garbage name that people like @mmitchell_ai, @emilymbender, and @mer__edith have criticized at length. I'll go find some of their comments and link to them later, but the short version is:
@mmitchell_ai@emilymbender@mer__edith
1. There is very little intellectually "foundational" about these models
2. It's not at all clear that GPT-3 and CLIP-DALL-E are the same kind of thing
3. The motivation for this relabeling appears to be entirely about political control over language
Phenomenally interesting paper about how AI researchers talk about what they value in their research. Very glad the authors took the time to do this laborious but important work. I'm going to keep this in my desk so the next time I go on a rant about how ML is prescriptive [1/?]
rather than descriptive, I can whack people who disagree with this paper 😛
I would actually go further than the authors of this paper do (I don't know if they disagree with what I'm about to say, but they didn't say it): I would say that corporate AI research [2/?]
is a propaganda tool that is actively and deliberately wielded to influence policy, regulation, and ethics conversations about technology. The very way mainstream AI research - even "AI Ethics" research - is framed obviates consequences for the companies. [3/?]
Great write-up about the crazy cool art #EleutherAI members have been learning to coax out of GANs with CLIP! Credit assignment with stuff like this is hard, but @jbusted1, @RiversHaveWings, @BoneAmputee, and @kialuy are some of the people who have made this happen.
@jbusted1@RiversHaveWings@BoneAmputee@kialuy They’ve been doing some visionary work with human-guided AI-generated art for the past two months, and it’s phenomenal that they’re starting to get the recognition they deserve. Several more people who either lack Twitters or whose handles I don’t know deserve applause too.