@MSFTResearch and @NVIDIAAI announce a 530B parameter large language model, 3x larger than GPT-3, achieving superior results on a variety of tasks. Trained on the Pile and evaluated on the Eval Harness, two of #EleutherAI’s biggest projects.

A 🧵

developer.nvidia.com/blog/using-dee…
@MSFTResearch @NVIDIAAI The Pile is a curated dataset of high-quality data for training language models. The project was led by @nabla_theta and myself, with contribs from many others. Released on Jan 1st, 2021, it was the first public massive language model training dataset

@MSFTResearch @NVIDIAAI @nabla_theta The 530B model is trained predominantly on the Pile, with a couple of newer Common Crawl scrapes mixed in. The "newer" facet is quite important, as the data in the Pile was collected prior to July 31st, 2020. Any events that happened since that date (most notably much of the COVID pandemic)
@MSFTResearch @NVIDIAAI @nabla_theta are completely absent. Adding in newer data is a very good idea. Figuring out how to keep datasets (and more importantly models) updated is a hard and important question, both as a foundational question about how these models work and as a practical question about how to keep
these very expensive and very slow-to-train models relevant to people's interests months or years later.

I hope that @NVIDIAAI and @MSFTResearch will consider releasing their remix of the Pile, as well as information on how they chose how to go about updating it.
The other major change they made to the Pile is that they seem to have (partially) deduplicated it. When we produced the Pile, the mainstream attitude was that you should train extra on "more important" data, a practice followed by @OpenAI for GPT-3
However, subsequent research led by @katherine1ee and @daphneipp has cast doubt on this. Megatron-Turing NLG 530B takes a hybrid approach, deduplicating the data but then training for extra epochs on some subsets.

arxiv.org/abs/2107.06499
This seems a bit weird to me: if you find @katherine1ee and @daphneipp's work compelling you should deduplicate. If you don't, you should upsample important data. Doing both makes little sense to me, unless the goal is to have more control over the upsampling by removing
unintended duplication of data.

Personally I find @katherine1ee and @daphneipp's paper quite compelling, and #EleutherAI is currently working on a deduplicated version of the Pile which we will use to replicate the paper, and presumably use in future training runs.
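
For intuition, here's a minimal sketch of the near-duplicate side of this kind of deduplication, assuming a MinHash-over-word-n-grams approach similar in spirit to (but vastly simpler than) what @katherine1ee and @daphneipp's pipeline actually does (suffix-array exact matching plus MinHash at scale). Every constant here is made up for illustration:

```python
# Toy near-duplicate detection with MinHash over word 5-grams (standard library only).
# Illustrative sketch: real pipelines pair MinHash/LSH with exact substring matching
# over suffix arrays and run at corpus scale. Assumes docs longer than SHINGLE_SIZE words.
import hashlib
from itertools import combinations

NUM_HASHES = 128   # signature length; more hashes = better Jaccard estimate
SHINGLE_SIZE = 5   # word n-gram size

def shingles(text, n=SHINGLE_SIZE):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text, num_hashes=NUM_HASHES):
    # For each seeded hash function, keep the minimum hash over the document's shingles.
    doc_shingles = shingles(text)
    return [
        min(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest() for s in doc_shingles)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of positions where the minhashes agree estimates the Jaccard
    # similarity of the two shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def near_duplicate_pairs(docs, threshold=0.8):
    sigs = [minhash_signature(d) for d in docs]
    return [(i, j) for i, j in combinations(range(len(docs)), 2)
            if estimated_jaccard(sigs[i], sigs[j]) >= threshold]
```

At Pile scale you obviously can't compare all pairs like this; as I understand it, the real pipelines bucket signatures with LSH so only likely duplicates ever get compared.
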
@katherine1ee @daphneipp MT-NLG 530B comes with a host of interesting information about how it was trained with regard to parallelism and compute. It's good to see that the Megatron-DS codebase is performant at these scales, as #EleutherAI's current codebase is based on it

github.com/EleutherAI/gpt…
@katherine1ee @daphneipp As is work by @huggingface and @BigscienceW. On a first pass, nothing in the details seems particularly notable, except for the omission of any information about instabilities. Training models beyond the 100B scale is extremely hard, and careful work is required to navigate instabilities
@katherine1ee @daphneipp @huggingface @BigscienceW In terms of the evaluation, the blog post shows results competitive with and arguably better than GPT-3. Direct comparison between the models is misleading, as this model was trained for fewer tokens. On LAMBADA the MT-NLG 530B model slightly edges out GPT-3, comparing against the corresponding plot from the GPT-3 paper.
In all likelihood, one of the reasons that they chose to train for as long as they did is that it was necessary to get results that beat GPT-3. Hopefully they will release a paper with fuller data and a more complete picture of how their performance evolves.
User CRG in the EleutherAI discord points out that WinoGrande seems far less saturated than LAMBADA, but that MT-NLG 530B fails to beat GPT-3 on this task. Again, the plot is from the GPT-3 paper. MT-NLG 530B reports 0.730 (0-shot), 0.737 (1-shot), 0.789 (few-shot).
This may be due to the data they are using. Some preliminary results in @BigscienceW indicate that downstream performance is heavily dependent on the dataset and that different datasets are better for different tasks (paper coming soon!).
@BigscienceW The evaluation used another framework developed by #EleutherAI and led by @nabla_theta (seriously, he's a phenomenal research engineer) that we call the LM Evaluation Harness. It provides a unified interface for evaluating models on a variety of tasks
github.com/EleutherAI/lm-…
@BigscienceW @nabla_theta With over 100 tasks and counting across a wide variety of domains, I believe that the Eval Harness is the premier framework for evaluating language models in a systematic and uniform way. It's quite gratifying to see @MSFTResearch and @NVIDIAAI promote it.
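
If you want to run it yourself, recent releases expose a single Python entry point roughly like the sketch below. Caveat: this reflects my reading of current versions of the harness, which has changed a lot since this thread, so treat the function and argument names as assumptions and check the repo's README for the version you install:

```python
# Sketch of driving the LM Evaluation Harness from Python.
# Assumption: entry point and argument names are from recent releases
# (pip install lm-eval), not the 2021-era version discussed in this thread.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                               # HuggingFace causal-LM backend
    model_args="pretrained=gpt2",             # any HF checkpoint name works here
    tasks=["lambada_openai", "winogrande"],   # the two tasks discussed above
    num_fewshot=0,
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```
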
One thing that intrigues me is that they say "[t]o encourage reproducibility, we based our evaluation setting on the open-source project lm-evaluation-harness and made task-specific changes as appropriate to align our setting more closely with prior work."
@nabla_theta has put a lot of work into the design of the tests, and I'm curious what modifications were made. I would guess that the answer is that they used the normalization used in the GPT-3 paper. For the evaluation harness we opted to use a different normalization because
we felt it was important to make the evaluation harness fully tokenizer-agnostic. Tokenizers are a mess and most groups build their own. However, the GPT-3 paper chose to (mostly) report scores normalized by the number of tokens, which makes comparisons pretty wonky for people who don't use the same
tokenizer. Specifically, GPT-3 uses the same tokenizer as GPT-2. The GPT-2 tokenizer is probably the most common one used for autoregressive language modeling and is available on HuggingFace: huggingface.co/transformers/m…
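
To make the "wonky" part concrete, here's a toy example (the string and the second tokenizer are arbitrary choices of mine) of why normalizing by token count ties a score to a particular tokenizer, while byte count does not:

```python
# Same answer string, different token counts under different tokenizers,
# but a fixed byte length.
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

answer = " the Eiffel Tower"

print(len(gpt2_tok(answer, add_special_tokens=False)["input_ids"]))  # GPT-2 BPE token count
print(len(bert_tok(answer, add_special_tokens=False)["input_ids"]))  # WordPiece token count (differs)
print(len(answer.encode("utf-8")))                                   # byte count (tokenizer-agnostic)
```
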
Multiple choice normalization for language models is a pretty complicated endeavor. @nabla_theta discusses the four commonly used methods in: blog.eleuther.ai/multiple-choic…
@nabla_theta Using the language in the above blog post, I believe the modification that the MT-NLG 530B blog post mentions is using token-length normalized and unconditional-likelihood normalized scores instead of the unnormalized and byte-length normalized scores
@nabla_theta that the evaluation harness reports. This is a good thing to do if you want to be directly compared to GPT-3, but probably a bad thing to do in the long run because tokenizer-dependent normalization is a terrible design choice that accrues significant technical debt.
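
To make the four scoring schemes concrete, here's a toy sketch of scoring a single multiple-choice question with a small HF model. The model, prompt, and "Answer:" prefix for the unconditional score are my own illustrative choices, not a description of what either codebase does exactly:

```python
# Toy illustration of the four multiple-choice scoring schemes discussed above:
# unnormalized, token-length normalized, byte-length normalized, and
# unconditional-likelihood normalized. Note: this tokenizes context and continuation
# separately, a simplification the real harness is more careful about.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(context, continuation):
    """Sum of log-probs the model assigns to `continuation` following `context`."""
    ctx_ids = tok(context, return_tensors="pt")["input_ids"]
    cont_ids = tok(continuation, return_tensors="pt")["input_ids"]
    input_ids = torch.cat([ctx_ids, cont_ids], dim=1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(input_ids).logits, dim=-1)
    total = 0.0
    for i in range(ctx_ids.shape[1], input_ids.shape[1]):
        total += logprobs[0, i - 1, input_ids[0, i]].item()  # token i is predicted at position i-1
    return total, cont_ids.shape[1]

context = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " London", " Berlin"]

for choice in choices:
    lp, n_tokens = continuation_logprob(context, choice)
    uncond_lp, _ = continuation_logprob("Answer:", choice)  # likelihood of the choice with no question
    print(
        f"{choice!r:12s}",
        f"raw={lp:.2f}",
        f"per_token={lp / n_tokens:.2f}",                    # tokenizer-dependent
        f"per_byte={lp / len(choice.encode('utf-8')):.2f}",  # tokenizer-agnostic
        f"vs_uncond={lp - uncond_lp:.2f}",                   # unconditional-likelihood normalized
    )
```
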
@nabla_theta A very interesting quote: "We observed that the model can infer basic mathematical operations from context (sample 1), even when the symbols are badly obfuscated (sample 2). While far from claiming numeracy, the model seems to go beyond only memorization for arithmetic."
@nabla_theta The GPT-3 paper, as well as some other papers, looks at the ability of models to learn new symbology. This continues in that tradition, and I hope that future blog posts and the paper that I assume is on the way investigate this systematically.
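
To give a flavor of what such a probe looks like, here's a toy construction of an obfuscated-arithmetic prompt. This is my own illustration, not a reproduction of the blog post's samples; the stand-in symbols and number ranges are arbitrary:

```python
# Toy "obfuscated symbol" arithmetic prompt, in the spirit of the samples
# described in the blog post: few-shot examples teach the model that the
# made-up symbols stand for + and =, then ask it to complete one more.
import random

def obfuscated_addition_prompt(n_examples=4, seed=0):
    rng = random.Random(seed)
    plus, equals = "@#", "&&"   # arbitrary stand-ins for + and =
    lines = []
    for _ in range(n_examples):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        lines.append(f"{a} {plus} {b} {equals} {a + b}")
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    lines.append(f"{a} {plus} {b} {equals}")   # the model should complete this with a + b
    return "\n".join(lines), a + b

prompt, target = obfuscated_addition_prompt()
print(prompt)
print("expected completion:", target)
```
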
@nabla_theta The blog post also looks at the sensitivity of the model to rephrasing questions. This is something language models have been repeatedly shown to be very sensitive to, though I am not aware of any rigorous analyses.
This issue is also studied in a recent paper by Noam Kolt (@UTLaw, @TorontoSRI) that investigates the performance of GPT-3 in a legal analysis setting
talk: youtube.com/c/CentreforEth…
paper: werobot2021.com/wp-content/upl…
That's all for now. I will probably do a deeper dive if and when further details become available, but this is about as much as I can say based on the limited information currently available. Congrats to everyone involved in this project, and I look forward to the paper.
Looks like I beat the official tweet announcing the model, lol.

For updates on DL and especially NLP so cutting edge they beat official PR, come join us in the EleutherAI discord! There's a link at eleuther.ai
