ITT: an OAI employee admits that the text-davinci API models are not from their papers.
Until @OpenAI actually documents the connection between the models in their papers and the models released via APIs, #NLProc researchers need to stop using them to do research.
@OpenAI This is not a minor point either. Apparently the text-davinci-002 API “is an instruct model. It uses a similar but slightly different [sic] training technique, but it's not derived from davinci. Hence it's not a fair comparison.”
@OpenAI Note that the text-davinciplus-002 model he mentions isn’t publicly available AFAIK. So external researchers trying to study the InstructGPT models are not only running the wrong models, they can’t even study the correct ones.
@OpenAI This is a disaster for scientific research… every external paper that evaluates the impact of instruction-tuning on model performance using the OpenAI API needs to be reviewed and very likely retracted.
@OpenAI This is far from the first time that @OpenAI has made secretive moves that undermine scientific research. The codex models, for example, have been changed over time without any announcement or documentation. And even the original text APIs are problematic, as before @AiEleuther
@OpenAI @AiEleuther did a systematic analysis, many papers were written that made false assumptions about the sizes of the OpenAI GPT-3 models (they’re every other rung of the paper’s size ladder, counting down from davinci = 175B)
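Concretely, here's a minimal sketch of the mapping that analysis implies. The "assumed" values are the sizes papers typically reported; none of this is OpenAI documentation:

```python
# Sizes implied by EleutherAI's eval analysis: the API models appear to be
# every *other* rung of the GPT-3 paper's size ladder
# (175B, 13B, 6.7B, 2.7B, 1.3B, 760M, 350M, 125M), counting down from davinci.
api_model_sizes = {
    "davinci": "175B",
    "curie":   "6.7B",  # widely assumed to be 13B
    "babbage": "1.3B",  # widely assumed to be 6.7B
    "ada":     "350M",  # widely assumed to be 2.7B
}
```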
@OpenAI @AiEleuther For example, @DanHendrycks’s widely cited “Measuring Massive Multitask Language Understanding” paper is wrong about the size of the models it evaluates
@OpenAI @AiEleuther @DanHendrycks Since Dan’s paper only looks at the OpenAI API, there was really no way for him to tell something was wrong. But here are scaling plots for MMLU categories comparing the OpenAI API models to publicly available ones. If MMLU were right about the model sizes, the API models would sit on the same scaling curves as the public ones; they don’t.
Over a year ago, several brilliant people at #EleutherAI started plugging VQGAN and CLIP together and getting it to generate images. By now there are many variations and adaptations of the technique out there, but for various reasons the OG paper is only just coming out
Huge props to @RiversHaveWings, @dashstander, @EricHallahan, @lcastricato, and the many other people who have iterated on and popularized this technique. I came rather late to the party, and mostly made sure that the experiments happened and their great work was showcased
@RiversHaveWings @dashstander @EricHallahan @lcastricato VQGAN-CLIP has really taken on a life of its own, getting picked up and modified in Jupyter notebooks shared on Twitter, Instagram, and other social media platforms
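For the curious, here's a minimal sketch of the CLIP-guided optimization loop at the heart of the technique. To keep it self-contained it optimizes raw pixels; actual VQGAN-CLIP parameterizes the image as VQGAN latent codes and decodes them each step. The prompt and hyperparameters are illustrative:

```python
# Minimal CLIP-guided image optimization, in the spirit of VQGAN-CLIP.
# pip install torch ftfy regex git+https://github.com/openai/CLIP.git
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 for a simple sketch

# CLIP's input normalization constants
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

# Embed the text prompt once
text = clip.tokenize(["a watercolor painting of a lighthouse"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Image parameterization: raw pixels here, a VQGAN latent z in the real thing
param = torch.randn(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([param], lr=0.05)

for step in range(300):
    img = torch.sigmoid(param)                        # squash into [0, 1]
    img_feat = model.encode_image((img - mean) / std)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * text_feat).sum()              # maximize cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()
```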
Chinchilla is *hugely* punching above its weight here. Damn.
@SashaMTL @TaliaRinger Hmmmm I coulda sworn I recently read something about how LLMs are Good for the Environment Actually (TM) because they're multitask models and one training run supports a lot of deployment, and yet here we are.
Phenomenal work on the link between LM performance and the frequency of data in the pretraining dataset. As far as I am aware, this is the first paper to demonstrate such a connection outside of the memorization work of people like @colinraffel, @katherine1ee, and Carlini
To their credit, @OpenAI put this plot in their GPT-3 paper. It appears to answer the question, but recent work (esp. @AlexTamkin’s newest paper) calls into question the validity of using a present / not-present dichotomy to draw conclusions.
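For context, GPT-3's contamination analysis was roughly this binary test: an eval example counts as "present" if it shares any long word n-gram with the training corpus (the paper used 13-grams). A rough sketch of that dichotomy, with helper names of my own:

```python
# A present/not-present contamination check in the style of GPT-3's analysis:
# an eval example is flagged "present" if it shares any single word n-gram
# with the training corpus (GPT-3 used n = 13).
def ngrams(text: str, n: int) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(train_docs, n: int = 13) -> set:
    index = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index

def is_present(example: str, index: set, n: int = 13) -> bool:
    # One shared n-gram flips the label from "clean" to "present",
    # regardless of how much of the example overlaps or how often.
    return bool(ngrams(example, n) & index)
```

The objection is exactly what that last comment says: a single incidental overlap and wholesale duplication get the same label.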
@OpenAI @AlexTamkin Evaluating language models is very hard. Even building basic frameworks for few-shot evaluation that work with many LMs and many tasks is a lot of work.
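To make that concrete, here's a minimal sketch of one piece of such a framework: k-shot multiple-choice scoring by log-likelihood, using gpt2 as a stand-in model. This is my own illustration, not the Eval Harness API:

```python
# Score each answer choice by the log-likelihood the LM assigns to it,
# conditioned on a few-shot prompt; pick the highest-scoring choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Log-likelihood of `choice`, conditioned on `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    # Sum log-probs of the choice tokens only. Assumes the choice starts
    # at a token boundary; a leading space makes that likely with BPE.
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        total += logprobs[0, pos - 1, full_ids[0, pos]].item()
    return total

few_shot = "Q: 2 + 2 = ?\nA: 4\n\nQ: 3 + 5 = ?\nA: 8\n\n"
question = "Q: 7 + 6 = ?\nA:"
choices = [" 13", " 12", " 42"]
print(max(choices, key=lambda c: choice_logprob(few_shot + question, c)))
```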
Excited to share my newest paper, "Neural Language Models are Effective Plagiarists" with @EdwardRaffML. We took a dataset of CS 101 assignments and asked "can a language model do a good job solving these with minimal human intervention or knowledge?"
@EdwardRaffML There's been some very interesting work recently on solving college level assignments with transformers, but that work typically uses private models and more complicated pipelines. We wanted to focus on what was available to a random student with the internet, not an AI expert.
@EdwardRaffML To do that, we stuck with #EleutherAI's GPT-J, freely and publicly available at 6b.eleuther.ai. We used no prompting, no finetuning, and no tricks.
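In code, the setup really is as simple as it sounds. A minimal sketch using the public Hugging Face checkpoint, with an illustrative assignment string and sampling settings:

```python
# No-tricks generation: the raw assignment text goes in, a sampled
# completion comes out. No prompt engineering, no finetuning.
# Note: GPT-J-6B needs ~24 GB of RAM in fp32; swap in a smaller model to test.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

assignment = "Write a Python function that returns the nth Fibonacci number."
inputs = tok(assignment, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8)
print(tok.decode(out[0], skip_special_tokens=True))
```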
@MSFTResearch and @NVIDIAAI announce a 530B parameter large language model, 3x larger than GPT-3, achieving superior results on a variety of tasks. Trained on the Pile and evaluated with the Eval Harness, two of #EleutherAI’s biggest projects.
@MSFTResearch @NVIDIAAI The Pile is a curated dataset of high quality data for training language models. The project was led by @nabla_theta and myself, with contributions from many others. Released on Jan 1st 2021, it was the first public massive language model training dataset
@MSFTResearch @NVIDIAAI @nabla_theta The 530B model is trained predominantly on the Pile, with a couple newer CC scrapes mixed in. The "newer" facet is quite important, as the data in the Pile was collected prior to July 31st, 2020. Any events that happened since that date (most notably the COVID pandemic) are covered only by those newer scrapes, not by the Pile itself.
Okay, time to live tweet my thoughts on @stanfordnlp / @StanfordAILab's "Workshop on Foundation Models." A long thread.
First and foremost: please never use the phrase "foundational models" ever again. It's a garbage name that people like @mmitchell_ai, @emilymbender, and @mer__edith have criticized at length. I'll go find some of their comments and link to them later, but the short version is:
@mmitchell_ai @emilymbender @mer__edith
1. There is very little intellectually "foundational" about these models
2. It's not at all clear that GPT-3 and CLIP/DALL-E are the same kind of thing
3. The motivation for this relabeling appears to be entirely about political control over language