Chinchilla is *hugely* punching above its weight here. Damn.
@SashaMTL@TaliaRinger Hmmmm I coulda sworn I recently read something about how LLMs are Good for the Environment Actually (TM) because they're multitask models and one training run supports a lot of deployment, and yet here we are.
@SashaMTL@TaliaRinger Has anyone studied the value of drawing straight lines on log-log plots through three data points? We see stuff like this all the time, but using these plots to interpolate seems extremely suspicious to me.
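For concreteness, this is all such a fit amounts to. A toy sketch with entirely made-up numbers (none of these values come from any paper), just to show how little data sits behind the line people then read predictions off of:

```python
import numpy as np

# Three made-up (model size, loss) points, purely for illustration.
n    = np.array([4.0e8, 8.0e9, 6.2e10])
loss = np.array([2.45, 2.08, 1.88])

# A straight line in log-log space is a power law: loss ~ a * n^b.
b, log_a = np.polyfit(np.log(n), np.log(loss), deg=1)
predict = lambda size: np.exp(log_a) * size ** b

print(predict(5.4e11))  # prediction far outside the three fitted points
```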
"RoPE has been shown to have better performance on long sequence lengths" is a bit of a funny thing to read since @OfirPress's ALIBI seems to beat it on this specific task.
Auxiliary z-loss shows up again. Has anyone experimented with this? I haven't, but it seems like it's gaining traction.
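For reference, the term itself is tiny. A minimal PyTorch sketch, assuming the formulation described in the PaLM paper (z_loss = 1e-4 · log²Z added to the cross-entropy); the function name and tensor shapes here are my own:

```python
import torch
import torch.nn.functional as F

def lm_loss_with_z_loss(logits, targets, z_coef=1e-4):
    """Cross-entropy plus an auxiliary z-loss term, as described in the PaLM paper."""
    # logits: (batch, seq, vocab); targets: (batch, seq)
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    # log Z is the log of the softmax normalizer; penalizing log(Z)^2
    # nudges Z toward 1, which is reported to help numerical stability.
    log_z = torch.logsumexp(logits, dim=-1)
    return ce + z_coef * (log_z ** 2).mean()
```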
This would be a god-send in a hypothetical world in which people have access to the checkpoints. Expect this to power some interesting papers coming out of Google about training dynamics in the near future though.
Some interesting info contra GPT-NeoX and the BigScience Research Workshop paper on training instability. Both papers claim spikes are caused by bad data. @StasBekman do you recall ever running this experiment? I didn't, though now I'm interested in seeing if I can.
@StasBekman Evaluating memorization is continuing to make it into the mainstream reporting on language models.
@StasBekman Very cool to see them reproduce Carlini et al.'s equation so closely. The actual scaling law curves are usually pretty sensitive to the model choice, but here they aren't. Given the architectural similarity, maybe this really is a big EleutherAI model?
@StasBekman I'm mostly joking. That caught my attention, but drawing a conclusion like that at this stage would be highly premature. It would be very interesting if memorization scaling laws transferred though, especially across models trained on different datasets. 🤔🤔🤔
@StasBekman BIG-Bench was designed to be very hard for current LLMs to solve, but it may not last as long as intended... performance is clearly shifting towards the AI. I'm pretty curious where my task (identify_math_theorems) falls on this plot, but I don't see the raw data anywhere.
Yes, I know it's hard to use if your model isn't in TensorFlow. I would like to get that fixed but it's not my project. I might take a stab at integrating it with the EleutherAI LM Eval Harness again soon though.
@StasBekman Everything in @emilymbender's thread here is spot on. It's also worth pointing out some arithmetic that people regularly slip up on: 30% of documents have names, 10% of names are in BookCorpus -> 26.7% of names are not in BC.
@StasBekman@emilymbender Yes, I know those numbers don't add up; it's actually 11% in BC, I just mistyped. If it were 10%, then 27% of names would not be in BC.
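Spelled out, the arithmetic both tweets are pointing at (30% of documents contain a name, times the fraction of names not in BookCorpus):

```python
# The arithmetic being reconciled in the two tweets above.
0.30 * (1 - 0.11)  # = 0.267 -> the 26.7% figure (11% of names in BookCorpus)
0.30 * (1 - 0.10)  # = 0.270 -> the 27% figure (if 10% were in BookCorpus)
```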
Phenomenal work on the linkage between LM performance and the frequency of data in the pretraining dataset. As far as I am aware, this is the first paper to demonstrate such a connection outside of the work of people like @colinraffel, @katherine1ee, and Carlini on memorization.
To their credit, @OpenAI put a plot in their GPT-3 paper that looks like this. It appears to answer the question, but recent work (esp. @AlexTamkin's newest paper) calls into question the validity of using a present / not present dichotomy to draw conclusions.
@OpenAI@AlexTamkin Evaluating language models is very hard. Even building basic frameworks for few-shot evaluation that work with many LMs and many tasks is a lot of work.
Excited to share my newest paper, "Neural Language Models are Effective Plagiarists" with @EdwardRaffML. We took a dataset of CS 101 assignments and asked "can a language model do a good job solving these with minimal human intervention or knowledge?"
@EdwardRaffML There's been some very interesting work recently on solving college level assignments with transformers, but that work typically uses private models and more complicated pipelines. We wanted to focus on what was available to a random student with the internet, not an AI expert.
@EdwardRaffML To do that, we stuck with #EleutherAI's GPT-J, freely and publicly available at 6b.eleuther.ai. We used no prompting, no finetuning, and no tricks.
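@EdwardRaffML For anyone who wants to poke at the same setup, here's a minimal sketch using the public checkpoint via Hugging Face transformers. The prompt below is a placeholder I made up, not one of the actual assignments from the paper:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Public GPT-J-6B checkpoint; no finetuning, no prompt engineering.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# Placeholder assignment text, not from the paper's dataset.
prompt = "# Write a Python function that returns the n-th Fibonacci number.\ndef fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```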
@MSFTResearch and @NVIDIAAI announce a 530B parameter large language model, 3x larger than GPT-3, achieving superior results on a variety of tasks. Trained on the Pile and evaluated with the Eval Harness, two of #EleutherAI's biggest projects.
@MSFTResearch@NVIDIAAI The Pile is a curated dataset of high quality data for training language models. The project was led by @nabla_theta and myself, with contributions from many others. Released on Jan 1st, 2021, it was the first public massive language model training dataset.
@MSFTResearch@NVIDIAAI@nabla_theta The 530B model is trained predominantly on the Pile, with a couple newer CC scrapes mixed in. The "newer" facet is quite important, as the data in the Pile was collected prior to July 31st, 2020. Any events that happened since that date (most notably the COVID pandemic) are therefore barely represented in the Pile itself.
Okay, time to live tweet my thoughts on @stanfordnlp@StanfordAILab's "Workshop on Foundation Models." A long thread.
First and foremost: please never use the phrase "foundational models" ever again. It's a garbage name that people like @mmitchell_ai@emilymbender@mer__edith have criticized at length. I'll go find some of their comments and link to them later, but the short version is:
@mmitchell_ai@emilymbender@mer__edith
1. There is very little intellectually "foundational" about these models
2. It's not at all clear that GPT-3 and CLIP or DALL-E are the same kind of thing
3. The motivation for this relabeling appears to be entirely about political control over language
Phenomenally interesting paper about how AI researchers talk about what they value in their research. Very glad the authors took the time to do this laborious but important work. I'm going to keep this in my desk so the next time I go on a rant about how ML is prescriptive [1/?]
rather than descriptive I can whack people who disagree with this paper 😛
I would actually go further than the authors of this paper do (I don't know if they disagree with what I'm about to say, but they didn't say it): I would say that corporate AI research [2/?]
is a propaganda tool that is actively and deliberately wielded to influence policy, regulation, and ethics conversations about technology. The very way mainstream AI research - even "AI Ethics" research - is framed obviates consequences for the companies. [3/?]