Stella Biderman
Apr 4, 2022 · 16 tweets
Google decided that 137B and 280B weren't enough, so now they've gone and trained a 540B model.

ai.googleblog.com/2022/04/pathwa…
Chinchilla is *hugely* punching above its weight here. Damn.
@SashaMTL @TaliaRinger Hmmmm I coulda sworn I recently read something about how LLMs are Good for the Environment Actually (TM) because they're multitask models and one training run supports a lot of deployment, and yet here we are.
@SashaMTL @TaliaRinger Has anyone studied the value of drawing straight lines on log-log plots through three data points? We see stuff like this all the time, but using these plots to interpolate seems extremely suspicious to me.
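To make the worry concrete, here is the whole exercise in a few lines of Python with invented numbers: three roughly power-law-looking points will give a near-perfect straight-line fit in log-log space, so the fit quality tells you almost nothing about whether the line means anything away from those points.

```python
import numpy as np

# Three made-up (parameter count, loss) points, purely for illustration.
sizes = np.array([8e9, 62e9, 540e9])
losses = np.array([2.10, 1.85, 1.66])

log_n, log_l = np.log(sizes), np.log(losses)
slope, intercept = np.polyfit(log_n, log_l, 1)

pred = intercept + slope * log_n
r2 = 1 - np.sum((log_l - pred) ** 2) / np.sum((log_l - log_l.mean()) ** 2)
print(f"R^2 of the 3-point fit: {r2:.3f}")  # looks near-perfect regardless
print(f"extended prediction at 5T params: {np.exp(intercept + slope * np.log(5e12)):.3f}")
```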
@SashaMTL @TaliaRinger It's cool to see Ben and @arankomatsuzaki's parallel layers get picked up outside of #EleutherAI!

"RoPE has been shown to have better performance on long sequence lengths" is a bit of a funny thing to read since @OfirPress's ALIBI seems to beat it on this specific task.
Auxiliary z-loss shows up again. Has anyone experimented with this? I haven't, but it seems like it's gaining traction.
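For anyone who wants to try it: as described in the PaLM paper, the auxiliary z-loss adds a penalty of 1e-4 · log²(Z) to the cross-entropy loss, where Z is the softmax normalizer, to keep log(Z) close to zero. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def loss_with_z_loss(logits: torch.Tensor, targets: torch.Tensor,
                     z_coeff: float = 1e-4) -> torch.Tensor:
    """Cross-entropy plus the PaLM-style auxiliary z-loss.

    logits:  (batch, vocab) unnormalized scores
    targets: (batch,) class indices
    """
    ce = F.cross_entropy(logits, targets)
    log_z = torch.logsumexp(logits, dim=-1)    # log of the softmax normalizer Z
    return ce + z_coeff * (log_z ** 2).mean()  # penalize log(Z)^2 toward zero
```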
This would be a godsend in a hypothetical world in which people have access to the checkpoints. Expect this to power some interesting papers coming out of Google about training dynamics in the near future though.
Some interesting info contra GPT-NeoX and the BigScience Research Workshop paper on training instability. Both papers claim spikes are caused by bad data. @StasBekman do you recall ever running this experiment? I didn't, though now I'm interested in seeing if I can.
@StasBekman Evaluating memorization is continuing to make it into the mainstream reporting on language models.
@StasBekman Very cool to see them reproduce Carlini et al.'s equation so closely. The actual scaling law curves are usually pretty sensitive to the model choice, but here they aren't. Given the architectural similarity, maybe this really is a big EleutherAI model?

arxiv.org/abs/2202.07646
@StasBekman I'm mostly joking. That caught my attention, but drawing a conclusion like that at this stage would be highly premature. It would be very interesting if memorization scaling laws transferred though, especially across models trained on different datasets. 🤔🤔🤔
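The basic measurement in this line of work is an extraction test: prompt the model with the first k tokens of a training sequence and check whether greedy decoding reproduces the true continuation exactly. A rough sketch, assuming a Hugging Face causal LM; the model name and k below are placeholders, not the setup from either paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def is_memorized(model, tokenizer, text: str, k: int = 50) -> bool:
    """Given the first k tokens of a training sequence, does greedy decoding
    reproduce the next k tokens exactly?"""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    if ids.numel() < 2 * k:
        return False
    prompt, target = ids[:k], ids[k:2 * k]
    with torch.no_grad():
        out = model.generate(prompt.unsqueeze(0), max_new_tokens=k, do_sample=False)
    return torch.equal(out[0, k:2 * k], target)

# Placeholder model; any causal LM whose training data you can sample from works.
name = "EleutherAI/gpt-neo-125m"
model = AutoModelForCausalLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
# is_memorized(model, tokenizer, passage_sampled_from_the_training_set)
```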
@StasBekman BIG Bench was designed to be very hard for current LLMs to solve, but it may last less time than intended... performance is clearly shifting towards the AI. I'm pretty curious where my task (identify_math_theorems) falls on this plot, but I don't see the raw data anywhere.
@StasBekman Check out the benchmark here: github.com/google/BIG-ben…

Yes, I know it's hard to use if your model isn't in TensorFlow. I would like to get that fixed but it's not my project. I might take a stab at integrating it with the EleutherAI LM Eval Harness again soon though.
@StasBekman Everything in @emilymbender's thread here is spot on. It's also worth pointing out some math that people regularly slip up on: 30% of documents have names, 10% of names are in BookCorpus -> 26.7% of names are not in BC.

@StasBekman @emilymbender Yes, I know those numbers don't add up as written: it's actually 11% in BC, I just mistyped. If it were 10%, then the figure would be 27%.
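Spelling out that arithmetic, since it trips people up (as I read it, the 26.7% is the share of documents that contain a name not found in BookCorpus):

```python
doc_rate = 0.30  # fraction of documents that contain a name
in_bc = 0.11     # fraction of those names that appear in BookCorpus (corrected number)
print(doc_rate * (1 - in_bc))  # 0.267 -> the 26.7% figure
print(doc_rate * (1 - 0.10))   # 0.270 -> the 27% you get with the mistyped 10%
```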

More from @BlancheMinerva

Feb 20
"The amount of FLOPs it requires to train a LLM grows quadratically with sequence length" is a false statement for all practical purposes and cannot die quickly enough.
I got distracted by Flash Attention when people asked for an elaboration, but the core reason the statement is false in practice is that attention isn't where most of the operations are at scale. The attached image shows a breakdown of the operations.
While it is conceptually possible to live in a world in which h^2 L scales slower than (v+s)h, in practice you simply don't. In LLaMA 2, hL = 655,360. You need a sequence length in the hundreds of thousands to even show up beyond a rounding error in the equation for parameters.
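A back-of-the-envelope version of that breakdown: roughly 6 FLOPs per parameter per token for the dense matmuls, plus roughly 12·L·s·h per token for the sequence-quadratic attention-score terms (no activation recomputation). The LLaMA-2-70B-ish shapes below are assumed for illustration only.

```python
def flops_per_token(n_params: float, n_layers: int, d_model: int, seq_len: int):
    """Rough fwd+bwd FLOPs per token: ~6 per parameter for the dense matmuls,
    plus ~12 * L * s * h for the attention-score terms that scale with sequence length."""
    dense = 6 * n_params
    attn_quadratic = 12 * n_layers * seq_len * d_model
    return dense, attn_quadratic

# LLaMA-2-70B-ish shapes: 70B params, 80 layers, hidden size 8192.
for s in (4_096, 32_768, 131_072):
    dense, quad = flops_per_token(70e9, 80, 8192, s)
    print(f"s={s:>7}: sequence-quadratic share = {quad / (dense + quad):.1%}")
```

At a typical 4k context the quadratic term is well under a tenth of the total; it only becomes comparable to the dense matmuls once the sequence length reaches the tens of thousands.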
Jan 1
Many people seem to think they can't do interesting LLM research outside a large lab, or are shoehorned into crowded topics. In reality, there are tons of wide-open high value questions. To prove it, I'll be tweeting one per week (every Monday) in 2024.

Please steal my ideas!
The vast majority of these questions can be studied on a couple commercial GPUs or a TRC grant. If you'd like to work on one of these but desire mentorship, I'm open to helping if you show you've put some effort into getting started / have preliminary results.
I made this list of questions in about three hours of work over the course of a week. I threw out another ~10 to trim the list down to 52.

SHA3-512: 95736531FB2A43256827D139A7A87EFD2C88FF9CDD43FFACE308880E8117A6DCE89FC07CD3276792B8F7D67886C9EE5311F961E42905FEF6B59D00A85E6DA357
Sep 29, 2023
This is your daily reminder that only three orgs have ever trained a LLM and released the model and full data: @AiEleuther @BigscienceW (non-OS license) @togethercompute.

Small orgs like these make science possible in the face of industry power.
Transparency is a key part of both scientific research and ethical development and deployment of AI technologies. Without transparency into training data we cannot know whose information and ideologies are being encoded in ML systems. Unfortunately, this work is increasingly hard
Very reasonable frustration with the exploitation of labor for commercial profit is being wrongly directed against research organizations. There's no clearer sign of this than the group responsible for taking the Pile down calling for greater transparency

rettighedsalliancen.com/the-books3-cas…
Apr 5, 2023
Have you ever wanted to do an experiment on LLMs and found that none of the existing model suites met your needs? At @AiEleuther we got tired of this happening and so designed a model suite that centers enabling scientific research as its primary goal

arxiv.org/abs/2304.01373
To do this we identified common limitations for doing research, such as training on non-public data, not releasing partially trained checkpoints, or not being able to easily know which data has been seen by which model checkpoints.
Since our goal is to enable scientific research, we decided to control for as much as possible. Drawing on recent (these models were trained last year) LLMs like PaLM, OPT, and GPT-NeoX-20B, we followed best practices for large models such as GPT-J residuals, RoPE, and Flash Attention.
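One concrete payoff of releasing partially trained checkpoints: the intermediate Pythia checkpoints are published as step-numbered branch revisions on the Hugging Face Hub, so pulling a few snapshots for a training-dynamics experiment is a handful of lines (the step names below are examples; check the model repo for the full list):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-410m"
tokenizer = AutoTokenizer.from_pretrained(name)

# Each intermediate checkpoint lives on its own "stepN" branch, and the data
# ordering is fixed, so you know exactly what each snapshot has seen.
for step in ("step1000", "step16000", "step143000"):
    model = AutoModelForCausalLM.from_pretrained(name, revision=step)
    # ... run whatever evaluation you care about on this snapshot ...
    print(step, sum(p.numel() for p in model.parameters()), "parameters loaded")
```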
Mar 28, 2023
Recently I’ve been harping on how “compute optimal” and “best for use” are completely different things. This plot from @CerebrasSystems shows that really well: their compute-optimal models trained on the Pile outperform Pythia for fixed compute but underperform for fixed params.
@CerebrasSystems Pythia models are trained for 300B tokens, while Cerebras’s are compute optimal. As a result, the validation loss for our 2.7B model is virtually identical to their 6.7B model and our 410M model is substantially better than their 1.3B model.
@CerebrasSystems Similar patterns exist for downstream loss as well (note these plots are flipped, as higher = better for accuracy). Unfortunately these plots don’t have model size labels, but you can figure them out by looking at the previous plot
Feb 25, 2023
“Chinchilla optimal” means “I have a fixed TOTAL NUMBER OF FLOPS to spend. What model size and data size should I use to get the lowest loss?”

If you have a limit on either data or model size, then a chinchilla optimal model is likely not optimal for you.
Chinchilla-optimal models are very often ACTIVELY BAD FOR APPLICATIONS. A Chinchilla-optimal 2.7B model has seen only 50B tokens, or one sixth of what EleutherAI typically trains small models for. A model trained for so few tokens might be “compute optimal” but it’s very bad.
“Overtrained” models are almost always better (for fixed parameter counts) than “optimally trained” ones. I am unaware of any evidence that has ever been put forth showing performance decreasing on a 6B+ model due to training “too much”
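To put rough numbers on that, using the common C ≈ 6·N·D compute approximation and the ~20 tokens-per-parameter rule of thumb from the Chinchilla paper (both are approximations, not the paper's exact fit):

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rough Chinchilla-optimal token budget: ~20 tokens per parameter."""
    return tokens_per_param * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard training-compute approximation: C ~= 6 * N * D."""
    return 6 * n_params * n_tokens

n = 2.7e9
d_opt = chinchilla_tokens(n)  # ~5.4e10 tokens, i.e. the ~50B figure above
print(f"Chinchilla-optimal tokens for 2.7B params: {d_opt:.2e}")
print(f"a 300B-token run is {300e9 / d_opt:.1f}x that budget")  # ~5.6x, the 'one sixth'
print(f"compute for the 300B-token run: {train_flops(n, 300e9):.2e} FLOPs")
```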