Chinchilla is *hugely* punching above its weight here. Damn.
@SashaMTL@TaliaRinger Hmmmm I coulda sworn I recently read something about how LLMs are Good for the Environment Actually (TM) because they're multitask models and one training run supports a lot of deployment, and yet here we are.
@SashaMTL@TaliaRinger Has anyone studied the value of drawing straight lines on log-log plots through three data points? We see stuff like this all the time, but using these plots to interpolate seems extremely suspicious to me.
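@SashaMTL@TaliaRinger If anyone wants to poke at this, here's the kind of toy exercise I have in mind (all numbers invented): fit the line through three points, then watch how far the extrapolation drifts when one point moves a few percent.

```python
import numpy as np

# Toy illustration: fit a power law (straight line in log-log space) through
# three made-up points, then see how the extrapolation moves when the last
# point is nudged by a few percent. All numbers here are invented.
x = np.array([1e19, 1e20, 1e21])            # e.g. training compute in FLOPs
y = np.array([3.10, 2.60, 2.20])            # e.g. loss
y_nudged = np.array([3.10, 2.60, 2.28])     # last point moved ~3.6%

for label, ys in [("original", y), ("nudged", y_nudged)]:
    slope, intercept = np.polyfit(np.log10(x), np.log10(ys), 1)
    for target in (1e23, 1e25):             # 2 and 4 orders of magnitude out
        pred = 10 ** (slope * np.log10(target) + intercept)
        print(f"{label}: predicted loss at {target:.0e} FLOPs = {pred:.2f}")
```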
"RoPE has been shown to have better performance on long sequence lengths" is a bit of a funny thing to read since @OfirPress's ALIBI seems to beat it on this specific task.
Auxiliary z-loss shows up again. Has anyone experimented with this? I haven't, but it seems like it's gaining traction.
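For anyone who hasn't seen it: the auxiliary z-loss (as described in the PaLM paper) just penalizes the squared log of the softmax normalizer, nudging log(Z) toward 0. A minimal PyTorch sketch; the function name and shapes are illustrative, not any particular codebase's implementation.

```python
import torch
import torch.nn.functional as F

def cross_entropy_with_z_loss(logits, targets, z_loss_coef=1e-4):
    """Cross-entropy plus PaLM-style auxiliary z-loss:
    z_loss = coef * log(Z)^2, where Z = sum(exp(logits)) is the softmax
    normalizer. Keeps the normalizer close to 1, which helps stability.
    logits: (batch, vocab); targets: (batch,)."""
    ce = F.cross_entropy(logits, targets)
    log_z = torch.logsumexp(logits, dim=-1)   # log of the softmax normalizer
    z_loss = z_loss_coef * (log_z ** 2).mean()
    return ce + z_loss
```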
This would be a godsend in a hypothetical world in which people had access to the checkpoints. Expect this to power some interesting papers coming out of Google about training dynamics in the near future, though.
Some interesting info contra GPT-NeoX and the BigScience Workshop paper on training instability. Both papers claim spikes are caused by bad data. @StasBekman do you recall ever running this experiment? I didn't, though now I'm interested in seeing if I can.
@StasBekman Evaluating memorization is continuing to make it into the mainstream reporting on language models.
@StasBekman Very cool to see them reproduce Carlini et al.'s equation so closely. The actual scaling law curves are usually pretty sensitive to the model choice, but here they aren't. Given the architectural similarity, maybe this really is a big EleutherAI model?
@StasBekman I'm mostly joking. That caught my attention, but drawing a conclusion like that at this stage would be highly premature. It would be very interesting if memorization scaling laws transferred though, especially across models trained on different datasets. 🤔🤔🤔
@StasBekman BIG Bench was designed to be very hard for current LLMs to solve, but it may not last as long as intended... performance is clearly shifting in the models' favor. I'm pretty curious where my task (identify_math_theorems) falls on this plot, but I don't see the raw data anywhere.
Yes, I know it's hard to use if your model isn't in TensorFlow. I would like to get that fixed but it's not my project. I might take a stab at integrating it with the EleutherAI LM Eval Harness again soon though.
@StasBekman Everything in @emilymbender's thread here is spot on. It's also worth pointing out some math that people regularly slip up on: 30% of documents have names, 10% of names are in BookCorpus -> 26.7% of names are not in BC.
@StasBekman@emilymbender Yes, I know those numbers don't add up: it's actually 11% in BC, I just mistyped. If it were 10%, then 27% of names would not be in BC.
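@StasBekman@emilymbender Spelled out, the arithmetic behind those figures:

0.30 × (1 − 0.11) = 0.267 → the 26.7% figure
0.30 × (1 − 0.10) = 0.270 → the 27% figure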
"The amount of FLOPs it requires to train a LLM grows quadratically with sequence length" is a false statement for all practical purposes and cannot die quickly enough.
I got distracted by Flash Attention when people asked for an elaboration, but the core reason the claim is false in practice is that attention isn't where most of the operations are at scale. The attached image shows a breakdown of the operations.
While it is conceptually possible to live in a world in which h^2 L scales slower than (v+s)h, in practice you simply don't. In LLaMA 2, hL = 655,360. You need a sequence length in the hundreds of thousands to even show up beyond a rounding error in the equation for parameters.
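To make that concrete, here's the back-of-envelope I have in mind, using LLaMA-2-70B-ish numbers (h = 8192, L = 80, v = 32000 are my assumed values, and 12·L·h² is the vanilla-transformer approximation for block params, ignoring SwiGLU/GQA details):

```python
# How big does s have to be before (v+s)h matters next to the h^2 L term?
h, L, v = 8192, 80, 32_000            # assumed LLaMA-2-70B-ish dimensions

block_params = 12 * L * h**2          # ~64.4B attention + MLP parameters

def embed_params(s):
    return (v + s) * h                # vocab + (hypothetical learned) position embeddings

for s in (4_096, 32_768, 100_000, 500_000):
    frac = embed_params(s) / (block_params + embed_params(s))
    print(f"s = {s:>7,}: (v+s)h = {embed_params(s)/1e9:.2f}B params, {frac:.1%} of total")
```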
Many people seem to think they can't do interesting LLM research outside a large lab, or that they're shoehorned into crowded topics. In reality, there are tons of wide-open, high-value questions. To prove it, I'll be tweeting one per week (every Monday) in 2024.
Please steal my ideas!
The vast majority of these questions can be studied on a couple of commercial GPUs or a TRC grant. If you'd like to work on one of these but want mentorship, I'm open to helping if you show you've put some effort into getting started / have preliminary results.
I made this list of questions in about three hours of work over the course of a week. I threw out another ~10 to trim the list down to 52.
This is your daily reminder that only three orgs have ever trained an LLM and released the model and full data: @AiEleuther @BigscienceW (non-OS license) @togethercompute.
Small orgs like these make science possible in the face of industry power.
Transparency is a key part of both scientific research and the ethical development and deployment of AI technologies. Without transparency into training data, we cannot know whose information and ideologies are being encoded in ML systems. Unfortunately, this work is increasingly hard to do.
Very reasonable frustration with the exploitation of labor for commercial profit is being wrongly directed against research organizations. There's no clearer sign of this than the group responsible for taking the Pile down calling for greater transparency.
Have you ever wanted to do an experiment on LLMs and found that none of the existing model suites met your needs? At @AiEleuther we got tired of this happening, so we designed a model suite whose primary goal is enabling scientific research.
To do this, we identified common obstacles to doing research, such as training on non-public data, not releasing partially trained checkpoints, and not being able to easily tell which data has been seen by which model checkpoints.
Since our goal is to enable scientific research, we decided to control for as much as possible. Drawing on recent (these models were trained last year) LLMs like PaLM, OPT, and GPT-NeoX-20B, we followed best practices for large models, such as GPT-J residuals, RoPE, and Flash Attention.
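For anyone unfamiliar with the "GPT-J residual": attention and MLP both read the same block input and their outputs are summed into the residual stream, instead of running sequentially. A minimal PyTorch sketch of the idea, with vanilla nn.MultiheadAttention standing in for the real RoPE + Flash Attention blocks, so this is illustrative rather than Pythia's actual code:

```python
import torch
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    """Sketch of a GPT-J-style parallel residual transformer block."""
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)
        self.ln_mlp = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x, attn_mask=None):
        h = self.ln_attn(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        mlp_out = self.mlp(self.ln_mlp(x))
        # Parallel residual: both branches are computed from the block input
        # and added to the same residual stream in one step.
        return x + attn_out + mlp_out
```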
Recently I’ve been harping on how “compute optimal” and “best for use” are completely different things. This plot from @CerebrasSystems shows that really well: their compute-optimal models trained on the Pile outperform Pythia for fixed compute but underperform for fixed params.
@CerebrasSystems Pythia models are trained for 300B tokens, while Cerebras’s are compute optimal. As a result, the validation loss for our 2.7B model is virtually identical to their 6.7B model and our 410M model is substantially better than their 1.3B model.
@CerebrasSystems Similar patterns exist for downstream tasks as well (note these plots are flipped, as higher = better for accuracy). Unfortunately these plots don’t have model size labels, but you can figure them out by looking at the previous plot.
Chinchilla-optimal models are very often ACTIVELY BAD FOR APPLICATIONS. A Chinchilla-optimal 2.7B model has seen only 50B tokens, or one sixth of what EleutherAI typically trains small models for. A model trained on so few tokens might be “compute optimal” but it’s very bad.
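Where that 50B comes from, under the usual ~20 tokens-per-parameter reading of Chinchilla:

20 tokens/param × 2.7B params ≈ 54B tokens (call it ~50B), vs the 300B tokens Pythia-class models see: roughly a factor of six.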
“Overtrained” models are almost always better (for fixed parameter counts) than “optimally trained” ones. I am unaware of any evidence ever put forth showing performance decreasing on a 6B+ model due to training “too much”.