Stella Biderman @ NeurIPS
Open source LLMs and interpretability research at @BoozAllen and @AiEleuther. My employer disowns my tweets. She/her

Apr 4, 2022, 16 tweets

Google decided that 137B and 280B weren't enough, so now they've gone and trained a 540B model.

ai.googleblog.com/2022/04/pathwa…

Chinchilla is *hugely* punching above its weight here. Damn.

@SashaMTL @TaliaRinger Hmmmm I coulda sworn I recently read something about how LLMs are Good for the Environment Actually (TM) because they're multitask models and one training run supports a lot of deployment, and yet here we are.

@SashaMTL @TaliaRinger Has anyone studied the value of drawing straight lines on log-log plots through three data points? We see stuff like this all the time, but using these plots to interpolate seems extremely suspicious to me.
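To make the worry concrete, here's a toy version of the exercise (all numbers below are invented for illustration, nothing is from the paper):

```python
import numpy as np

# Toy version of "fit a straight line on a log-log plot through three
# points": regress log(y) on log(x), i.e. fit a power law y = a * x**b.
x = np.array([137e9, 280e9, 540e9])  # hypothetical model sizes
y = np.array([0.52, 0.57, 0.61])     # hypothetical benchmark scores

b, log_a = np.polyfit(np.log(x), np.log(y), 1)  # slope, intercept

def predict(n):
    return np.exp(log_a) * n ** b

# Three points pin down the line almost by construction; they say very
# little about whether the trend holds anywhere outside [137B, 540B].
print(predict(1e12))  # prediction at 1T parameters: treat with suspicion
```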

@SashaMTL @TaliaRinger It's cool to see Ben and @arankomatsuzaki's parallel layers get picked up outside of #EleutherAI!
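For anyone unfamiliar: in a parallel block, the attention and MLP branches both read the same normalized input and their outputs are summed, rather than being applied one after the other. A minimal sketch (module names and sizes here are illustrative, not PaLM's actual code):

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Sketch of the 'parallel' transformer block from GPT-J.

    Serial (standard):     y = x + attn(ln1(x)); out = y + mlp(ln2(y))
    Parallel (GPT-J/PaLM): out = x + attn(ln(x)) + mlp(ln(x))
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln(x)                    # one shared LayerNorm
        attn_out, _ = self.attn(h, h, h)  # attention branch
        # Both branches depend only on h, so their big matmuls can be
        # fused or overlapped, which is the hardware-utilization win.
        return x + attn_out + self.mlp(h)
```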

"RoPE has been shown to have better performance on long sequence lengths" is a bit of a funny thing to read since @OfirPress's ALIBI seems to beat it on this specific task.

Auxiliary z-loss shows up again. Has anyone experimented with this? I haven't, but it seems to be gaining traction.
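The PaLM paper describes it as an auxiliary penalty of 10^-4 · log²(Z), where Z is the softmax normalizer, added to keep the logits from drifting. A minimal sketch (the coefficient matches PaLM's reported value; shapes and usage are illustrative):

```python
import torch
import torch.nn.functional as F

def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Auxiliary z-loss as described in the PaLM paper: penalize
    log^2(Z), where Z is the softmax denominator, which encourages
    the normalizer to stay close to 1 and stabilizes training."""
    log_z = torch.logsumexp(logits, dim=-1)  # log of the softmax denominator
    return coeff * (log_z ** 2).mean()

# Made-up shapes: (batch=4, vocab=32000)
logits = torch.randn(4, 32000)
targets = torch.randint(0, 32000, (4,))
loss = F.cross_entropy(logits, targets) + z_loss(logits)
```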

This would be a godsend in a hypothetical world in which people have access to the checkpoints. Expect this to power some interesting papers on training dynamics coming out of Google in the near future, though.

Some interesting info contra GPT-NeoX and the BigScience Workshop paper on training instability. Both papers attribute loss spikes to bad data. @StasBekman do you recall ever running this experiment? I didn't, though now I'm interested in seeing if I can.
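As I read the PaLM paper, the experiment in question is roughly this (pseudocode sketch; every name below is hypothetical):

```python
# Replay the batches that coincided with a loss spike, but from a
# different model state. If the spike doesn't reproduce, "bad data"
# alone can't be the explanation.

def replay_experiment(load_checkpoint, train_on, spike_step, spike_batches):
    # Restart from a checkpoint taken well before the spike occurred...
    model, optimizer = load_checkpoint(step=spike_step - 100)
    # ...and feed it the exact same batches that "caused" the spike.
    losses = train_on(model, optimizer, spike_batches)
    # PaLM reports the spike does not recur, pointing to the combination
    # of data and parameter state rather than the data alone.
    return losses
```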

@StasBekman Evaluating memorization continues to make its way into mainstream reporting on language models.

@StasBekman Very cool to see them reproduce Carlini et al.'s equation so closely. The actual scaling-law curves are usually pretty sensitive to the choice of model, but here they're not. Given the architectural similarity, maybe this really is a big EleutherAI model?

arxiv.org/abs/2202.07646
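For context, the measurement in that paper is roughly the following extraction test (sketch only; `generate_greedy` is a hypothetical stand-in for whatever decoding API you have):

```python
def fraction_memorized(model, train_sequences, prompt_len=50, cont_len=50):
    """Sketch of the extraction test from Carlini et al. 2022
    (arxiv.org/abs/2202.07646): prompt the model with a prefix drawn
    from its own training data and count how often greedy decoding
    exactly reproduces the true continuation."""
    hits = 0
    for seq in train_sequences:
        prompt = seq[:prompt_len]
        target = seq[prompt_len:prompt_len + cont_len]
        generated = model.generate_greedy(prompt, max_new_tokens=cont_len)
        hits += int(generated == target)
    return hits / len(train_sequences)
```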

@StasBekman I'm mostly joking. That caught my attention, but drawing a conclusion like that at this stage would be highly premature. It would be very interesting if memorization scaling laws transferred though, especially across models trained on different datasets. 🤔🤔🤔

@StasBekman BIG-bench was designed to be very hard for current LLMs to solve, but it may last less time than intended... performance is clearly shifting in the models' favor. I'm pretty curious where my task (identify_math_theorems) falls on this plot, but I don't see the raw data anywhere.

@StasBekman Check out the benchmark here: github.com/google/BIG-ben…

Yes, I know it's hard to use if your model isn't in TensorFlow. I would like to get that fixed but it's not my project. I might take a stab at integrating it with the EleutherAI LM Eval Harness again soon though.
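For anyone who hasn't looked inside the repo: JSON tasks are declarative and look roughly like the sketch below (field names from memory, so treat them as approximate and check the schema in the repo), while programmatic tasks are Python classes written against the TensorFlow-based model API, which is where the friction comes from:

```python
# Illustrative BIG-bench-style JSON task, written as a Python dict.
# The example item is invented, not taken from the real task.
task = {
    "name": "identify_math_theorems",
    "description": "Decide whether a stated 'theorem' is actually true.",
    "keywords": ["mathematics", "logical reasoning"],
    "metrics": ["exact_str_match"],
    "examples": [
        {
            "input": "Theorem: every finite group of prime order is cyclic.",
            "target": "true",
        },
    ],
}
```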

@StasBekman Everything in @emilymbender's thread here is spot on. It's also worth pointing out some math people regularly slip up on: 30% of documents have names, 10% of those names are in BookCorpus -> 26.7% of documents contain names that are not in BC.

@StasBekman @emilymbender Yes, I know those numbers don't add up; it's actually 11% in BC, I just mistyped. If it were 10%, then 27% of documents would contain names not in BC.
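Spelling out the arithmetic behind the correction:

```python
# Percentages taken from the two tweets above.
p_doc_has_name = 0.30  # 30% of documents contain a name
p_name_in_bc = 0.11    # 11% of those names appear in BookCorpus
print(p_doc_has_name * (1 - p_name_in_bc))  # 0.267, i.e. 26.7%
# With the mistyped 10%: 0.30 * 0.90 = 0.27, i.e. 27%
```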

Big OOF
