Chinchilla is *hugely* punching above its weight here. Damn.
@SashaMTL@TaliaRinger Hmmmm I coulda sworn I recently read something about how LLMs are Good for the Environment Actually (TM) because they're multitask models and one training run supports a lot of deployment, and yet here we are.
@SashaMTL@TaliaRinger Has anyone studied the value of drawing straight lines on log-log plots through three data points? We see stuff like this all the time, but using these plots to interpolate seems extremely suspicious to me.
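@SashaMTL@TaliaRinger If anyone wants to poke at this, here's the kind of toy exercise I have in mind (all numbers invented): fit the line through three points, then watch how far the extrapolation drifts when one point moves a few percent.

```python
import numpy as np

# Toy illustration: fit a power law (straight line in log-log space) through
# three made-up points, then see how the extrapolation moves when the last
# point is nudged by a few percent. All numbers here are invented.
x = np.array([1e19, 1e20, 1e21])            # e.g. training compute in FLOPs
y = np.array([3.10, 2.60, 2.20])            # e.g. loss
y_nudged = np.array([3.10, 2.60, 2.28])     # last point moved ~3.6%

for label, ys in [("original", y), ("nudged", y_nudged)]:
    slope, intercept = np.polyfit(np.log10(x), np.log10(ys), 1)
    for target in (1e23, 1e25):             # 2 and 4 orders of magnitude out
        pred = 10 ** (slope * np.log10(target) + intercept)
        print(f"{label}: predicted loss at {target:.0e} FLOPs = {pred:.2f}")
```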
"RoPE has been shown to have better performance on long sequence lengths" is a bit of a funny thing to read since @OfirPress's ALIBI seems to beat it on this specific task.
Auxiliary z-loss shows up again. Has anyone experimented with this? I haven't, but it seems like it's gaining traction.
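For anyone who hasn't seen it: the auxiliary z-loss (as described in the PaLM paper) just penalizes the squared log of the softmax normalizer, nudging log(Z) toward 0. A minimal PyTorch sketch; the function name and shapes are illustrative, not any particular codebase's implementation.

```python
import torch
import torch.nn.functional as F

def cross_entropy_with_z_loss(logits, targets, z_loss_coef=1e-4):
    """Cross-entropy plus PaLM-style auxiliary z-loss:
    z_loss = coef * log(Z)^2, where Z = sum(exp(logits)) is the softmax
    normalizer. Keeps the normalizer close to 1, which helps stability.
    logits: (batch, vocab); targets: (batch,)."""
    ce = F.cross_entropy(logits, targets)
    log_z = torch.logsumexp(logits, dim=-1)   # log of the softmax normalizer
    z_loss = z_loss_coef * (log_z ** 2).mean()
    return ce + z_loss
```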
This would be a godsend in a hypothetical world in which people had access to the checkpoints. Expect this to power some interesting papers coming out of Google about training dynamics in the near future, though.
Some interesting info contra GPT-NeoX and the BigScience Workshop paper on training instability. Both papers claim spikes are caused by bad data. @StasBekman do you recall ever running this experiment? I didn't, though now I'm interested in seeing if I can.
@StasBekman Evaluating memorization is continuing to make it into the mainstream reporting on language models.
@StasBekman Very cool to see them reproduce Carlini et al.'s equation so closely. The actual scaling law curves are usually pretty sensitive to the model choice, but here they aren't. Given the architectural similarity, maybe this really is a big EleutherAI model?
@StasBekman I'm mostly joking. That caught my attention, but drawing a conclusion like that at this stage would be highly premature. It would be very interesting if memorization scaling laws transferred though, especially across models trained on different datasets. 🤔🤔🤔
@StasBekman BIG Bench was designed to be very hard for current LLMs to solve, but it may not last as long as intended... performance is clearly shifting in the models' favor. I'm pretty curious where my task (identify_math_theorems) falls on this plot, but I don't see the raw data anywhere.
Yes, I know it's hard to use if your model isn't in TensorFlow. I would like to get that fixed but it's not my project. I might take a stab at integrating it with the EleutherAI LM Eval Harness again soon though.
@StasBekman Everything in @emilymbender's thread here is spot on. It's also worth pointing out some math that people regularly slip up on: 30% of documents have names, 10% of names are in BookCorpus -> 26.7% of names are not in BC.
@StasBekman@emilymbender Yes, I know those numbers don't add up: it's actually 11% in BC, I just mistyped. If it were 10%, then 27% of names would not be in BC.
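@StasBekman@emilymbender Spelled out, the arithmetic behind those figures:

0.30 × (1 − 0.11) = 0.267 → the 26.7% figure
0.30 × (1 − 0.10) = 0.270 → the 27% figure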
"The amount of FLOPs it requires to train a LLM grows quadratically with sequence length" is a false statement for all practical purposes and cannot die quickly enough.
I got distracted by Flash Attention when people asked for an elaboration, but the core reason the claim is false in practice is that attention isn't where most of the operations are at scale. The attached image shows a breakdown of the operations.
While it is conceptually possible to live in a world in which h^2 L scales slower than (v+s)h, in practice you simply don't. In LLaMA 2, hL = 655,360. You need a sequence length in the hundreds of thousands to even show up beyond a rounding error in the equation for parameters.
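To make that concrete, here's the back-of-envelope I have in mind, using LLaMA-2-70B-ish numbers (h = 8192, L = 80, v = 32000 are my assumed values, and 12·L·h² is the vanilla-transformer approximation for block params, ignoring SwiGLU/GQA details):

```python
# How big does s have to be before (v+s)h matters next to the h^2 L term?
h, L, v = 8192, 80, 32_000            # assumed LLaMA-2-70B-ish dimensions

block_params = 12 * L * h**2          # ~64.4B attention + MLP parameters

def embed_params(s):
    return (v + s) * h                # vocab + (hypothetical learned) position embeddings

for s in (4_096, 32_768, 100_000, 500_000):
    frac = embed_params(s) / (block_params + embed_params(s))
    print(f"s = {s:>7,}: (v+s)h = {embed_params(s)/1e9:.2f}B params, {frac:.1%} of total")
```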
Many people seem to think they can't do interesting LLM research outside a large lab, or that they're shoehorned into crowded topics. In reality, there are tons of wide-open, high-value questions. To prove it, I'll be tweeting one per week (every Monday) in 2024.
Please steal my ideas!
The vast majority of these questions can be studied on a couple of commercial GPUs or a TRC grant. If you'd like to work on one of these but want mentorship, I'm open to helping if you show you've put some effort into getting started / have preliminary results.
I made this list of questions in about three hours of work over the course of a week. I threw out another ~10 to trim the list down to 52.
This is your daily reminder that only three orgs have ever trained an LLM and released the model and full data: @AiEleuther @BigscienceW (non-OS license) @togethercompute.
Small orgs like these make science possible in the face of industry power.
Transparency is a key part of both scientific research and the ethical development and deployment of AI technologies. Without transparency into training data, we cannot know whose information and ideologies are being encoded in ML systems. Unfortunately, this work is increasingly hard to do.
Very reasonable frustration with the exploitation of labor for commercial profit is being wrongly directed against research organizations. There's no clearer sign of this than the group responsible for taking the Pile down calling for greater transparency.
Have you ever wanted to do an experiment on LLMs and found that none of the existing model suites met your needs? At @AiEleuther we got tired of this happening, so we designed a model suite whose primary goal is enabling scientific research.
To do this, we identified common obstacles to doing research, such as training on non-public data, not releasing partially trained checkpoints, and not being able to easily tell which data has been seen by which model checkpoints.
Since our goal is to enable scientific research, we decided to control for as much as possible. Drawing on recent (these models were trained last year) LLMs like PaLM, OPT, and GPT-NeoX-20B, we followed best practices for large models, such as GPT-J residuals, RoPE, and Flash Attention.
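For anyone unfamiliar with the "GPT-J residual": attention and MLP both read the same block input and their outputs are summed into the residual stream, instead of running sequentially. A minimal PyTorch sketch of the idea, with vanilla nn.MultiheadAttention standing in for the real RoPE + Flash Attention blocks, so this is illustrative rather than Pythia's actual code:

```python
import torch
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    """Sketch of a GPT-J-style parallel residual transformer block."""
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)
        self.ln_mlp = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x, attn_mask=None):
        h = self.ln_attn(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        mlp_out = self.mlp(self.ln_mlp(x))
        # Parallel residual: both branches are computed from the block input
        # and added to the same residual stream in one step.
        return x + attn_out + mlp_out
```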
Recently I’ve been harping on how “compute optimal” and “best for use” are completely different things. This plot from @CerebrasSystems shows that really well: their compute-optimal models trained on the Pile outperform Pythia for fixed compute but underperform for fixed params.
@CerebrasSystems Pythia models are trained for 300B tokens, while Cerebras’s are compute optimal. As a result, the validation loss for our 2.7B model is virtually identical to their 6.7B model and our 410M model is substantially better than their 1.3B model.
@CerebrasSystems Similar patterns exist for downstream tasks as well (note these plots are flipped, as higher = better for accuracy). Unfortunately these plots don’t have model size labels, but you can figure them out by looking at the previous plot.
Chinchilla-optimal models are very often ACTIVELY BAD FOR APPLICATIONS. A Chinchilla-optimal 2.7B model has seen only 50B tokens, or one sixth of what EleutherAI typically trains small models for. A model trained on so few tokens might be “compute optimal” but it’s very bad.
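Where that 50B comes from, under the usual ~20 tokens-per-parameter reading of Chinchilla:

20 tokens/param × 2.7B params ≈ 54B tokens (call it ~50B), vs the 300B tokens Pythia-class models see: roughly a factor of six.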
“Overtrained” models are almost always better (for fixed parameter counts) than “optimally trained” ones. I am unaware of any evidence ever put forth showing performance decreasing on a 6B+ model due to training “too much”.