Stella Biderman
Open source LLMs and interpretability research at @BoozAllen and @AiEleuther. My employers disown my tweets. She/her
Feb 20 6 tweets 2 min read
"The amount of FLOPs it requires to train a LLM grows quadratically with sequence length" is a false statement for all practical purposes and cannot die quickly enough. I got distracted by Flash Attention when people asked for an elaboration, but the core reason this is true is that that's not where most of the operations are at scale. The attached image shows a breakdown of the operations. Image
Jan 1 26 tweets 8 min read
Many people seem to think they can't do interesting LLM research outside a large lab, or that they're shoehorned into crowded topics. In reality, there are tons of wide-open, high-value questions. To prove it, I'll be tweeting one per week (every Monday) in 2024.

Please steal my ideas! The vast majority of these questions can be studied on a couple of commercial GPUs or with a TRC grant. If you'd like to work on one of these but desire mentorship, I'm open to helping if you show you've put some effort into getting started / have preliminary results.
Sep 29, 2023 10 tweets 3 min read
This is your daily reminder that only three orgs have ever trained an LLM and released both the model and the full data: @AiEleuther @BigscienceW (non-OS license) @togethercompute.

Small orgs like these make science possible in the face of industry power. Transparency is a key part of both scientific research and the ethical development and deployment of AI technologies. Without transparency into training data we cannot know whose information and ideologies are being encoded in ML systems. Unfortunately, this work is increasingly hard to do.
Apr 5, 2023 14 tweets 9 min read
Have you ever wanted to do an experiment on LLMs and found that none of the existing model suites met your needs? At @AiEleuther we got tired of this happening, so we designed a model suite with enabling scientific research as its primary goal.

arxiv.org/abs/2304.01373

To do this, we identified common limitations for research, such as training on non-public data, not releasing partially trained checkpoints, and not being able to easily tell which data has been seen by which model checkpoint.
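For instance, here is a minimal sketch of pulling a partially trained Pythia checkpoint from the Hugging Face Hub; the step-numbered revision scheme is how the suite is distributed, as I recall it:

```python
# Load an intermediate Pythia checkpoint by revision name.
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-410m",
    revision="step3000",  # a partially trained checkpoint
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")
```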
Mar 28, 2023 6 tweets 4 min read
Recently I’ve been harping on how “compute optimal” and “best for use” are completely different things. This plot from @CerebrasSystems shows that really well: their compute-optimal models trained on the Pile outperform Pythia for fixed compute but underperform for fixed params.

Pythia models are trained for 300B tokens, while Cerebras’s are compute optimal. As a result, the validation loss for our 2.7B model is virtually identical to that of their 6.7B model, and our 410M model is substantially better than their 1.3B model.
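Rough arithmetic behind the fixed-compute comparison, using the common C ≈ 6 · params · tokens approximation (these numbers are mine, not from the thread):

```python
# Training compute via C ~= 6 * params * tokens (rough approximation).
pythia_2_7b   = 6 * 2.7e9 * 300e9  # Pythia 2.7B, 300B tokens -> ~4.9e21 FLOPs
cerebras_6_7b = 6 * 6.7e9 * 134e9  # 6.7B at ~20 tokens/param -> ~5.4e21 FLOPs
# Comparable compute budgets, very different parameter counts -- hence the
# "fixed compute vs fixed params" distinction.
```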
Feb 25, 2023 5 tweets 4 min read
“Chinchilla optimal” means “I have a fixed TOTAL NUMBER OF FLOPS to spend. What model size and data size should I use to get the lowest loss?”

If you have a limit on either data or model size, then a Chinchilla-optimal model is likely not optimal for you. Chinchilla-optimal models are very often ACTIVELY BAD FOR APPLICATIONS. A Chinchilla-optimal 2.7B model has seen only ~50B tokens, or one sixth of what EleutherAI typically trains small models for. A model trained for so few tokens might be “compute optimal” but it’s very bad.
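The arithmetic, using the ~20-tokens-per-parameter Chinchilla heuristic:

```python
params = 2.7e9
chinchilla_tokens = 20 * params   # ~54B tokens, i.e. the "~50B" above
eleuther_tokens = 300e9           # EleutherAI's usual budget for small models
print(chinchilla_tokens / eleuther_tokens)  # ~0.18, about one sixth
```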
Dec 15, 2022 12 tweets 3 min read
THIS DOES NOT WORK. Don’t fall for this disinformation and destroy your websites and communities. This protest has no bearing on the performance of DALL-E2 and Stable Diffusion. It’s incredibly sad to see a basic lack of knowledge of technology enable shit like this to go viral.

Yes, models like DALL-E2, Stable Diffusion, and Midjourney were trained on images uploaded to crowdsourced websites like Flickr and ArtStation.

HOWEVER post-release changes to these websites do not influence the AIs in any way. They don’t retrieve images on the fly.
Oct 25, 2022 9 tweets 7 min read
ITT: an OAI employee admits that the text-davinci API models are not from their papers.

Until @OpenAI actually documents the connection between the models in their papers and the models released via APIs, #NLProc researchers need to stop using them to do research.

This is not a minor point either. Apparently the text-davinci-002 API “is an instruct model. It doesn't uses a similar but slightly different [sic] training technique but it's not derived from davinci. Hence it's not a fair comparison.”
Apr 20, 2022 12 tweets 16 min read
Over a year ago, several brilliant people at #EleutherAI started plugging VQGAN and CLIP together and getting it to generate images. By now there are many variations and adaptations of the technique out there, but for various reasons the OG paper is only just coming out.

Huge props to @RiversHaveWings, @dashstander, @EricHallahan, @lcastricato, and the many other people who have iterated on and popularized this technique. I came rather late to the party, and mostly made sure that the experiments happened and their great work was showcased.
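For the curious, an illustrative sketch of the core VQGAN+CLIP loop, not the paper's exact code. It assumes `vqgan` is a pretrained VQGAN (e.g. from taming-transformers) whose `decode()` differentiably maps latents to images in [0, 1], and it omits the cutout/augmentation tricks real implementations rely on:

```python
# Optimize a VQGAN latent so the decoded image matches a text prompt per CLIP.
import torch
import clip

device = "cuda"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # avoid fp16/fp32 mixing on GPU

text = clip.tokenize(["a watercolor painting of a lighthouse"]).to(device)
with torch.no_grad():
    text_feat = clip_model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

z = torch.randn(1, 256, 16, 16, device=device, requires_grad=True)  # latent
opt = torch.optim.Adam([z], lr=0.05)

for step in range(500):
    image = vqgan.decode(z)  # assumed: returns a (1, 3, 256, 256) image
    image = torch.nn.functional.interpolate(image, size=224, mode="bilinear")
    img_feat = clip_model.encode_image(image)  # CLIP normalization omitted
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * text_feat).sum()  # maximize CLIP similarity
    opt.zero_grad()
    loss.backward()
    opt.step()
```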
Apr 4, 2022 16 tweets 9 min read
Google decided that 137B and 280B weren't enough, so now they've gone and trained a 540B model.

ai.googleblog.com/2022/04/pathwa…

Chinchilla is *hugely* punching above its weight here. Damn.
Feb 19, 2022 4 tweets 4 min read
Phenomenal work on the linkage between LM performance and the frequency of data in the pretraining dataset. As far as I am aware, this is the first paper to demonstrate such a connection outside of the work on memorization by people like @colinraffel, @katherine1ee, and Carlini.

To their credit, @OpenAI put this plot in their GPT-3 paper. It appears to answer the question, but recent work (esp. @AlexTamkin’s newest paper) calls into question the validity of using a present / not present dichotomy to draw conclusions.
Jan 20, 2022 14 tweets 6 min read
Excited to share my newest paper, "Neural Language Models are Effective Plagiarists" with @EdwardRaffML. We took a dataset of CS 101 assignments and asked "can a language model do a good job solving these with minimal human intervention or knowledge?"

arxiv.org/abs/2201.07406

There's been some very interesting work recently on solving college-level assignments with transformers, but that work typically uses private models and more complicated pipelines. We wanted to focus on what was available to a random student with the internet, not an AI expert.
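A minimal sketch of that kind of setup, using a freely available model through Hugging Face transformers; the model choice and prompt here are illustrative, not the paper's exact pipeline:

```python
# Prompt a free public model with an assignment-style stub and sample a completion.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

prompt = '''def fizzbuzz(n):
    """Print 1..n, replacing multiples of 3 with Fizz and of 5 with Buzz."""
'''
out = model.generate(**tok(prompt, return_tensors="pt"),
                     max_new_tokens=128, do_sample=True, temperature=0.8)
print(tok.decode(out[0], skip_special_tokens=True))
```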
Oct 11, 2021 32 tweets 19 min read
@MSFTResearch and @NVIDIAAI announce a 530B parameter large language model, 3x larger than GPT-3, achieving superior results on a variety of tasks. Trained on the Pile and evaluated on the Eval Harness, two of #EleutherAI’s biggest projects.

A 🧵

developer.nvidia.com/blog/using-dee…

The Pile is a curated dataset of high-quality data for training language models. The project was led by @nabla_theta and myself, with contributions from many others. Released on Jan 1st, 2021, it was the first public massive language model training dataset.
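A minimal sketch of running the Eval Harness through its Python API (`pip install lm-eval`; names follow the current interface as I recall it, which postdates this thread):

```python
# Evaluate a public model on a benchmark task with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/gpt-neo-1.3B",
    tasks=["lambada_openai"],
)
print(results["results"])
```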

Aug 23, 2021 60 tweets 33 min read
Okay, time to live-tweet my thoughts on @stanfordnlp @StanfordAILab's "Workshop on Foundation Models." A long thread.

First and foremost: please never use the phrase "foundational models" ever again. It's a garbage name that people like @mmitchell_ai @emilymbender @mer__edith have criticized at length. I'll go find some of their comments and link to them later, but the short version is:
Jul 3, 2021 26 tweets 9 min read
Phenomenally interesting paper about how AI researchers talk about what they value in their research. Very glad the authors took the time to do this laborious but important work. I'm going to keep this in my desk so the next time I go on a rant about how ML is prescriptive [1/?] rather than descriptive I can whack people who disagree with this paper 😛

I would actually go further than the authors of this paper do (I don't know if they disagree with what I'm about to say, but they didn't say it): I would say that corporate AI research [2/?]
Jul 2, 2021 9 tweets 6 min read
Great write-up about the crazy cool art #EleutherAI members have been learning to coax out of GANs with CLIP! Credit assignment with stuff like this is hard, but @jbusted1 @RiversHaveWings @BoneAmputee and @kialuy are some of the people who have made this happen.

They’ve been doing some visionary work with human-guided AI-generated art for the past two months, and it’s phenomenal that they’re starting to get the recognition they deserve. Several more people who either lack Twitters or whose handles I don’t know deserve applause too.