I like tokens! I lead the Olmo data team at @allen_ai w/ @kylelostat. Open source is fun 🤖☕️🍕🏳️🌈 Opinions sampled from my own stochastic parrot.
Nov 20 • 5 tweets • 3 min read
This release has SO MUCH
• New pretrain corpus, new midtrain data, 380B+ long-context tokens
• 7B & 32B, Base, Instruct, Think, RL Zero
• Close to Qwen 3 performance, but fully open!!
We did a lot of work on algorithmic mixing and upsampling! No more manually tweaking percentages; we use proxy models and a robust eval suite to learn the optimal distribution
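A rough sketch of what "learned" mixing looks like in practice, assuming a naive grid search and hypothetical helpers (`train_proxy`, `evaluate`), not the actual OLMo pipeline: train a cheap proxy model per candidate mixture and keep whichever scores best on the eval suite.

```python
import itertools

SOURCES = ["web", "code", "math", "books"]  # illustrative source names

def candidate_mixtures(step=0.1):
    """Enumerate mixture weights over the sources that sum to 1."""
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    for weights in itertools.product(grid, repeat=len(SOURCES)):
        if abs(sum(weights) - 1.0) < 1e-6:
            yield dict(zip(SOURCES, weights))

def pick_mixture(train_proxy, evaluate):
    """Train a small proxy model per candidate mixture and keep the
    one that scores best on a held-out eval suite."""
    best_score, best_mix = float("-inf"), None
    for mix in candidate_mixtures():
        proxy = train_proxy(mix)   # cheap, small-scale training run
        score = evaluate(proxy)    # aggregate score over the eval suite
        if score > best_score:
            best_score, best_mix = score, mix
    return best_mix
```

In practice the grid search would be replaced by something smarter (regression over proxy results, scaling-law fits, etc.), but the loop above captures the idea: let small models plus evals choose the weights instead of a human.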
Jan 3 • 6 tweets • 4 min read
OLMo 2 tech report is out
We get in the weeds with this one, with 50+ pages on 4 crucial components of the LLM development pipeline:
💪 Stability
How do you ensure that your pretrain run doesn't blow up?
We spent months perfecting our OLMo recipe! It takes many targeted mitigations and expensive experiments to get it right… well, now you don't have to!
May 10, 2023 • 9 tweets • 3 min read
PaLM v2 is out! Join me as I read the technical report (ai.google/static/documen…) for pretraining data insights 👇🧵
First, PaLM v2 is trained on a mixture of web/books/code/conversational data; it uses both English and non-English text
Jul 8, 2020 • 4 tweets • 2 min read
Look, I appreciate the spirit of this work, but non-binary erasure shouldn't have any place at #acl2020nlp
This work makes my blood boil.
aclweb.org/anthology/2020…
NB folx are **not** a variable that you can just throw away for the sake of simplifying your analysis.
And don't get me started on gender-labeling individuals based on their names.