Alisa Liu
PhD student at @uwcse @uwnlp
Mar 21
We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words.

When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵

[Image: segmentation of the sentence "By the way, I am a fan of the Milky Way" under BPE and SuperBPE.]

This started with a curiosity💡: why do all LLMs limit tokens to *parts* of whitespace-delimited words? After all, many word sequences (e.g. "by the way") function as single units. Different languages can also express the same meaning in one or several words.
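For intuition, here's a toy Python sketch (not our actual trainer) contrasting the two regimes: standard BPE pretokenizes on whitespace, so no merge can ever cross a word boundary, while passing `allow_superwords=True` treats the corpus as one stream (with ▁ marking spaces), so merges can form superword tokens. The flag name and the ▁ convention here are illustrative, not from the paper.

```python
from collections import Counter

def bpe_merges(corpus: str, num_merges: int, allow_superwords: bool = False):
    """Toy BPE trainer. With allow_superwords=False, the corpus is first
    split on whitespace, so merges stay within words. With True, the whole
    corpus becomes a single character stream (▁ = space), so frequent
    multi-word units can become single tokens."""
    if allow_superwords:
        pieces = ["▁".join(corpus.split())]   # one stream, spaces kept as ▁
    else:
        pieces = corpus.split()               # whitespace pretokenization
    seqs = [list(p) for p in pieces]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        # Apply the new merge everywhere it occurs.
        for seq in seqs:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges
```

On a corpus where "by the way" is frequent, the superword variant will eventually learn "by▁the▁way" as a single token; the whitespace-pretokenized variant never can.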
Jan 16, 2022
We introduce a new paradigm for dataset creation based on human 🧑‍💻 and machine 🤖 collaboration, which brings together the generative strength of LMs and the evaluative strength of humans. And we collect 🎉 WaNLI, a dataset of 108K NLI examples! 🧵

Paper: swabhs.com/assets/pdf/wan…

[Image: diagram of the worker-AI collaboration pipeline.]

Our pipeline starts with an existing dataset (MNLI), and uses data maps 📜 to automatically identify pockets of examples that demonstrate challenging 🧐 reasoning patterns relative to a trained model. Then we use GPT-3 to generate new examples likely to have the same pattern. 2/

[Image: table of MNLI seed examples.]
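For a rough sense of the automated stages, here's a Python sketch. It assumes data-map statistics as in dataset cartography (confidence = mean probability of the gold label across training epochs, variability = its standard deviation); the function names, selection rule, and prompt format are illustrative simplifications, not our released code.

```python
import numpy as np

def data_map_stats(gold_probs: np.ndarray):
    """gold_probs: shape (num_epochs, num_examples), the trained model's
    probability of the gold label recorded at the end of each epoch."""
    confidence = gold_probs.mean(axis=0)   # how confidently the model fits each example
    variability = gold_probs.std(axis=0)   # how much that confidence fluctuates
    return confidence, variability

def pick_ambiguous(variability: np.ndarray, k: int) -> np.ndarray:
    """Ambiguous pocket: the k examples with the highest variability,
    i.e. the ones the model keeps changing its mind about."""
    return np.argsort(variability)[-k:]

def make_prompt(seed_examples):
    """Few-shot prompt asking GPT-3 to continue the pattern with a new
    example; the exact format here is hypothetical."""
    shots = "\n\n".join(
        f"Premise: {premise}\nHypothesis: {hypothesis}"
        for premise, hypothesis in seed_examples
    )
    return shots + "\n\nPremise:"
```

The last stage of the pipeline, where human workers revise, relabel, or discard the generations, isn't sketched here.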