We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words.
When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
This started with a curiosity💡: why do all LLMs limit tokens to *parts* of whitespace-delimited words? After all, many word sequences (e.g. “by the way”) function as single units. Different languages can also express the same meaning in one or several words.
E.g. “math teacher” = “Mathelehrer” in German. At the extreme, Chinese *doesn’t use whitespace at all*, so its tokens can span many words — yet this has seemingly not hindered LMs like @deepseek_ai from learning it!
What can we gain from less restrictive tokenization? To find out, we developed SuperBPE🚀, which learns subword *and* superword tokens. SuperBPE dramatically improves encoding efficiency over BPE — at a fixed vocab size of 200k, SuperBPE reduces sequence length by 33% on average!
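To give a flavor of the core idea — this is a toy sketch with the 🤗 tokenizers library, *not* the actual SuperBPE training recipe — simply dropping the whitespace pre-tokenizer is enough to let BPE merges cross word boundaries:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny placeholder corpus, just for illustration.
corpus = ["by the way, the math teacher is on the way to class"] * 50

# Ordinary BPE: whitespace pre-tokenization keeps every token inside a word.
subword = Tokenizer(models.BPE(unk_token="[UNK]"))
subword.pre_tokenizer = pre_tokenizers.Whitespace()
subword.train_from_iterator(corpus, trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"]))

# Dropping the pre-tokenizer lets merges cross whitespace, yielding
# superword tokens such as "by the way".
superword = Tokenizer(models.BPE(unk_token="[UNK]"))
superword.train_from_iterator(corpus, trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"]))

print(subword.encode("by the way").tokens)
print(superword.encode("by the way").tokens)
```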
Then we pretrain 8B models from scratch with BPE and SuperBPE🚀, fixing everything about the training setup except the tokenizer. We see +4% on avg📈 across 30 downstream tasks, winning on 25 of the 30 individual tasks, while also being 27% more efficient at inference time.
Why does SuperBPE🚀 work? We find that loss is distributed more uniformly over tokens in SuperBPE models. They are less overfit to high-frequency, easy-to-predict tokens (e.g. “way” after “By the”), and at the same time master a much broader set of language phenomena.
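For the curious: here is a rough sketch (not the paper's analysis code) of how to inspect per-token losses with any HF causal LM — "gpt2" below is just a stand-in model, not one of the 8B checkpoints:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # stand-in model for illustration only
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

ids = tok("By the way, the math teacher is on the way.", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# Loss for predicting token t+1 from the prefix ending at token t.
losses = torch.nn.functional.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
for token, loss in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), losses.tolist()):
    print(f"{token!r}\t{loss:.2f}")
```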
SuperBPE🚀 is a seamless replacement for BPE in modern LM development pipelines, requiring no changes to the model architecture or training framework. You can use it in HF right now!
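Loading it is one line — the repo ID below is a placeholder, so check the project page for the exact released names:

```python
from transformers import AutoTokenizer

# Placeholder repo ID; substitute the released SuperBPE tokenizer's actual name.
tok = AutoTokenizer.from_pretrained("your-org/superbpe-200k")
print(tok.tokenize("By the way, this sentence is tokenized with superword tokens."))
```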
We introduce a new paradigm for dataset creation based on human 🧑‍💻 and machine 🤖 collaboration, which brings together the generative strength of LMs and the evaluative strength of humans. And we collect 🎉 WaNLI, a dataset of 108K NLI examples! 🧵
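If you just want the data, it loads like any HF dataset — the Hub ID below is assumed, so see the paper/repo for the canonical pointer:

```python
from datasets import load_dataset

# Hub dataset ID assumed here; check the paper/repo for the canonical location.
wanli = load_dataset("alisawuffles/WaNLI")
print(wanli["train"][0])  # fields include premise, hypothesis, and gold label
```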
Our pipeline starts with an existing dataset (MNLI), and uses data maps 📜 to automatically identify pockets of examples that demonstrate challenging 🧐 reasoning patterns relative to a trained model. Then we use GPT-3 to generate new examples likely to have the same pattern. 2/
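Roughly, data maps summarize training dynamics per example via confidence (mean gold-label probability across epochs) and variability (its std). A sketch of that bookkeeping — not our exact selection code — looks like this:

```python
import numpy as np

# gold_probs[i, e] = probability the model assigns to example i's gold label
# at the end of epoch e (placeholder data; use real training dynamics).
gold_probs = np.random.rand(10_000, 6)

confidence = gold_probs.mean(axis=1)   # how confidently the model gets it right
variability = gold_probs.std(axis=1)   # how much that confidence fluctuates

# One simple way to surface a "challenging" pocket: lowest-confidence examples.
# (The paper's actual selection criterion may differ.)
challenging_idx = np.argsort(confidence)[:1000]
```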
Next we propose a new metric, also inspired by data maps, to automatically filter generations for those most likely to aid model learning. Finally, we validate ✅ the generated examples through crowdworkers, who assign a gold label 🟡 and (optionally) revise for quality ✍️. 3/