Tokenizers have many drawbacks:
- Finite, fixed vocabulary - often can't process new/unseen languages
- Lack of robustness to missspeling and n o i s e
- Not learned "end-to-end"
- Giant vocabulary matrices in the multilingual setting
- Lots of technical debt in practice
(2/9)
Operating on the raw byte sequence used to represent text (e.g. UTF-8) solves many of the aforementioned issues. The main drawback: Sequence lengths tend to increase significantly compared to using token sequences.
(3/9)
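To make the length blow-up concrete, here's a minimal sketch (my own illustration, not from the paper) comparing a UTF-8 byte sequence to what a subword tokenizer would emit; the mT5 tokenizer part is commented out and assumed available via the `transformers` library:

```python
text = "Je préférerais ne pas tokeniser."  # any string, accented characters included

# ByT5-style input: one ID per UTF-8 byte (values 0-255, plus a few special IDs).
byte_ids = list(text.encode("utf-8"))
print(len(byte_ids))  # 34 bytes: 32 characters, and the two "é" take 2 bytes each

# A subword tokenizer (e.g. mT5's SentencePiece vocabulary) typically emits
# several times fewer IDs for the same string. Optional, assumes `transformers`:
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("google/mt5-small")
# print(len(tok(text).input_ids))
```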
Past work on byte-level models has therefore mainly focused on architectural innovations (convolutional front-ends, downsampling, etc.) to mitigate the increased computational cost that comes along with longer byte-level sequences.
(4/9)
With ByT5, we instead ask: What are the minimal changes to turn a token-to-token model (mT5) into a reasonably efficient byte-level model? Turns out it's basically
1. Make the encoder bigger and the decoder smaller
2. Use longer span mask lengths during (MLM) pre-training
(5/9)
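For concreteness, here's a toy version of change #2. This is my own simplified illustration of span corruption with long masked spans, not the actual T5/mT5/ByT5 preprocessing code:

```python
import random

def mask_one_span(tokens, span_length=20, sentinel="<extra_id_0>"):
    """Toy span corruption: remove one contiguous span, replace it with a
    sentinel in the input, and make the target the sentinel followed by the
    removed span. ByT5 masks spans with a mean length of ~20 *bytes*, vs. a
    mean of ~3 subword tokens for mT5, so a similar amount of text is hidden
    per span. (Simplified; the real pipeline masks ~15% of each sequence
    across multiple spans.)"""
    start = random.randrange(0, max(1, len(tokens) - span_length))
    inputs = tokens[:start] + [sentinel] + tokens[start + span_length:]
    target = [sentinel] + tokens[start:start + span_length]
    return inputs, target

chars = list("ByT5 reads and writes raw UTF-8 bytes, one step per byte.")
inp, tgt = mask_one_span(chars)
print("".join(inp))  # the text with a 20-character hole marked by the sentinel
print("".join(tgt))  # what the decoder learns to reconstruct
```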
The above changes result in a model that
- performs about as well as its token-level counterpart on "normal" tasks
- performs *dramatically* better on tasks dealing with pronunciation and noisy text
- is not dramatically slower, especially on tasks with short outputs
(6/9)
Some positive results to highlight:
- Much better on TweetQA
- Boosts on XTREME in gold + translate-train settings
- Noise like raNDOm CaSe (see the sketch below) causes ~1% performance degradation in ByT5 vs. ~25% for mT5
- ByT5-Small/Base/Large are about as fast as mT5 on XNLI
(7/9)
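For reference, the raNDOm CaSe perturbation is roughly this (an illustrative sketch; the exact noising setup in the paper may differ):

```python
import random

def random_case(text, p=0.5):
    # Flip each character to upper- or lowercase independently with
    # probability p; non-alphabetic characters are unaffected.
    return "".join(c.upper() if random.random() < p else c.lower() for c in text)

print(random_case("ByT5 barely notices this kind of noise."))
```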
On the negative side, the larger variants of ByT5 are quite a lot slower at inference time on tasks that require generating long outputs (for example, ByT5-XXL is ~7x slower than mT5-XXL on XSUM). Good avenues for future work!
Our first talk will be by Thomas Margoni, who will provide a legal perspective on the use of web data for training large language models. He'll touch on topics like copyright law, rights, and licenses as they pertain to training data for LMs. (2/14)
Then, @JesseDodge will give a talk on how to document datasets and improve reproducibility of research. He'll discuss the NLP reproducibility checklist, a recent study on documenting C4, and a framework for modeling bias in data. (3/14)
I've recently had a number of aspiring ML researchers ask me how to stay on top of the paper onslaught. Here are three concrete tips: 1) Pick a tiny subfield to focus on 2) Skim 3) Rely on your community
Thread to explain ⬇️ (1/5)
1) Pick a tiny subfield to focus on
It's impossible to stay on top of "all of ML". It's a gigantic and diverse field. Being an effective researcher requires laser-focusing on a subfield. Pick a problem that is important, excites you, and you feel you could make progress on. (2/5)
2) Skim
You'll find that many papers within your subfield of choice have a lot in common - there is often only a small nugget of novelty in each paper. It's incredibly important to develop your ability to find this nugget as quickly as possible. (3/5)
In case you missed our #neurips poster on MixMatch (arxiv.org/abs/1905.02249) today because you aren't in Vancouver or didn't survive the poster session stampede, here's the PDF: github.com/google-researc… and here's a transcript of what I said to everyone who came by: ⬇️ 1/11
The goal in semi-supervised learning (SSL) is to use unlabeled data to improve a model's performance. Many approaches do this by using the model to produce "label guesses" for unlabeled data, and then training the model to predict those guesses. 2/11
Two common ingredients for producing label guesses are consistency regularization ("When I perturb the input or model, the model's prediction shouldn't change.") and entropy minimization ("The model should output low-entropy/confident predictions on unlabeled data.") 3/11
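Here's a minimal sketch of those two ingredients (my own illustration; MixMatch itself combines them with MixUp and label sharpening, see the paper). `model` and `augment` are hypothetical stand-ins for a classifier returning logits and a stochastic data augmentation:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def unlabeled_losses(model, augment, x_unlabeled):
    p1 = softmax(model(augment(x_unlabeled)))  # prediction on one augmentation
    p2 = softmax(model(augment(x_unlabeled)))  # prediction on another augmentation

    # Consistency regularization: predictions on two perturbed versions of the
    # same input should agree (here, squared L2 distance between them).
    consistency = np.mean(np.sum((p1 - p2) ** 2, axis=-1))

    # Entropy minimization: predictions on unlabeled data should be confident,
    # i.e. have low entropy.
    entropy = np.mean(-np.sum(p1 * np.log(p1 + 1e-8), axis=-1))

    return consistency, entropy
```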
New paper! We perform a systematic study of transfer learning for NLP using a unified text-to-text model, then push the limits to achieve SoTA on GLUE, SuperGLUE, CNN/DM, and SQuAD.
Paper: arxiv.org/abs/1910.10683
Code/models/data/etc: git.io/Je0cZ
Summary ⬇️ (1/14)
Our approach casts *every* language problem as a text-to-text task. For example, English-to-German translation -- input: "translate English to German: That is good." target: "Das ist gut." or sentiment ID -- input: "sentiment: This movie is terrible!", target: "negative" (2/14)
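In code, each task just becomes (input string, target string) pairs. A rough sketch (illustrative; the actual task prefixes and preprocessing live in the released codebase linked above):

```python
examples = [
    {"input": "translate English to German: That is good.",
     "target": "Das ist gut."},
    {"input": "sentiment: This movie is terrible!",
     "target": "negative"},
    {"input": "summarize: <article text here>",
     "target": "<summary here>"},
]

# Every task then reduces to the same training step: feed `input` to the
# encoder and train the decoder to emit `target` with a cross-entropy loss.
for ex in examples:
    print(ex["input"], "->", ex["target"])
```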
The text-to-text approach allows us to use the same model, loss function, decoding process, training procedure, etc. across every task we study. It also provides a standard testbed for the many ideas we evaluate in our empirical survey. (3/14)
If you are reeling from a NeurIPS rejection or stressing about an ICLR submission, remember that some of the best papers were never published anywhere except arxiv. Thread of a few favorites (1/5):
"Generating Sequences with RNNs" by Graves arxiv.org/abs/1308.0850 This paper blew my mind when it came out, showing that it was possible to generate plausible text and handwriting with RNNs. Includes the predecessors of attention, Adam, etc... (2/5)
WaveNet by van den Oord et al. arxiv.org/abs/1609.03499 Until this came out I don't think most of us expected that we'd be able to generate raw waveforms with deep networks anytime soon. The results were surprisingly good and the architecture remains influential. (3/5)