How can you successfully train transformers on small datasets like PTB and WikiText-2? Are LSTMs better on small datasets? I ran 339 experiments worth 568 GPU hours and came up with some answers. I do not have time to write a blog post, so here is a Twitter thread instead. 1/n
To give a bit of background: all of this came about from my past frustration with replicating Transformer-XL results on PTB and getting very poor results on WikiText-2 (WT2). On WT2, my best model after 200+ experiments was around 90 ppl, which is far from standard LSTM baselines (65.8 ppl).
Some friends told me that they also tried and failed to replicate the Transformer-XL results. We did not get further information from the authors, so we gave up on the replication. Then I replied to a tweet about these results, and it ended in this:
One of the authors graciously responded with their private code, and from there it was easy to figure out the puzzle. It turns out that you can easily train transformers on small datasets when you use tricks (and have the patience to train for a very long time).
The key insight is the following: in the small-dataset regime, it is all about data augmentation. The analog in computer vision is that you get much better results, particularly on small datasets, if you do certain data augmentations. This also regularizes the model.
The most dramatic performance gain comes from discrete embedding dropout: you embed as usual, but now, with probability p, you zero the entire word vector. This is akin to masked language modeling, but the goal is not to predict the mask; it is just regular language modeling with uncertain context.
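To make the idea concrete, here is a minimal PyTorch sketch of discrete embedding dropout. The class name, the per-token masking, and the inverted-dropout rescaling are my own choices here, not the exact Transformer-XL patch:

```python
import torch
import torch.nn as nn

class DiscreteEmbeddingDropout(nn.Module):
    """Zero entire word vectors with probability p during training (sketch)."""

    def __init__(self, embedding: nn.Embedding, p: float = 0.1):
        super().__init__()
        self.embedding = embedding
        self.p = p

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(tokens)              # (batch, seq, dim)
        if self.training and self.p > 0:
            # One keep/drop decision per token, broadcast over the embedding dim.
            keep = torch.bernoulli(
                torch.full(tokens.shape, 1 - self.p,
                           dtype=embedded.dtype, device=embedded.device)
            ).unsqueeze(-1)
            embedded = embedded * keep / (1 - self.p)  # inverted-dropout rescaling
        return embedded
```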
The second most important factor is regular input dropout: you take the embeddings and drop out elements with probability p. This also has a data augmentation effect, very similar to dropping out random pixels in images. What is a good way to think about this? 1/2
Remember that we can do King - man + woman = Queen? Now imagine input dropout removes the "man" component of "King". This forces the model to distribute specific information (gender in this case) across multiple dimensions, which improves generalization and makes the model more robust. 2/2
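Regular input dropout is just standard element-wise dropout applied to the embedded input. A rough sketch of how you would wire it in; the sizes and the dropout rate below are placeholders, not the tuned values:

```python
import torch
import torch.nn as nn

# Toy sizes; the real values come from the Transformer-XL config.
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=512)
input_dropout = nn.Dropout(p=0.4)          # placeholder rate, not the tuned value

tokens = torch.randint(0, 10000, (8, 64))  # (batch, seq)
x = input_dropout(embedding(tokens))       # zeroes individual embedding dimensions
```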
Otherwise, it is a game of further regularization (more dropout + weight decay) and of patience. I can train a good model without these tricks in 15 minutes and get 97 ppl. With all these dropouts the model is still underfitting after 7 hours of training but reaches 63.4 ppl (better than the LSTM).
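The rest of the recipe is just hyperparameters: more dropout inside the model and weight decay passed to the optimizer. A purely illustrative sketch, where `model` is a toy stand-in rather than the Transformer-XL architecture and all values are placeholders:

```python
import torch
import torch.nn as nn

# Toy stand-in model: embedding + input dropout + one transformer layer + output head.
model = nn.Sequential(
    nn.Embedding(10000, 512),
    nn.Dropout(p=0.4),                                          # input dropout
    nn.TransformerEncoderLayer(d_model=512, nhead=8,
                               dropout=0.3, batch_first=True),  # internal dropout
    nn.Linear(512, 10000),
)

# Weight decay is just another optimizer knob; the rest is patience.
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4, weight_decay=1e-4)
```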
You can also apply these data augmentation recipes to large datasets, but nobody would want to train for months on WT-103 for a couple of ppl points. In my opinion, techniques that require so much extra compute are more harmful to the community than useful. 1/2
It is misleading to report slightly better perplexity at the cost of 28x longer training while omitting how much extra computation is used. So if you see discrete embedding/input dropout without reported training times, that should be a big red flag! 2/2
Here are the code changes to the public Transformer-XL repo that my results are based on: github.com/TimDettmers/tr…
With my changes to the public Transformer-XL repo, you can run this script to get down to 63.4 ppl on WT2: github.com/TimDettmers/tr…
I did not have time to make any graphs etc., but if you are interested in the sensitivity of some of these dropout parameters, please get in touch. I have some more insights and grid search result data that I can share.