How can you successfully train transformers on small datasets like PTB and WikiText-2? Are LSTMs better on small datasets? I ran 339 experiments worth 568 GPU hours and came up with some answers. I do not have time to write a blog post, so here is a Twitter thread instead. 1/n
To give a bit of background: all of this came about from my past frustration with replicating Transformer-XL results on PTB and getting very poor results on WikiText-2 (WT2). On WT2, my best model after 200+ experiments was around 90 ppl, which is far from standard LSTM baselines (65.8 ppl).
Some friends told me that they also tried and failed to replicate the Transformer-XL results. We did not get further information from the authors, so we gave up on the replication. Then I replied to a tweet about these results, and it ended in this:
One of the authors graciously responded with their private code, and from there it was easy to figure out the puzzle. It turns out that you can easily train transformers on small datasets when you use tricks (and have the patience to train for a very long time).
The key insight is the following: in the small-dataset regime, it is all about data augmentation. The analog in computer vision is that you get much better results, particularly on small datasets, if you do certain data augmentations. This also regularizes the model.
The most dramatic performance gain comes from discrete embedding dropout: you embed as usual, but now, with probability p, you zero the entire word vector. This is akin to masked language modeling, but the goal is not to predict the mask; it is just regular language modeling with uncertain context.
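To make the idea concrete, here is a minimal PyTorch sketch of discrete embedding dropout. The class name, the per-token masking, and the inverted-dropout rescaling are my own choices here, not the exact Transformer-XL patch:

```python
import torch
import torch.nn as nn

class DiscreteEmbeddingDropout(nn.Module):
    """Zero entire word vectors with probability p during training (sketch)."""

    def __init__(self, embedding: nn.Embedding, p: float = 0.1):
        super().__init__()
        self.embedding = embedding
        self.p = p

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(tokens)              # (batch, seq, dim)
        if self.training and self.p > 0:
            # One keep/drop decision per token, broadcast over the embedding dim.
            keep = torch.bernoulli(
                torch.full(tokens.shape, 1 - self.p,
                           dtype=embedded.dtype, device=embedded.device)
            ).unsqueeze(-1)
            embedded = embedded * keep / (1 - self.p)  # inverted-dropout rescaling
        return embedded
```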
The second most important factor is regular input dropout: you take the embeddings and drop out elements with probability p. This also has a data augmentation effect, very similar to dropping out random pixels in images. What is a good way to think about this? 1/2
Remember that we can do King - man + woman = Queen? Now imagine input dropout removes the "man" component of "King". This forces the model to distribute specific information (gender in this case) across multiple dimensions, which improves generalization and makes the model more robust. 2/2
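Regular input dropout is just standard element-wise dropout applied to the embedded input. A rough sketch of how you would wire it in; the sizes and the dropout rate below are placeholders, not the tuned values:

```python
import torch
import torch.nn as nn

# Toy sizes; the real values come from the Transformer-XL config.
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=512)
input_dropout = nn.Dropout(p=0.4)          # placeholder rate, not the tuned value

tokens = torch.randint(0, 10000, (8, 64))  # (batch, seq)
x = input_dropout(embedding(tokens))       # zeroes individual embedding dimensions
```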
Otherwise, it is a game of further regularization (more dropout + weight decay) and of patience. I can train a good model without these tricks in 15 minutes and get 97 ppl. With all these dropouts the model is still underfitting after 7 hours of training but reaches 63.4 ppl (better than the LSTM).
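The rest of the recipe is just hyperparameters: more dropout inside the model and weight decay passed to the optimizer. A purely illustrative sketch, where `model` is a toy stand-in rather than the Transformer-XL architecture and all values are placeholders:

```python
import torch
import torch.nn as nn

# Toy stand-in model: embedding + input dropout + one transformer layer + output head.
model = nn.Sequential(
    nn.Embedding(10000, 512),
    nn.Dropout(p=0.4),                                          # input dropout
    nn.TransformerEncoderLayer(d_model=512, nhead=8,
                               dropout=0.3, batch_first=True),  # internal dropout
    nn.Linear(512, 10000),
)

# Weight decay is just another optimizer knob; the rest is patience.
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4, weight_decay=1e-4)
```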
You can also apply these data augmentation recipes to large datasets, but nobody would want to train for months on WT-103 for a couple of ppl points. In my opinion, techniques that require so much extra compute are more harmful to the community than useful. 1/2
It is misleading to report slightly better perplexity at the cost of 28x longer training while omitting how much extra computation is used. So if you see discrete embedding/input dropout without reported training times, that should be a big red flag! 2/2
Here are the code changes to the public Transformer-XL repo that my results are based on: github.com/TimDettmers/tr…
With my changes to the public Transformer-XL repo, you can run this script to get down to 63.4 ppl on WT2: github.com/TimDettmers/tr…
I did not have time to make any graphs etc., but if you are interested in the sensitivity of some of these dropout parameters, please get in touch. I have some more insights and grid search result data that I can share.