ASAPP
25 Feb, 21 tweets, 9 min read
The size of #NLP models has increased enormously, growing to millions or even billions of parameters, along with a significant increase in financial cost and carbon emissions.
The cost associated with training large models limits the #AIresearch community's ability to innovate, because a research project often needs a lot of experimentation.
Consider training a top-performing #LanguageModel on the Billion Word benchmark. A single experiment would take 384 GPU-days (6 days × 64 V100 #GPUs), or as much as $36,000 using AWS on-demand instances.
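As a rough sanity check on those numbers (a back-of-the-envelope sketch; the per-GPU hourly rate is an assumption chosen to match the quoted figure, not an official AWS price):

    # Back-of-the-envelope check of the quoted training cost.
    gpus = 64                  # V100 GPUs
    days = 6                   # wall-clock training time
    gpu_days = gpus * days     # 384 GPU-days, as quoted above
    usd_per_gpu_hour = 3.9     # assumed effective on-demand rate per V100-hour
    cost = gpu_days * 24 * usd_per_gpu_hour
    print(gpu_days, round(cost, -3))   # 384 GPU-days, roughly $36,000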
The high cost of building such #NLP models hinders their use in real-world business and makes monetizing #AI and NLP technologies more difficult.
The increasing #computation time and cost highlight the importance of inventing computationally efficient models that retain top #modeling power with reduced or accelerated computation.
The #Transformer architecture was proposed to accelerate model training in #NLP. It is built entirely on self-attention and avoids the use of recurrence. The rationale, as stated in the original work, is to enable strong parallelization and utilize the full power of GPUs and TPUs.
I should have mentioned there's a new paper out on #arXiv 'When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute.' arxiv.org/abs/2102.12459
The attention mechanism is a powerful component that permits efficient modeling of variable-length inputs. These advantages have made #Transformer an expressive and efficient unit for #NLP.
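For readers who want to see the mechanism concretely, here is a minimal single-head scaled dot-product self-attention sketch in PyTorch (a generic illustration, not ASAPP's code); every position is handled in one batched matrix product, which is where the parallelism comes from:

    import torch
    import torch.nn.functional as F

    def self_attention(x, wq, wk, wv):
        # x: (seq_len, d_model); wq/wk/wv: (d_model, d_head) projection matrices.
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.t() / k.shape[-1] ** 0.5   # all pairs of positions scored in parallel
        return F.softmax(scores, dim=-1) @ v      # attention-weighted sum of values

    x = torch.randn(5, 16)                        # toy sequence of 5 positions
    w = torch.randn(16, 16)
    out = self_attention(x, w, w, w)              # shape (5, 16)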
Two interesting questions arise following the development of #Transformer:
Is attention all we need for #modeling?
If recurrence is not a compute bottleneck, can we find better architectures?
We present #SRU++ as a possible answer to the above. github.com/asappresearch/…
Previous works have tackled the parallelization/speed problem of RNNs and proposed various fast recurrent networks, such as the Quasi-RNN and the Simple Recurrent Unit (SRU), both highly parallelizable RNNs. These advances eliminate the need to eschew recurrence in exchange for training efficiency.
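To give a flavor of why such recurrences train fast, here is a simplified sketch in the spirit of SRU (not the exact published equations): the expensive matrix multiplications depend only on the input and are batched across all timesteps, leaving only cheap element-wise gating in the sequential loop.

    import torch

    def fast_recurrence(x, w, w_f, b_f):
        # x: (seq_len, d_in); w, w_f: (d_in, d); b_f: (d,). Simplified, illustrative only.
        u = x @ w                          # bulk transform, parallel over all timesteps
        f = torch.sigmoid(x @ w_f + b_f)   # forget gates, also parallel over all timesteps
        c = torch.zeros(u.shape[1])
        states = []
        for t in range(x.shape[0]):        # the sequential part is element-wise only
            c = f[t] * c + (1 - f[t]) * u[t]
            states.append(c)
        return torch.stack(states)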
Recent works have achieved strong results by using recurrence in conjunction with self-attention. @Smerity demonstrated that a single-headed attention LSTM (SHA-LSTM) is sufficient to achieve competitive results on character-level language modeling while requiring less training time.
In addition, RNNs have been incorporated into #Transformer architectures, yielding better results on #MachineTranslation and natural language understanding tasks. These results suggest that recurrence and attention are complementary for sequence modeling.
In light of this previous research, we enhance the modeling capacity of SRU by incorporating self-attention as part of the architecture. [Figure: an illustration of SRU and SRU++]
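Conceptually, attention provides a contextualized input transformation that the fast recurrence then mixes over time. The sketch below conveys that combination using standard PyTorch modules; it is an illustration of the attention-plus-fast-recurrence idea, not the actual SRU++ layer (see the paper and repo for that).

    import torch
    import torch.nn as nn

    class AttentionThenFastRecurrence(nn.Module):
        # Illustrative only: attention replaces the plain linear input map,
        # and a cheap element-wise recurrence mixes the result across time.
        def __init__(self, d, nhead=1):
            super().__init__()
            self.attn = nn.MultiheadAttention(d, nhead)
            self.w_f = nn.Linear(d, d)                 # forget-gate projection

        def forward(self, x):                          # x: (seq_len, batch, d)
            u, _ = self.attn(x, x, x)                  # contextualized inputs, parallel over time
            f = torch.sigmoid(self.w_f(x))             # gates, parallel over time
            c = torch.zeros_like(x[0])
            out = []
            for t in range(x.shape[0]):                # sequential part is element-wise only
                c = f[t] * c + (1 - f[t]) * u[t]
                out.append(c)
            return torch.stack(out)

    layer = AttentionThenFastRecurrence(d=64)
    y = layer(torch.randn(10, 2, 64))                  # shape (10, 2, 64)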
We evaluate SRU++ on several language modeling benchmarks, such as the Enwik8 dataset. Compared to Transformer models such as #Transformer-XL, SRU++ can achieve similar results using only a fraction of the resources.
We compare the training efficiency between the two with directly comparable training settings.
SRU++ is 8.7x more efficient in surpassing the dev result of Transformer-XL, and 5.1x more efficient in reaching a BPC (bits-per-character) of 1.17. [Figure: dev BPC on the Enwik8 dataset]
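For readers less familiar with the metric: BPC is character-level cross-entropy expressed in bits, so a loss measured in nats converts by dividing by ln 2 (the 0.811 below is just an example value chosen to land at 1.17 BPC).

    import math

    def nats_to_bpc(loss_nats):
        # Average per-character cross-entropy in nats -> bits-per-character.
        return loss_nats / math.log(2)

    print(round(nats_to_bpc(0.811), 2))   # ~1.17 BPC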
We further compare the training cost of SRU++ with the reported costs of leading Transformer-based models on the Enwik8 and Wiki-103 datasets. Our model achieves over a 10x cost reduction while still outperforming the baseline models on test perplexity or BPC. [Table: comparison of reported training costs]
SRU++ is a highly efficient neural architecture, and little attention is needed given recurrence. Similar to @Smerity's observation, we found that a couple of attention layers is sufficient for state-of-the-art results. [Figure: analysis of using attention in only a subset of layers]
We present a recurrent architecture with optional built-in self-attention that achieves leading model capacity and training efficiency. We demonstrate that highly expressive and efficient models can be derived using a combination of attention and fast recurrence.
Our results reaffirm the empirical observations that attention is not all we need and can be complemented by other sequential modeling modules.
For further reading, ASAPP also conducts research to reduce the cost of model inference. See our published work on model distillation: aclweb.org/anthology/2020… and pruning: aclweb.org/anthology/2020…
