Excited to announce our #EMNLP2021 paper that shows how to turn a pre-trained language model or even a randomly initialized model into a strong few-shot learner.
Despite their strong performance on many tasks, large-scale pre-trained language models do not perform as well when limited labeled data is available (e.g., on small datasets or in few-shot settings). Collecting more labeled data can help but can also be prohibitively expensive.
We propose STraTA, which stands for Self-Training with Task Augmentation, an approach that combines two complementary methods, task augmentation and self-training, to effectively leverage task-specific unlabeled data, which is comparatively cheaper to obtain.
STraTA starts with task augmentation, which uses unlabeled texts from the target domain to synthesize a large amount of in-domain training data for an auxiliary task (natural language inference); this synthetic data is then used for intermediate fine-tuning (see the figure below).
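For intuition, here is a minimal Python sketch of the task augmentation step. The `generate_hypothesis` callable is a hypothetical stand-in for a trained NLI data generation model, so treat this as an illustration of the data flow rather than the released implementation:

```python
from typing import Callable, Dict, Iterable, List

# The three relations used by the auxiliary NLI task.
NLI_LABELS = ("entailment", "neutral", "contradiction")

def synthesize_nli_data(
    unlabeled_sentences: Iterable[str],
    generate_hypothesis: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Turn in-domain unlabeled sentences into synthetic NLI training examples.

    `generate_hypothesis(premise, label)` is a hypothetical wrapper around a
    data generation model that writes a hypothesis with the requested relation
    to the premise; it is a placeholder, not the released STraTA code.
    """
    synthetic = []
    for premise in unlabeled_sentences:
        for label in NLI_LABELS:
            hypothesis = generate_hypothesis(premise, label)
            synthetic.append(
                {"premise": premise, "hypothesis": hypothesis, "label": label}
            )
    return synthetic

# The synthetic NLI examples are then used for intermediate fine-tuning of the
# pre-trained (or randomly initialized) model before target-task fine-tuning.
```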
We show that task augmentation alone can significantly improve downstream performance across different tasks, generally outperforming other competing fine-tuning approaches in both high- and low-data regimes.
STraTA further uses the auxiliary-task model created by task augmentation as the base model for self-training: the model is fine-tuned on the available labeled data for the target task and then used to infer predictions (pseudo-labels) on unlabeled data for subsequent rounds of training.
Our experiments reveal that using a strong base model and training on a broad distribution of pseudo-labeled data are key factors for successful self-training, which we hope will enable the wider adoption of self-training in NLP.
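Conceptually, the self-training loop looks roughly like the sketch below. `fine_tune` and `predict` are hypothetical helpers standing in for ordinary training and inference code, and details such as restarting from the base model each round and keeping all pseudo-labels (threshold = 0) are illustrative assumptions, not the exact released recipe:

```python
from typing import Callable, List, Sequence, Tuple

def self_train(
    base_model,
    labeled: Sequence[Tuple[str, int]],
    unlabeled: Sequence[str],
    fine_tune: Callable,   # hypothetical: fine-tunes a copy of a model on (text, label) pairs
    predict: Callable,     # hypothetical: returns a (label, confidence) pair per input text
    rounds: int = 5,
    threshold: float = 0.0,
):
    """Iterative self-training on target-task unlabeled data (illustrative sketch only)."""
    # Start from the auxiliary-task (NLI) model produced by task augmentation.
    model = fine_tune(base_model, list(labeled))
    for _ in range(rounds):
        # Pseudo-label the unlabeled in-domain texts with the current model.
        predictions = predict(model, unlabeled)
        pseudo: List[Tuple[str, int]] = [
            (text, label)
            for text, (label, confidence) in zip(unlabeled, predictions)
            if confidence >= threshold
        ]
        # Re-train from the strong base model on labeled + pseudo-labeled data.
        # A low threshold keeps a broad distribution of pseudo-labeled examples
        # rather than only the most confident ones.
        model = fine_tune(base_model, list(labeled) + pseudo)
    return model
```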
With STraTA, we are able to substantially improve sample efficiency across 12 NLP benchmark datasets. Remarkably, when given only 8 labeled examples per class from the SST-2 sentiment dataset, our approach is competitive with standard fine-tuning on all 67K labeled examples.
Other interesting results:
1) randomly initialized model + STraTA outperforms BERT_BASE by a large margin on SST-2 while being competitive on SciTail.
2) BERT_BASE + STraTA substantially outperforms BERT_LARGE on both SST-2 and SciTail.
Transfer learning with large-scale pre-trained language models has become the de-facto standard for state-of-the-art performance on many NLP tasks. Can fine-tuning these models on source tasks other than language modeling further improve target task performance? 🤔
The answer is yes, as shown by Phang et al. (2018), but the conditions for successful transfer remain opaque. Which combinations of source and target tasks transfer well? 🤔 An arbitrary combination often hurts target task performance (Wang et al., 2019).