The pretrain-then-finetune paradigm is a staple of transfer learning, but is it always the right way to use auxiliary tasks? In our #ICLR2022 paper openreview.net/forum?id=2bO2x…, we show that in settings where the end-task is known in advance, we can do better.
[1/n]
@gneubig @atalwalkar @pmichelX TL;DR: instead of decoupled pretrain-then-finetune, we multitask the end-task with the auxiliary objectives, using meta-learning to determine the end-task and auxiliary task weights. Our approach improves performance and data-efficiency in low-resource settings.
[2/n]
Consider this common scenario. You have an end-task, E (say, sentiment classification on movie reviews), and a set of auxiliary objectives, A (e.g. MLM on generic or task data), that you believe can improve results on E.
[3/n]
The common approach (e.g. arxiv.org/abs/2004.10964, arxiv.org/abs/2101.11038) is to further pre-train a large language model on A, before fine-tuning on E.
[4/n]
The pretrain-then-finetune approach can be written out as below. Pre-training is end-task agnostic: Equation 1 finishes entirely before training on the end-task begins (Equation 2), and it does not explicitly incorporate the end-task objective. What are the consequences of this?
[5/n]
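The two decoupled stages can be sketched numerically. A toy scalar sketch, assuming stand-in quadratic losses; `grad_descent`, `aux_grad`, `end_grad`, and all values here are illustrative, not the paper's objectives:

```python
# Toy sketch of decoupled pretrain-then-finetune (Equations 1 and 2).
# Quadratic stand-in losses; gradients are given in closed form.

def grad_descent(loss_grad, theta, lr=0.1, steps=100):
    """Plain gradient descent from theta; returns the final parameter."""
    for _ in range(steps):
        theta = theta - lr * loss_grad(theta)
    return theta

# Stage 1 (Eq. 1): minimize only the auxiliary loss; the end-task never appears.
aux_grad = lambda th: 2 * (th - 1.0)    # auxiliary loss minimized at theta = 1.0
theta_pre = grad_descent(aux_grad, theta=0.0)

# Stage 2 (Eq. 2): fine-tune from theta_pre on the end-task loss alone.
end_grad = lambda th: 2 * (th - 3.0)    # end-task loss minimized at theta = 3.0
theta_final = grad_descent(end_grad, theta_pre)
# theta_pre lands near 1.0, theta_final near 3.0: stage 1 never saw E.
```

The point of the sketch is the decoupling itself: nothing in stage 1 depends on E.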
What if the set of auxiliary objectives contains candidates that may be harmful to E? Since Equations 1 and 2 are decoupled, how do we perform hyperparameter optimization for stage 1? How much auxiliary data do we need for Equation 1? Our approach addresses these questions.
[6/n]
We introduce the end-task, E, into what would otherwise be the pre-training step (Equation 3). This creates a tight coupling and allows explicit interactions between A and E. We use scalar weights, w, to modulate between E and the tasks in the auxiliary set A.
[7/n]
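The coupled objective can be sketched the same way: a single stage that minimizes one weighted mixture of E and the auxiliary tasks. Again a toy scalar sketch with made-up quadratic losses and weights, not the paper's Equation 3 verbatim:

```python
# Toy sketch of the coupled objective: minimize
#   w_E * L_E(theta) + sum_a w_a * L_a(theta)
# in a single stage, instead of two decoupled ones.

def multitask_grad(theta, w_end, w_aux, end_grad, aux_grads):
    """Gradient of the weighted sum of end-task and auxiliary losses."""
    g = w_end * end_grad(theta)
    for w_a, g_a in zip(w_aux, aux_grads):
        g += w_a * g_a(theta)
    return g

end_grad = lambda th: 2 * (th - 3.0)     # end-task E, minimum at theta = 3.0
aux_grads = [lambda th: 2 * (th - 1.0)]  # one auxiliary task, minimum at theta = 1.0

theta = 0.0
for _ in range(200):
    theta -= 0.05 * multitask_grad(theta, 0.8, [0.2], end_grad, aux_grads)
# With w = (0.8, 0.2), training converges to the weighted minimum 0.8*3 + 0.2*1 = 2.6.
```

Unlike the decoupled sketch, E shapes every update from step one, and the weights w decide how much each auxiliary task pulls the solution around.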
Our approach, which we dub TARTAN (end-TAsk awaRe TrAiNing), comes in two flavors: the compute-friendly multitask version MT-TARTAN, where the weights w are statically predefined, and META-TARTAN, where we meta-learn w to prioritise E.
[8/n]
Meta-learning the weights w allows us to upweight tasks in A that are useful to E and down-weight those that are harmful to E. This happens adaptively throughout training, capturing the varying importance of different objectives at different stages of training.
[9/n]
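A toy sketch of the intuition only (not META-TARTAN's actual meta-learning algorithm): nudge w_a up when an auxiliary task's gradient agrees with the end-task's validation gradient, and down when they conflict. The function name, meta learning rate, and scalar gradients are all illustrative:

```python
def update_task_weight(w_a, aux_grad, end_val_grad, meta_lr=0.1):
    """Nudge w_a up when the auxiliary gradient points the same way as the
    end-task (validation) gradient, down when they conflict.
    Scalar toy version; real models would use dot products of gradient vectors."""
    alignment = aux_grad * end_val_grad  # > 0: helpful step, < 0: harmful step
    return w_a + meta_lr * alignment

# A helpful auxiliary task (gradients agree) gets upweighted...
w_helpful = update_task_weight(1.0, aux_grad=0.5, end_val_grad=0.4)
# ...while a harmful one (gradients oppose) gets downweighted.
w_harmful = update_task_weight(1.0, aux_grad=-0.5, end_val_grad=0.4)
```

Because the update runs at every training step, the same auxiliary task can be upweighted early and downweighted late, which is exactly the variable importance described above.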
Traditionally, we pre-train for as long as possible on as much data as possible, which is a significant compute and resource burden. TARTAN can stop early when validation performance on E plateaus, saving compute and improving data-efficiency.
[10/n]
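The stopping rule itself is ordinary patience-based early stopping, possible here only because E's validation score is available during (what would otherwise be) pre-training. A sketch; `train_with_early_stopping`, the patience value, and the scores are illustrative, not from the paper:

```python
def train_with_early_stopping(val_scores, patience=2):
    """Return the step at which training stops: the first step where the
    end-task validation score has failed to improve `patience` times in a row."""
    best, since_best = float("-inf"), 0
    for step, score in enumerate(val_scores):
        if score > best:
            best, since_best = score, 0
        else:
            since_best += 1
            if since_best >= patience:
                return step  # stop here; the remaining steps are skipped
    return len(val_scores) - 1

# Validation accuracy plateaus after step 2, so training stops at step 4
# instead of running through all seven evaluations.
stop = train_with_early_stopping([0.60, 0.70, 0.74, 0.74, 0.73, 0.73, 0.72])
```

In end-task-agnostic pre-training there is no such signal to monitor, so the loop simply runs for as long as the budget allows.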
How well does this perform in practice? We investigate low-resource tasks. Using task-only data to construct the auxiliary objectives A, we consistently observe >2% improvements over baselines. All models are evaluated under continued training of RoBERTa.
[11/n]
We also observe gains in data-efficiency when using TARTAN: for the same amount of domain data, TARTAN gives better performance. It either outperforms or comes close to pretrain-then-finetune approaches while using much less extra data.
[12/n]
To summarise: when we know our end-task in advance, multitasking it with auxiliary objectives is a viable alternative to the pretrain-then-finetune approach. For more details, check out the full paper: openreview.net/forum?id=2bO2x…. Code coming soon.
[13/n]
Finally, thanks to my amazing co-authors @pmichelX, @atalwalkar and @gneubig!
[14/n]
PS: as is common in research, others also explored similar ideas concurrently. Feel free to explore these works as well:
arxiv.org/abs/2111.04130
arxiv.org/abs/2105.14095 @ShuxiaoC
arxiv.org/abs/2111.01754 @RaghuAniruddh, @jonLorraine9
[15/n]
