Fine-tuning all the parameters of large pre-trained models works well and is behind many SotA NLP results right now, but it has some sharp edges. The sheer size makes these models difficult to work with and serve, and every fine-tuning run produces a full, task-specific fork of the model. (2/7)
Prompt Tuning, learning a small set of parameters that are prepended to the embedded input, can eliminate these problems. Freezing pre-trained models enables mixed-task batching and efficient ensembling, without the need for multiple copies. (3/7)
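In code, the core idea looks roughly like this. This is a minimal PyTorch-style sketch, not our actual implementation; the class name, prompt length, and wiring into the frozen model are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Learnable prompt vectors prepended to the embedded input (illustrative sketch)."""

    def __init__(self, prompt_length: int, embed_dim: int):
        super().__init__()
        # The only trainable parameters: prompt_length x embed_dim.
        self.prompt = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.5)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) from the frozen model's embedding layer.
        batch_size = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        # Prepend the learned prompt to the embedded input along the sequence axis.
        return torch.cat([prompt, input_embeds], dim=1)


# Hypothetical usage with a frozen pre-trained model (names are placeholders):
#   frozen_model.requires_grad_(False)
#   soft_prompt = SoftPrompt(prompt_length=100, embed_dim=frozen_model.config.d_model)
#   embeds = frozen_model.embed_tokens(input_ids)            # embed tokens as usual
#   outputs = frozen_model(inputs_embeds=soft_prompt(embeds))
# Only soft_prompt.parameters() go to the optimizer; the pre-trained weights stay shared
# across tasks, which is what makes mixed-task batching and cheap ensembling possible.
```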
The size of the pre-trained model is critical to Prompt Tuning performance. As we scale T5 from Small to XXL, we see Prompt Tuning close the gap with full fine-tuning. (4/7)
By keeping the pre-trained model frozen, Prompt Tuning avoids overfitting to a specific task, improving performance on domain-shift problems. See our paper for details, as well as a comparison with other recent “P*-Tuning” approaches. (5/7)
An interesting quirk of Prompt Tuning is that the hyperparameters look unusual. For example, our learning rate is 0.3, roughly 300× the default T5 fine-tuning learning rate of 0.001. (6/7)
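Concretely, the training setup only ever optimizes the prompt parameters, and at that much larger learning rate. A tiny illustrative sketch (the prompt shape and the choice of SGD here are assumptions, not our actual training configuration):

```python
import torch

# Only the soft prompt is trainable: 100 prompt tokens x d_model (sizes illustrative).
prompt = torch.nn.Parameter(torch.randn(100, 1024) * 0.5)

# The optimizer sees just the prompt, with lr=0.3 rather than the usual 0.001.
optimizer = torch.optim.SGD([prompt], lr=0.3)
```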
A huge shout out to my amazing mentors, @noahconst and @aboSamoor, who were a big part of making this project possible. (7/7)