TLDR: forecasters didn't do well. But to some degree, this is because progress was "surprising".
4/12
Hypermind SotA forecasters revised their estimates upwards earlier this summer. But not massively.
@JacobSteinhardt flagged in July '22 that Hypermind forecasts seemed low.
Metaculus forecasts & Steinhardt's are higher.
ML researchers: contribute to future forecasts!
5/12
@Hwchung et al. explore a regime that uses a large number of instructions: ~1,800 tasks in total.
Models from small (80 million params) to v. big (540 billion params) are studied.
Interestingly, finetuning is relatively cheap (at most 1.6% of pretraining compute).
6/12
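The compute claim above checks out on a napkin. A minimal sketch, assuming the usual ~6 · params · tokens FLOPs rule of thumb and an illustrative finetuning token count (not a figure from the paper):

```python
# Back-of-envelope: why instruction finetuning is cheap vs. pretraining.
# Uses the common ~6 * params * tokens FLOPs rule of thumb for training.
# The finetuning token count is an illustrative assumption.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

pretrain = train_flops(540e9, 780e9)  # PaLM 540B pretrained on ~780B tokens
finetune = train_flops(540e9, 1.4e9)  # assume ~1.4B finetuning tokens
print(f"finetuning is {finetune / pretrain:.2%} of pretraining compute")
```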
Increasing model size continues to yield major gains.
Increasing the number of tasks helps, but brings diminishing returns.
7/12
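To get a feel for "diminishing returns", here's a toy curve; the functional form and all constants are invented for illustration, not fit to the paper's results:

```python
# Toy illustration of diminishing returns from adding finetuning tasks:
# score modeled as a saturating (logarithmic) curve. Constants are made up.
import math

def toy_score(n_tasks: int, base: float = 40.0, gain: float = 4.0) -> float:
    return base + gain * math.log1p(n_tasks)

for n in (10, 100, 500, 1000, 1800):  # illustrative task counts
    print(f"{n:5d} tasks -> score {toy_score(n):.1f}")
# Each step adds far fewer points than the last, despite adding more tasks.
```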
To perform well in both chain-of-thought and non-chain-of-thought prompting paradigms, both kinds of data should be included in the finetuning mixture.
8/12
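A minimal sketch of what such a mixture looks like; the records and field names are assumptions, not the paper's actual data format:

```python
# Building a finetuning mixture with both chain-of-thought (CoT) and
# direct-answer examples, so the model sees both answer formats.
import random

cot_data = [
    {"input": "Q: 3 cars each carry 4 people. How many people? "
              "Let's think step by step.",
     "target": "3 cars * 4 people = 12 people. The answer is 12."},
]
direct_data = [
    {"input": "Q: 3 cars each carry 4 people. How many people?",
     "target": "12"},
]

mixture = cot_data + direct_data
random.shuffle(mixture)  # interleave so batches contain both formats
```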
Flan finetuning (which includes chain-of-thought data) enables Flan-PaLM to benefit from chain-of-thought prompting in a zero-shot setting.
9/12
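For reference, zero-shot chain-of-thought prompting amounts to appending a reasoning trigger to the question, with no worked examples; `generate` below is a placeholder, not a real API:

```python
# Zero-shot chain-of-thought prompting: append a reasoning trigger
# instead of providing few-shot exemplars.
def zero_shot_cot(question: str) -> str:
    return f"Q: {question}\nA: Let's think step by step."

prompt = zero_shot_cot(
    "A juggler has 16 balls. Half are golf balls, and half of the golf "
    "balls are blue. How many blue golf balls are there?"
)
# completion = generate(prompt)  # placeholder for any text-generation call
print(prompt)
```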
It's useful to note that human preferences about open-ended model outputs may not correlate with NLP benchmark scores.
Still, human annotators prefer Flan-PaLM 540B to PaLM 540B by a healthy margin.
10/12
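One hedged way to quantify that (mis)match is a rank correlation between benchmark scores and preference win rates across model variants; all numbers below are invented, only the method is the point:

```python
# Checking how well benchmark scores track human preference win rates
# across model variants via rank correlation. Data is fabricated.
from scipy.stats import spearmanr

benchmark_scores = [41.3, 49.1, 58.4, 75.2]  # fake benchmark averages
human_win_rates  = [0.30, 0.55, 0.52, 0.79]  # fake pairwise win rates

rho, p_value = spearmanr(benchmark_scores, human_win_rates)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```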
If cats driving cars are your thing, Flan-PaLM can write funny poems.
11/12
Overall takeaway: instruction finetuning seems likely to be broadly applicable for pretrained language models.
For this study, datasets spanning 46 languages were gathered (collectively referred to as "xP3").
xP3 aims to mimic the distribution of languages found in ROOTS (the dataset used to pretrain BLOOM).
2/17
Three dataset variants were studied:
- English prompts on English datasets (P3)
- English prompts on multilingual datasets (xP3)
- Machine-translated prompts on multilingual datasets (xP3mt)
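A sketch of pulling one variant from the Hugging Face Hub; the `bigscience/xP3` repo ID is real, but the `"en"` config and the `inputs`/`targets` field names are assumptions, so check the dataset card before relying on them:

```python
# Streaming a few prompted examples from the xP3 mixture on the HF Hub.
from itertools import islice

from datasets import load_dataset

xp3_en = load_dataset("bigscience/xP3", "en", split="train", streaming=True)
for example in islice(xp3_en, 3):  # peek at a few examples
    print(example["inputs"][:80], "->", example["targets"][:40])
```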