How can you do multitask learning simply by uploading models?
Collaborative Descent (ColD) Fusion is simple:
Start from a pretrained model
Let contributors finetune on it and share their models
Fuse the shared models to get a new, better model
Take the improved model as the new base and repeat
What is fusing?
In short, it is creating one model from several
In practice, we just average the weights of the finetuned models, and that is good enough (sketch below)
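Here is a minimal sketch of one fusion round, assuming Hugging Face transformers and PyTorch; the contributor checkpoint names are hypothetical placeholders, and this is not the authors' actual pipeline:

```python
# Minimal sketch of one ColD Fusion round.
# Assumes torch + transformers; contributor checkpoints are hypothetical placeholders.
import torch
from transformers import AutoModel

def fuse(checkpoint_names):
    """Fuse = average the weights of several finetuned models with the same architecture."""
    state_dicts = [AutoModel.from_pretrained(name).state_dict() for name in checkpoint_names]
    fused = {}
    for key, ref in state_dicts[0].items():
        if ref.is_floating_point():
            fused[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        else:
            fused[key] = ref  # integer buffers (e.g. position ids) are copied, not averaged
    return fused

# One iteration: contributors finetune the current base on their tasks and share;
# we average their weights, and the fused model becomes the next round's base.
base = AutoModel.from_pretrained("roberta-base")
contributed = ["contributor-a/roberta-task1", "contributor-b/roberta-task2"]  # placeholders
base.load_state_dict(fuse(contributed))
base.save_pretrained("cold-fused-round-1")
```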
So we can iteratively collect finetuned models from the community and get better models. What could that achieve?
A) It learns the tasks contributed along the way (Fig)
B) It becomes a better pretrained model!
And it keeps improving with more data and more contributors!
As a pretrained model:
🟦ColD Fusion is just great!
🟩much better than multitasking
⬜️not to mention vanilla RoBERTa
We did not expect that, and we certainly did not expect to
🟧beat MUPPET!
MUPPET is the SoTA multitask model: more datasets, tuning, tweaks and all
We have none of those, just a new method
It also does as well on unseen datasets as on the 35 seen ones (top: yellow vs. blue)
But remember, it is a pretrained model; are you surprised that finetuning on data already seen during pretraining/multitasking is not that helpful?
More details in the paper...
ColD Fusion provides great benefits when only 100 labeled examples are available
*These improvements are few-shot, on unseen datasets
To sum up, ColD Fusion allows a model
to continually evolve
♻️just share and recycle
We have the algorithm; now we just need to start using it!
Creating a platform and scaling it up is our next goal. Interested? Contact us.
Let's do it together!
Main finding: Generalization in pretraining follows a single dimension
Different networks, architectures, seeds, and sizes, but:
Similar performance → similar linguistic capabilities
It all started at ICLR 2015(!) @goodfellow_ian @OriolVinyalsML @SaxeLab
They checked points interpolated between the random initialization and the converged model.
They found that the loss decreases monotonically along that line.
Why shouldn't it?
Well... The real question is why it should.
If the loss terrain were anything but a slope, we would expect bumps. Maybe there are different sinks (local minima), or you need to pass through a bad model before you reach the best one (topologically, you are in a ditch)
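A generic PyTorch sketch of that experiment (not the original code; `loss_fn` and `data_loader` are assumed to be a standard loss and an evaluation loader yielding `(x, y)` batches):

```python
# Sketch: evaluate the loss at points on the straight line between the random
# initialization and the converged model (generic reimplementation, assumptions noted above).
import copy
import torch

def interpolation_losses(init_model, trained_model, loss_fn, data_loader, steps=11):
    probe = copy.deepcopy(trained_model)
    init_sd, trained_sd = init_model.state_dict(), trained_model.state_dict()
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Blend the two weight vectors: alpha=0 is the random init, alpha=1 the trained model.
        blended = {k: (1 - alpha) * init_sd[k] + alpha * trained_sd[k] for k in trained_sd}
        probe.load_state_dict(blended)
        probe.eval()
        with torch.no_grad():
            total = sum(loss_fn(probe(x), y).item() for x, y in data_loader)
        losses.append(total / len(data_loader))
    return losses  # the surprising finding: this curve tends to decrease monotonically
```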
Model combination/ensembling:
Average ensembling is practical, but naive.
Combining while accounting for each network's strengths is much better!
Moreover, let's make the networks diverse so they have different strengths.
The basic idea is quite simple:
Given some models, why settle for the plain average? We want to rely on each one (or group) whenever it is more likely to be the correct one (see the sketch below).
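As an illustration (my own toy example, not the paper's exact method), here is a per-example weighted combination that falls back to plain averaging when no reliability weights are given:

```python
# Sketch: weight each model's prediction per example instead of averaging uniformly.
import numpy as np

def weighted_combine(prob_matrix, weights=None):
    """
    prob_matrix: (n_models, n_examples, n_classes) predicted probabilities.
    weights:     (n_models, n_examples) reliability of each model per example,
                 e.g. from a small gating network or validation statistics.
    """
    if weights is None:                                  # plain averaging = the naive baseline
        weights = np.ones(prob_matrix.shape[:2])
    w = weights / weights.sum(axis=0, keepdims=True)     # normalize per example
    return (w[..., None] * prob_matrix).sum(axis=0)      # rely more on the stronger model

# Example: model 0 is trusted on the first example, model 1 on the second.
probs = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.6, 0.4], [0.1, 0.9]]])
conf  = np.array([[0.8, 0.2],
                  [0.2, 0.8]])
print(weighted_combine(probs, conf))
```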
This was actually introduced in our previous work (as the authors acknowledge): aclanthology.org/W19-4414.pdf
The paper's additions: 1. Given a set of black-box models, we can train at least one of them with RL to be different from the rest. 2. We can use more sophisticated NNs to combine the outputs. 3. We can ignore domain knowledge for the combination (I am not sure this is a bonus).
Ever since MAEGE (aclanthology.org/P18-1127/) I have had a soft spot for evaluation of evaluation = EoE (especially when it is automatic, but even without that it is still fine).
Capturing formality - XLM-R with regression, not classification
Preservation - chrF, not BLEU
Fluency - XLM-R, but there is room for improvement
System ranking - XLM-R and chrF
Crosslingual transfer - rely on zero-shot, not machine translation
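A rough sketch of the two winning metric choices, assuming sacrebleu for chrF and transformers for XLM-R; the regression head below is freshly initialized for illustration (in practice it would be finetuned on formality ratings):

```python
# Sketch: chrF for meaning preservation, an XLM-R regression head for formality scoring.
# Assumes sacrebleu and transformers; the XLM-R head here is untrained (illustration only).
import torch
from sacrebleu.metrics import CHRF
from transformers import AutoTokenizer, AutoModelForSequenceClassification

outputs = ["Could you please send me the report?"]
references = [["Can you send me the report, please?"]]

# Preservation: chrF between system outputs and references (preferred over BLEU here).
print(CHRF().corpus_score(outputs, references).score)

# Formality: score each sentence with a regression (num_labels=1) XLM-R head.
name = "xlm-roberta-base"  # in practice: an XLM-R checkpoint finetuned to predict formality
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)
with torch.no_grad():
    batch = tok(outputs, return_tensors="pt", padding=True)
    print(model(**batch).logits.squeeze(-1))  # higher = more formal, once trained
```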