How can we do multitask learning by simply uploading models?
Collaborative Descent (ColD) Fusion is simple:
Start from a pretrained model
Let contributors finetune on it, and share their models
Fuse the models to get a new better model
Take the improved model as the new best model
What is fusing?
In short, it means creating one model from several
In practice, we just average the weights of the finetuned models, and that is good enough
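To make the recipe concrete, here is a minimal sketch of one ColD Fusion round in PyTorch, assuming contributors return full models with identical state_dict keys; `contributor_finetune` is a hypothetical placeholder for whatever each contributor runs locally.

```python
# A minimal sketch of one ColD Fusion round (assumptions: PyTorch models with
# identical state_dict keys; `contributor_finetune` is a hypothetical stand-in
# for whatever a contributor runs locally before uploading their model).
import copy
import torch


def fuse(models):
    """Fusing: average the weights of several finetuned models into one model."""
    fused_state = copy.deepcopy(models[0].state_dict())
    for key in fused_state:
        fused_state[key] = torch.stack(
            [m.state_dict()[key].float() for m in models]
        ).mean(dim=0)
    fused_model = copy.deepcopy(models[0])
    fused_model.load_state_dict(fused_state)
    return fused_model


def cold_fusion_round(best_model, contributor_tasks, contributor_finetune):
    # Each contributor starts from the current best model and finetunes on their own task...
    finetuned = [contributor_finetune(copy.deepcopy(best_model), task)
                 for task in contributor_tasks]
    # ...then the uploaded models are fused, and the result becomes the new best model.
    return fuse(finetuned)
```

Repeat the round with new contributors and new tasks, and the "pretrained" model keeps evolving.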
So we can iteratively collect finetuned models from the community and get better models. What could that achieve?
A) learn on the tasks contributed along the way (Fig)
B) become a better pretrained model!
and keep improving with more data and contributors!
As a pretrained model:
🟦ColD Fusion is just great!
🟩much better than multitasking
⬜️not to mention vanilla RoBERTa
We did not expect that, and we certainly did not expect to
🟧beat MUPPET!
MUPPET is the SoTA multitask model: more datasets, tuning, tweaks and all
We have none of those, just a new method
It also does as well on unseen datasets as on the 35 seen datasets (top: yellow vs. blue)
But remember, it is a pretrained model; are you surprised that having already seen the data in pretraining/multitask does not help finetuning much?
More details in the paper...
ColD Fusion provides great benefits when only 100 examples are available
*These improvements are few-shot, on unseen datasets
To sum up, ColD Fusion allows a model
to continually evolve
♻️just share and recycle
We have the algorithm, now we just need to start using it!
Creating a platform and scaling up is our next goal. Interested? Contact us.
Let's do it together!
LoRA, as you probably know, learns two matrices A and B (with a low dimension between them) in addition to W (some dense matrix, e.g., a fully connected layer)
So you learn
W + AB
DoRA adds another learned parameter, a magnitude m
m * (W + AB) / ||W + AB||   (the norm taken column-wise)
But why? You can always learn a larger or smaller AB, right? And the original pretrained W is already at the right magnitude.
I ask the same question. However, fine-tuning apparently does that and LoRA doesn't, so something is rotten in the state of PEFTmark 🇩🇰
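To make the contrast concrete, here is a minimal sketch of a DoRA-style linear layer next to the plain LoRA update; the column-wise norm and the initialization of m follow my reading of the formulation above, so treat it as illustrative, not as the authors' code.

```python
# A sketch of a DoRA-style linear layer vs. the plain LoRA update (assumptions:
# column-wise norm, m initialized to the pretrained weight's column magnitudes).
import torch
import torch.nn as nn


class DoRALinear(nn.Module):
    def __init__(self, W: torch.Tensor, rank: int = 8):
        super().__init__()
        out_dim, in_dim = W.shape
        self.W = nn.Parameter(W, requires_grad=False)             # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)   # low-rank factors, as in LoRA
        self.B = nn.Parameter(torch.zeros(out_dim, rank))
        # The extra DoRA parameter: a per-column magnitude, starting at ||W||.
        self.m = nn.Parameter(W.norm(dim=0, keepdim=True).detach().clone())

    def forward(self, x):
        # LoRA alone would use:          W + BA
        # DoRA rescales its columns to:  m * (W + BA) / ||W + BA||
        merged = self.W + self.B @ self.A
        direction = merged / (merged.norm(dim=0, keepdim=True) + 1e-8)
        return x @ (self.m * direction).T
```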
In-Context-Learning == gradient descent or disregards labels completely?!
Why not both?
Models recognize the task but also learn it
& The benefits of actual learning grow with # examples and model size
Jane Pan @gaotianyu1350 @__howardchen @danqi_chen arxiv.org/abs/2305.09731
So we have seen papers showing that models gain a lot from seeing examples (ICL) with random labels
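To be concrete about that setting, here is a toy sketch of an ICL prompt whose demonstration labels are random; the sentiment task and all wording are made up for illustration.

```python
# A toy sketch of an ICL prompt with random demonstration labels (the sentiment
# task and all wording are placeholders for illustration).
import random

demos = [
    ("The movie was fantastic.", "positive"),
    ("I hated every minute of it.", "negative"),
    ("A touching and beautiful story.", "positive"),
]

# Replace the gold labels with random ones: the input-label mapping is destroyed,
# but the format and label space still tell the model what the task is.
random_label_demos = [(text, random.choice(["positive", "negative"])) for text, _ in demos]

prompt = "".join(f"Review: {text}\nSentiment: {label}\n\n" for text, label in random_label_demos)
prompt += "Review: The plot made no sense at all.\nSentiment:"
print(prompt)
```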
The story is very simple
Autoregressive models predict the next word given the previous ones: annoying and, with a strong model, slow
Instead, they propose to use a fast model to predict the next words
Then check, for all of those words at once, whether the strong model agrees with them
For a bit more detail:
q - the strong model
p - the poor (fast) model
p generates the words x1 .. xn
q then calculates their probabilities (did I say in parallel?)
We accept them if q gives them high probability (eq in fig)
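Here is a hedged sketch of that accept step, in the thread's notation (q strong, p fast); the min(1, q/p) acceptance rule is the standard one from the speculative-sampling literature and may differ in detail from the equation in this paper's figure.

```python
# A sketch of the accept step (assumptions: acceptance probability min(1, q/p),
# as in standard speculative sampling; the paper's exact rule is in its figure).
import numpy as np


def accept_draft_tokens(draft_tokens, p_probs, q_probs, rng=None):
    """
    draft_tokens: tokens x1..xn proposed by the fast model p
    p_probs[i]:   probability p assigned to token i
    q_probs[i]:   probability the strong model q assigned to token i
                  (computed for all positions in one parallel pass)
    Returns the prefix of draft tokens that q "agrees" with.
    """
    rng = rng or np.random.default_rng()
    accepted = []
    for token, p_i, q_i in zip(draft_tokens, p_probs, q_probs):
        if rng.random() < min(1.0, q_i / p_i):
            accepted.append(token)   # q agrees: keep the cheap token
        else:
            break                    # first rejection: resample this position from q
    return accepted


print(accept_draft_tokens(["the", "cat", "sat"], [0.9, 0.8, 0.7], [0.85, 0.1, 0.6]))
```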
What is this Chain of Thought everyone is talking about?
Just asking the model to explain the answer, that's it.
It is a big deal because sometimes explaining before answering improves the answer
But not here (fig: rationale = before and explain = after)
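For a concrete picture of the two orderings, a toy prompt sketch; the wording is mine, not the paper's, and the only point is where the explanation goes relative to the answer.

```python
# A toy illustration of the two orderings (the wording is mine, not the paper's).
question = ("A bat and a ball cost $1.10 together; "
            "the bat costs $1.00 more than the ball. How much is the ball?")

# "rationale = before": the model explains first, so the answer can build on the reasoning.
rationale_first = f"{question}\nLet's think step by step, then give the final answer."

# "explain = after": the answer comes first, so the explanation cannot help it.
answer_first = f"{question}\nGive the final answer, then explain it."

print(rationale_first)
print(answer_first)
```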
To get better results in the multimodal task, they...
Wait for it...
Use both modalities
Inserting the image representation improves not only the rationale
but also the answers
(more on how in fig, actual arch in paper)
The setting is quite simple:
Take a smaller LM (8B vs. 100-500B in some baselines)
that is bilingual (one of the languages might be low-resource, see fig)
Show it a few translation examples in the prompt
Say abra kadabra 🪄
and you get a very good translation system
They reproduce known results and detail how to do it (especially for low-resource languages)
e.g., continue from previous training for speed/stability, and run many epochs on the monolingual training data (fig)
etc.
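For a concrete picture, here is a sketch of what such a few-shot translation prompt could look like; the language pair, example sentences, and template are placeholders, not the paper's.

```python
# A sketch of a few-shot translation prompt for a bilingual LM (the language
# pair, example sentences, and template are placeholders, not the paper's).
examples = [
    ("The cat sleeps on the sofa.", "Le chat dort sur le canapé."),
    ("I would like a cup of coffee.", "Je voudrais une tasse de café."),
]
source_sentence = "Where is the train station?"

prompt = "".join(f"English: {src}\nFrench: {tgt}\n\n" for src, tgt in examples)
prompt += f"English: {source_sentence}\nFrench:"

# Feed `prompt` to the bilingual LM and take its continuation as the translation.
print(prompt)
```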
Is data really important for pretraining?
Could we just pretrain on 1 picture? Only synthetic text? Fractals?
A 🧵 summing up the image and text papers that do just that.
and they all reach a similar conclusion 🤔
The main idea behind pretraining is that, given a hard enough loss, we can train on a lot of data and learn how the world works so well that we can easily transfer this knowledge to the tasks we really care about.
All this research defies common pretraining wisdom and wonders:
Is it really the huge amounts of data seen that make the difference?
Or is it something much more basic that is being learnt?