♻️ Leshem Choshen ♻️
Dec 5, 2022
We want to pretrain🤞
Instead we finetune🚮😔
Could we collaborate?🤗

ColD Fusion:
🔄Recycle finetuning to multitask
➡️evolve pretrained models forever

On 35 datasets
+2% improvement over RoBERTa
+7% in few shot settings
🧵

#NLProc #MachineLearning #NLP #ML #modelRecycling
We all wish to improve pretraining
If only we had unlimited compute and data...
Together, we do!

We propose a way to recycle finetuning
and transform it into multitask learning!

arxiv.org/abs/2212.01378

@Shachar_Don @VenezianElad @colinraffel @noamslonim @YoavKatz73 me
How can we do multitask learning simply by uploading models?

Collaborative Descent (ColD) Fusion is simple:
Start from a pretrained model
Let contributors finetune on it, and share their models
Fuse the models to get a new better model
Take the improved model as the new best model
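
Roughly, the loop looks like this. A toy sketch, assuming numpy arrays as stand-ins for model weights and a placeholder `finetune` for each contributor's training run (none of these names come from the paper):

```python
# Toy sketch of the ColD Fusion loop: contributors finetune the current model,
# the hub fuses (averages) their weights, and the fused model becomes the new start.
import numpy as np

def finetune(weights, task_seed):
    """Placeholder for a contributor's finetuning run on their own task."""
    rng = np.random.default_rng(task_seed)
    return weights + 0.01 * rng.standard_normal(weights.shape)

weights = np.zeros(10)                         # stand-in for the pretrained model
for iteration in range(5):                     # each round of collaboration
    contributions = [finetune(weights, task) for task in range(8)]
    weights = np.mean(contributions, axis=0)   # fuse = average the finetuned weights
# `weights` now plays the role of the improved pretrained model for the next round
```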
What is fusing?
In short, it is creating one model from several.
Practically, we just average the weights of the finetuned models, and that is good enough.

More fusing methods and details are in the paper.
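
For real checkpoints, the simple-averaging variant amounts to a per-parameter mean over the finetuned state dicts. A minimal PyTorch sketch, with small `nn.Linear` modules standing in for the contributed models:

```python
# Average the parameters of several finetuned models into one fused model.
import torch
import torch.nn as nn

finetuned = [nn.Linear(8, 2) for _ in range(3)]   # stand-ins for finetuned checkpoints
state_dicts = [m.state_dict() for m in finetuned]

# Per-parameter mean over all contributed models
fused = {name: torch.stack([sd[name] for sd in state_dicts]).mean(dim=0)
         for name in state_dicts[0]}

fused_model = nn.Linear(8, 2)
fused_model.load_state_dict(fused)                # the fused model is just the average
```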
So we can iteratively collect finetuned models from the community and get better models. What could that achieve?
A) learn the tasks contributed along the way (Fig)
B) become a better pretrained model,
and keep improving with more data and contributors!
As a pretrained model:
🟦ColD Fusion is just great!
🟩much better than multitasking
⬜️not to mention vanilla RoBERTa

We did not expect that, and we certainly did not expect to
🟧beat MUPPET!
The SoTA multitask model: more datasets, tuning, tweaks and all

We have none of those, just a new method
It also does as well on unseen datasets as on the 35 seen datasets (top lines: yellow and blue)

But remember, it is a pretrained model; are you surprised that finetuning on data already seen during pretraining/multitask training is not that helpful?

More details in the paper...
ColD Fusion provides great benefits when only 100 test examples are available

*These improvements are few-shot, on unseen datasets
To sum, ColD Fusion allows a model
to continually evolve
♻️just share and recycle

We have the algorithm, now we just need to start using it!
Creating a platform and scaling it up is our next goal. Interested? Contact us.
Let's do it together!

Models: huggingface.co/ibm/ColD-Fusion
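
If you want to try it, the checkpoint should load like any RoBERTa-style encoder. A hedged usage sketch, assuming the model id from the link above:

```python
# Load the ColD Fusion checkpoint like any other Hugging Face encoder.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm/ColD-Fusion")
model = AutoModel.from_pretrained("ibm/ColD-Fusion")

inputs = tokenizer("Recycle finetuning into pretraining!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # contextual embeddings, RoBERTa-style
```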

More from @LChoshen

Feb 15
DoRA explores the magnitude and direction and
surpasses LoRA quite significantly

This is done with an empirical finding that I can't wrap my head around
@NVIDIAAI

@nbasyl_tw @chienyi_wang @yin_hongxu @PavloMolchanov @CMHungSteven arxiv.org/abs/2402.09353
LoRA, as you probably know, learns an addition to W (some dense matrix, e.g. a fully connected layer): AB, a product of two low-rank matrices.
So you learn
W + AB
DoRA adds another parameter, a magnitude m:
m · (W + AB) / ‖W + AB‖
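
Here is my minimal sketch of that reparameterization in PyTorch, just to make the formula concrete. The norm axis, the initialization, and the A/B ordering are my assumptions, not the official DoRA code:

```python
# Sketch of a DoRA-style linear layer: frozen W, low-rank AB, learned magnitude m.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        out_f, in_f = base.weight.shape
        self.register_buffer("W", base.weight.detach().clone())  # frozen pretrained W
        self.A = nn.Parameter(0.01 * torch.randn(out_f, rank))
        self.B = nn.Parameter(torch.zeros(rank, in_f))            # AB starts at zero
        # magnitude m, initialized to the column norms of W so training starts at W
        self.m = nn.Parameter(self.W.norm(dim=0, keepdim=True).clone())

    def forward(self, x):
        merged = self.W + self.A @ self.B                         # W + AB
        direction = merged / merged.norm(dim=0, keepdim=True)     # unit-norm columns
        return F.linear(x, self.m * direction)                    # m * (W+AB)/||W+AB||

layer = DoRALinear(nn.Linear(16, 16))
print(layer(torch.randn(2, 16)).shape)   # torch.Size([2, 16])
```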
But why? You can always learn a larger or smaller AB, right? And the original pretrained W is already at the right magnitude.
I ask the same question. However, full fine-tuning apparently does change the magnitude while LoRA doesn't, so something is rotten in the state of PEFT-mark🇩🇰
May 22, 2023
In-Context-Learning == gradient descent or disregards labels completely?!
Why not both?

Models recognize the task but also learn it
& The benefits of actual learning grow with # examples and model size
Jane Pan @gaotianyu1350 @__howardchen @danqi_chen
arxiv.org/abs/2305.09731
So we have seen papers showing that models gain a lot from seeing examples (ICL) even with random labels
We have also seen claims that ICL is quite similar to taking a gradient step

@akyurekekin @tengyuma @jacobandreas @denny_zhou
&
Feb 7, 2023
Parallel generation from autoregressive LMs
Para ll el!
Well, not exactly: use a fast LM to propose the next words first
arxiv.org/abs/2302.01318
@sebastiangoldt @advani_madhu @SaxeLab @KrzakalaF @zdeborova
@DeepMind
The story is very simple.
Autoregressive models predict the next word given the previous ones; annoying and, with a strong model, slow.
Instead, they propose to use a fast model to predict the next words,
then check, for all of those words at once, whether the strong model agrees with them.
For a bit more detail:
q - strong model
p - poor (fast) model
p generates x1 .. xn words
q then calculates their probabilities (did I say in parallel?)
We accept them if q gives them high enough probability (eq in fig)
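
A toy sketch of that accept/reject step, my simplification of speculative sampling (real implementations work on the full distributions of both models and handle edge cases I skip here):

```python
# Toy speculative sampling step: accept drafted tokens with prob min(1, q/p),
# resample from the residual on the first rejection, drop everything after it.
import random

def speculative_step(p_dists, q_dists, proposed):
    """p_dists/q_dists: per-position dicts token -> prob; proposed: tokens drafted by p."""
    accepted = []
    for i, tok in enumerate(proposed):
        p, q = p_dists[i][tok], q_dists[i][tok]
        if random.random() < min(1.0, q / p):          # accept with probability min(1, q/p)
            accepted.append(tok)
        else:
            # reject: resample this position from the residual distribution max(q - p, 0)
            residual = {t: max(q_dists[i][t] - p_dists[i].get(t, 0.0), 0.0) for t in q_dists[i]}
            total = sum(residual.values()) or 1.0
            accepted.append(random.choices(list(residual),
                                           [v / total for v in residual.values()])[0])
            break                                      # later drafted tokens are thrown away
    return accepted

p = [{"the": 0.6, "a": 0.4}, {"cat": 0.5, "dog": 0.5}]
q = [{"the": 0.7, "a": 0.3}, {"cat": 0.2, "dog": 0.8}]
print(speculative_step(p, q, ["the", "cat"]))
```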
Feb 7, 2023
Chain of Thought for vision
beating GPT3 by 16% and supposedly humans

Text and captions are not enough, but
with vision CoT does really well

@zhangzhuosheng @astonzhangAZ @mli65 Hai Zhao @karypis @smolix
arxiv.org/abs/2302.00923
What is this Chain of Thought everyone is talking about?
Just asking the model to explain the answer, that's it.
It is a big deal because sometimes explaining before answering improves the answer
But not here (fig: rationale = before and explain = after)
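
In prompt form, the difference is just where the reasoning goes. A toy illustration, not the paper's exact templates:

```python
# Toy illustration of "explain after" vs. "reason before" (chain-of-thought) prompting.
question = "Which is magnetic: a nail or a rubber band?"

explain_after = f"Q: {question}\nA: The answer is ___. Explanation: ..."
reason_before = f"Q: {question}\nA: Let's think step by step. ... Therefore, the answer is ___."

print(explain_after)
print(reason_before)
```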
To get better results on the multimodal task, they...
Wait for it...
Use both modalities!
Inserting the image representation too improves the rationale,
but also the answers
(more on how in the fig, actual architecture in the paper)
Feb 6, 2023
Few-shot learning almost reaches traditional machine translation

Xavier Garcia @whybansal @ColinCherry George Foster, Maxim Krikun @fengfangxiaoyu @melvinjohnsonp @orf_bnw
arxiv.org/abs/2302.01398
#enough2skim #NLProc #neuralEmpty
The setting is quite simple:
Take a smaller LM (8B vs. 100-500B in some baselines),
bilingual in two languages (one might be low-resource, see fig).
Show it a few translation examples in the prompt,
say abra kadabra 🪄,
and you get a very good translation system.
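
The prompt itself is nothing fancy: a few source→target pairs followed by the new source sentence. A toy template of my own, not the paper's:

```python
# Build a k-shot translation prompt from example pairs.
examples = [("Bonjour", "Hello"), ("Merci beaucoup", "Thank you very much")]
source = "Où est la gare ?"

prompt = "".join(f"French: {src}\nEnglish: {tgt}\n\n" for src, tgt in examples)
prompt += f"French: {source}\nEnglish:"
print(prompt)
```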
They reproduce known results and detail how to do it (especially for low-resource),
e.g., continue from previous training for speed/stability, run many epochs on the monolingual training data (fig),
etc.
Oct 26, 2022
Is data really important for pretraining?
Could we just pretrain on 1 picture? Only synthetic text? Fractals?
A 🧵 summing the image and text papers that do just that.
and they all have a similar conclusion🤔
The main idea behind pretraining claims that, given some hard enough loss, we can train on a lot of data and learn how the world works so well, that we could easily transfer this knowledge to perform the tasks we really care about.
All this research defies common pretraining wisdom and wonders:
Is it really the huge amounts of data seen that make the difference?
Or is it something much more basic learnt?
