How can we do multitask learning by simply uploading models?
Collaborative Descent (ColD) Fusion is simple:
Start from a pretrained model
Let contributors finetune on it, and share their models
Fuse the models to get a new better model
Take the improved model as the new best model
What is fusing?
In short, it means creating one model from several
In practice, we just average the weights of the finetuned models, and that is good enough
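To make the recipe concrete, here is a minimal sketch of one ColD Fusion round in PyTorch, assuming contributors return full models with identical state_dict keys; `contributor_finetune` is a hypothetical placeholder for whatever each contributor runs locally.

```python
# A minimal sketch of one ColD Fusion round (assumptions: PyTorch models with
# identical state_dict keys; `contributor_finetune` is a hypothetical stand-in
# for whatever a contributor runs locally before uploading their model).
import copy
import torch


def fuse(models):
    """Fusing: average the weights of several finetuned models into one model."""
    fused_state = copy.deepcopy(models[0].state_dict())
    for key in fused_state:
        fused_state[key] = torch.stack(
            [m.state_dict()[key].float() for m in models]
        ).mean(dim=0)
    fused_model = copy.deepcopy(models[0])
    fused_model.load_state_dict(fused_state)
    return fused_model


def cold_fusion_round(best_model, contributor_tasks, contributor_finetune):
    # Each contributor starts from the current best model and finetunes on their own task...
    finetuned = [contributor_finetune(copy.deepcopy(best_model), task)
                 for task in contributor_tasks]
    # ...then the uploaded models are fused, and the result becomes the new best model.
    return fuse(finetuned)
```

Repeat the round with new contributors and new tasks, and the "pretrained" model keeps evolving.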
So we can iteratively collect finetuned models from the community and get better models. What could that achieve?
A) learn on the tasks contributed along the way (Fig)
B) become a better pretrained model!
and keep improving with more data and contributors!
As a pretrained model:
🟦ColD Fusion is just great!
🟩much better than multitasking
⬜️not to mention vanilla RoBERTa
We did not expect that, and we certainly did not expect to
🟧beat MUPPET!
MUPPET is the SoTA multitask model: more datasets, tuning, tweaks and all
We have none of those, just a new method
It also does as well on unseen datasets as on the 35 seen datasets (top: yellow vs. blue)
But remember, it is a pretrained model; are you surprised that having already seen the data in pretraining/multitask does not help finetuning much?
More details in the paper...
ColD Fusion provides great benefits when only 100 examples are available
*These improvements are few-shot, on unseen datasets
To sum up, ColD Fusion allows a model
to continually evolve
♻️just share and recycle
We have the algorithm, now we just need to start using it!
Creating a platform and scaling up is our next goal. Interested? Contact us.
Let's do it together!
LoRA, as you probably know, learns two matrices A and B (with a low dimension between them) in addition to W (some dense matrix, e.g., a fully connected layer)
So you learn
W + AB
DoRA adds another learned parameter, a magnitude m
m * (W + AB) / ||W + AB||   (the norm taken column-wise)
But why? You can always learn a larger or smaller AB, right? And the original pretrained W is already at the right magnitude.
I ask the same question. However, fine-tuning apparently does that and LoRA doesn't, so something is rotten in the state of PEFTmark 🇩🇰
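To make the contrast concrete, here is a minimal sketch of a DoRA-style linear layer next to the plain LoRA update; the column-wise norm and the initialization of m follow my reading of the formulation above, so treat it as illustrative, not as the authors' code.

```python
# A sketch of a DoRA-style linear layer vs. the plain LoRA update (assumptions:
# column-wise norm, m initialized to the pretrained weight's column magnitudes).
import torch
import torch.nn as nn


class DoRALinear(nn.Module):
    def __init__(self, W: torch.Tensor, rank: int = 8):
        super().__init__()
        out_dim, in_dim = W.shape
        self.W = nn.Parameter(W, requires_grad=False)             # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)   # low-rank factors, as in LoRA
        self.B = nn.Parameter(torch.zeros(out_dim, rank))
        # The extra DoRA parameter: a per-column magnitude, starting at ||W||.
        self.m = nn.Parameter(W.norm(dim=0, keepdim=True).detach().clone())

    def forward(self, x):
        # LoRA alone would use:          W + BA
        # DoRA rescales its columns to:  m * (W + BA) / ||W + BA||
        merged = self.W + self.B @ self.A
        direction = merged / (merged.norm(dim=0, keepdim=True) + 1e-8)
        return x @ (self.m * direction).T
```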
In-Context-Learning == gradient descent or disregards labels completely?!
Why not both?
Models recognize the task but also learn it
& The benefits of actual learning grow with # examples and model size
Jane Pan @gaotianyu1350 @__howardchen @danqi_chen arxiv.org/abs/2305.09731
So we have seen papers showing that models gain a lot from seeing examples (ICL) with random labels
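To be concrete about that setting, here is a toy sketch of an ICL prompt whose demonstration labels are random; the sentiment task and all wording are made up for illustration.

```python
# A toy sketch of an ICL prompt with random demonstration labels (the sentiment
# task and all wording are placeholders for illustration).
import random

demos = [
    ("The movie was fantastic.", "positive"),
    ("I hated every minute of it.", "negative"),
    ("A touching and beautiful story.", "positive"),
]

# Replace the gold labels with random ones: the input-label mapping is destroyed,
# but the format and label space still tell the model what the task is.
random_label_demos = [(text, random.choice(["positive", "negative"])) for text, _ in demos]

prompt = "".join(f"Review: {text}\nSentiment: {label}\n\n" for text, label in random_label_demos)
prompt += "Review: The plot made no sense at all.\nSentiment:"
print(prompt)
```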
The story is very simple
Autoregressive models predict the next word given the previous ones: annoying and, with a strong model, slow
Instead, they propose to use a fast model to predict the next words
Then check, for all of those words at once, whether the strong model agrees with them
For a bit more detail:
q - the strong model
p - the poor (fast) model
p generates the words x1 .. xn
q then calculates their probabilities (did I say in parallel?)
We accept them if q gives them high probability (eq in fig)
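Here is a hedged sketch of that accept step, in the thread's notation (q strong, p fast); the min(1, q/p) acceptance rule is the standard one from the speculative-sampling literature and may differ in detail from the equation in this paper's figure.

```python
# A sketch of the accept step (assumptions: acceptance probability min(1, q/p),
# as in standard speculative sampling; the paper's exact rule is in its figure).
import numpy as np


def accept_draft_tokens(draft_tokens, p_probs, q_probs, rng=None):
    """
    draft_tokens: tokens x1..xn proposed by the fast model p
    p_probs[i]:   probability p assigned to token i
    q_probs[i]:   probability the strong model q assigned to token i
                  (computed for all positions in one parallel pass)
    Returns the prefix of draft tokens that q "agrees" with.
    """
    rng = rng or np.random.default_rng()
    accepted = []
    for token, p_i, q_i in zip(draft_tokens, p_probs, q_probs):
        if rng.random() < min(1.0, q_i / p_i):
            accepted.append(token)   # q agrees: keep the cheap token
        else:
            break                    # first rejection: resample this position from q
    return accepted


print(accept_draft_tokens(["the", "cat", "sat"], [0.9, 0.8, 0.7], [0.85, 0.1, 0.6]))
```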
What is this Chain of Thought everyone is talking about?
Just asking the model to explain the answer, that's it.
It is a big deal because sometimes explaining before answering improves the answer
But not here (fig: rationale = before and explain = after)
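For a concrete picture of the two orderings, a toy prompt sketch; the wording is mine, not the paper's, and the only point is where the explanation goes relative to the answer.

```python
# A toy illustration of the two orderings (the wording is mine, not the paper's).
question = ("A bat and a ball cost $1.10 together; "
            "the bat costs $1.00 more than the ball. How much is the ball?")

# "rationale = before": the model explains first, so the answer can build on the reasoning.
rationale_first = f"{question}\nLet's think step by step, then give the final answer."

# "explain = after": the answer comes first, so the explanation cannot help it.
answer_first = f"{question}\nGive the final answer, then explain it."

print(rationale_first)
print(answer_first)
```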
To get better results in the multimodal task, they...
Wait for it...
Use both modalities
Inserting the image representation improves not only the rationale
but also the answers
(more on how in fig, actual arch in paper)
The setting is quite simple:
Take a smaller LM (8B vs. 100-500B in some baselines)
that is bilingual (one of the languages might be low-resource, see fig)
Show it a few translation examples in the prompt
Say abra kadabra 🪄
and you get a very good translation system
They reproduce known results and detail how to do it (especially for low-resource languages)
e.g., continue from previous training for speed/stability, and run many epochs on the monolingual training data (fig)
etc.
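For a concrete picture, here is a sketch of what such a few-shot translation prompt could look like; the language pair, example sentences, and template are placeholders, not the paper's.

```python
# A sketch of a few-shot translation prompt for a bilingual LM (the language
# pair, example sentences, and template are placeholders, not the paper's).
examples = [
    ("The cat sleeps on the sofa.", "Le chat dort sur le canapé."),
    ("I would like a cup of coffee.", "Je voudrais une tasse de café."),
]
source_sentence = "Where is the train station?"

prompt = "".join(f"English: {src}\nFrench: {tgt}\n\n" for src, tgt in examples)
prompt += f"English: {source_sentence}\nFrench:"

# Feed `prompt` to the bilingual LM and take its continuation as the translation.
print(prompt)
```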
Is data really important for pretraining?
Could we just pretrain on 1 picture? Only synthetic text? Fractals?
A 🧵 summing up the image and text papers that do just that.
and they all reach a similar conclusion 🤔
The main idea behind pretraining is that, given a hard enough loss, we can train on a lot of data and learn how the world works so well that we can easily transfer this knowledge to the tasks we really care about.
All this research defies common pretraining wisdom and wonders:
Is it really the huge amounts of data seen that make the difference?
Or is it something much more basic that is being learnt?