How can you do multitask learning simply by uploading models?
Collaborative Descent (ColD) Fusion is simple:
Start from a pretrained model
Let contributors finetune on it and share their models
Fuse the shared models to get a new, better model
Take the improved model as the new base and repeat
What is fusing?
In short, it is creating one model from several
In practice, we just average the weights of the finetuned models, and that is good enough (sketch below)
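Here is a minimal sketch of one fusion round, assuming Hugging Face transformers and PyTorch; the contributor checkpoint names are hypothetical placeholders, and this is not the authors' actual pipeline:

```python
# Minimal sketch of one ColD Fusion round.
# Assumes torch + transformers; contributor checkpoints are hypothetical placeholders.
import torch
from transformers import AutoModel

def fuse(checkpoint_names):
    """Fuse = average the weights of several finetuned models with the same architecture."""
    state_dicts = [AutoModel.from_pretrained(name).state_dict() for name in checkpoint_names]
    fused = {}
    for key, ref in state_dicts[0].items():
        if ref.is_floating_point():
            fused[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        else:
            fused[key] = ref  # integer buffers (e.g. position ids) are copied, not averaged
    return fused

# One iteration: contributors finetune the current base on their tasks and share;
# we average their weights, and the fused model becomes the next round's base.
base = AutoModel.from_pretrained("roberta-base")
contributed = ["contributor-a/roberta-task1", "contributor-b/roberta-task2"]  # placeholders
base.load_state_dict(fuse(contributed))
base.save_pretrained("cold-fused-round-1")
```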
So we can iteratively collect finetuned models from the community and get better models. What could that achieve?
A) It learns the tasks contributed along the way (Fig)
B) It becomes a better pretrained model!
And it keeps improving with more data and more contributors!
As a pretrained model:
🟦ColD Fusion is just great!
🟩much better than multitasking
⬜️not to mention vanilla RoBERTa
We did not expect that, and we certainly did not expect to
🟧beat MUPPET!
MUPPET is the SoTA multitask model: more datasets, tuning, tweaks and all
We have none of those, just a new method
It also does as well on unseen datasets as on the 35 seen ones (top: yellow vs. blue)
But remember, it is a pretrained model; are you surprised that finetuning on data already seen during pretraining/multitasking is not that helpful?
More details in the paper...
ColD Fusion provides great benefits when only 100 labeled examples are available
*These improvements are few-shot, on unseen datasets
To sum up, ColD Fusion allows a model
to continually evolve
♻️just share and recycle
We have the algorithm; now we just need to start using it!
Creating a platform and scaling it up is our next goal. Interested? Contact us.
Let's do it together!
Main finding: Generalization in pretraining follows a single dimension
Different networks, architectures, seeds, and sizes, but:
Similar performance → similar linguistic capabilities
It all started at ICLR 2015(!) @goodfellow_ian @OriolVinyalsML @SaxeLab
They checked points interpolated between the random initialization and the converged model.
They found that the loss decreases monotonically along that line.
Why shouldn't it?
Well... The real question is why it should.
If the loss terrain were anything but a slope, we would expect bumps. Maybe there are different sinks (local minima), or you need to pass through a bad model before you reach the best one (topologically, you are in a ditch)
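A generic PyTorch sketch of that experiment (not the original code; `loss_fn` and `data_loader` are assumed to be a standard loss and an evaluation loader yielding `(x, y)` batches):

```python
# Sketch: evaluate the loss at points on the straight line between the random
# initialization and the converged model (generic reimplementation, assumptions noted above).
import copy
import torch

def interpolation_losses(init_model, trained_model, loss_fn, data_loader, steps=11):
    probe = copy.deepcopy(trained_model)
    init_sd, trained_sd = init_model.state_dict(), trained_model.state_dict()
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Blend the two weight vectors: alpha=0 is the random init, alpha=1 the trained model.
        blended = {k: (1 - alpha) * init_sd[k] + alpha * trained_sd[k] for k in trained_sd}
        probe.load_state_dict(blended)
        probe.eval()
        with torch.no_grad():
            total = sum(loss_fn(probe(x), y).item() for x, y in data_loader)
        losses.append(total / len(data_loader))
    return losses  # the surprising finding: this curve tends to decrease monotonically
```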
Model combination/ensembling:
Average ensembling is practical, but naive.
Combining while accounting for each network's strengths is much better!
Moreover, let's make the networks diverse so they have different strengths.
The basic idea is quite simple:
Given some models, why settle for the plain average? We want to rely on each one (or group) whenever it is more likely to be the correct one (see the sketch below).
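As an illustration (my own toy example, not the paper's exact method), here is a per-example weighted combination that falls back to plain averaging when no reliability weights are given:

```python
# Sketch: weight each model's prediction per example instead of averaging uniformly.
import numpy as np

def weighted_combine(prob_matrix, weights=None):
    """
    prob_matrix: (n_models, n_examples, n_classes) predicted probabilities.
    weights:     (n_models, n_examples) reliability of each model per example,
                 e.g. from a small gating network or validation statistics.
    """
    if weights is None:                                  # plain averaging = the naive baseline
        weights = np.ones(prob_matrix.shape[:2])
    w = weights / weights.sum(axis=0, keepdims=True)     # normalize per example
    return (w[..., None] * prob_matrix).sum(axis=0)      # rely more on the stronger model

# Example: model 0 is trusted on the first example, model 1 on the second.
probs = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.6, 0.4], [0.1, 0.9]]])
conf  = np.array([[0.8, 0.2],
                  [0.2, 0.8]])
print(weighted_combine(probs, conf))
```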
This was actually introduced in our previous work (as the authors acknowledge): aclanthology.org/W19-4414.pdf
The paper's additions: 1. Given a set of black-box models, we can train at least one of them with RL to be different from the rest. 2. We can use more sophisticated NNs to combine the outputs. 3. We can ignore domain knowledge for the combination (I am not sure this is a bonus).
Ever since MAEGE (aclanthology.org/P18-1127/) I have had a soft spot for evaluation of evaluation = EoE (especially when it is automatic, but even without that it is still fine).
Capturing formality - XLM-R with regression, not classification
Preservation - chrF, not BLEU
Fluency - XLM-R, but there is room for improvement
System ranking - XLM-R and chrF
Crosslingual transfer - rely on zero-shot, not machine translation
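A rough sketch of the two winning metric choices, assuming sacrebleu for chrF and transformers for XLM-R; the regression head below is freshly initialized for illustration (in practice it would be finetuned on formality ratings):

```python
# Sketch: chrF for meaning preservation, an XLM-R regression head for formality scoring.
# Assumes sacrebleu and transformers; the XLM-R head here is untrained (illustration only).
import torch
from sacrebleu.metrics import CHRF
from transformers import AutoTokenizer, AutoModelForSequenceClassification

outputs = ["Could you please send me the report?"]
references = [["Can you send me the report, please?"]]

# Preservation: chrF between system outputs and references (preferred over BLEU here).
print(CHRF().corpus_score(outputs, references).score)

# Formality: score each sentence with a regression (num_labels=1) XLM-R head.
name = "xlm-roberta-base"  # in practice: an XLM-R checkpoint finetuned to predict formality
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)
with torch.no_grad():
    batch = tok(outputs, return_tensors="pt", padding=True)
    print(model(**batch).logits.squeeze(-1))  # higher = more formal, once trained
```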