The setting is quite simple:
Take a smaller bilingual LM (8B, vs. 100-500B in some baselines)
trained on two languages (one of which might be low resource, see fig)
Show it a few translation examples in the prompt
Say abra kadabra 🪄
and you get a very good translation system
They reproduce known results and detail how to do it (especially for low-resource languages)
e.g., continue from the previous training for speed/stability, run many epochs on the monolingual data (fig)
etc.
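To make the setup concrete, here is a minimal sketch of what such a few-shot translation prompt could look like; the language pair, example sentences, and the bilingual_lm.generate() call are my own placeholders, not the paper's exact format:

```python
# A sketch of a few-shot translation prompt; the language pair, example
# sentences, and bilingual_lm.generate() are illustrative placeholders,
# not the paper's exact format.
examples = [
    ("English: The cat sleeps.", "German: Die Katze schläft."),
    ("English: It is raining today.", "German: Heute regnet es."),
]

def build_prompt(source_sentence, shots):
    """Concatenate a few translation pairs, then the sentence to translate."""
    lines = []
    for src, tgt in shots:
        lines.append(src)
        lines.append(tgt)
    lines.append(f"English: {source_sentence}")
    lines.append("German:")  # the model continues from here with its translation
    return "\n".join(lines)

prompt = build_prompt("Where is the train station?", examples)
# translation = bilingual_lm.generate(prompt)  # hypothetical model call
```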
Also, they replicate the result that the quality of the few-shot examples matters, and add that
matching their style to the required output matters too (matched vs. mismatched, bottom row in fig)
I was left wondering: if this is so good, wouldn't adding supervised training, back-translation and everything else make it much better still? If not, then this is really a trade-off statement about how hard it is to improve further. If it would, then maybe the extra effort is worth it for some.
• • •
The story is very simple
Autoregressive models predict the next word given the previous ones, which is annoying and - with a strong model - slow
Instead, they propose to use a fast model to predict the next few words
Then check, for all of those words at once, whether the strong model agrees with them
For a bit more detail:
q - strong model
p - poor model
p generated x1 .. xn words
q then calculates their probabilities (did I say in parallel?)
We accept each of them if q gives it high enough probability (eq in fig)
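A minimal sketch of that accept/reject step in the thread's notation (q = strong model, p = poor/draft model). The acceptance rule below is the standard speculative-sampling one; the exact equation is in the paper's figure, and the arrays here are placeholders:

```python
import numpy as np

def speculative_accept(q_probs, p_probs, draft_tokens, rng=None):
    """Accept/reject the draft tokens x1..xn (thread's notation: q strong, p poor).

    q_probs[i] / p_probs[i] are each model's full next-token distributions at
    position i (computed by q in one parallel pass); draft_tokens[i] is the
    token p sampled there. This is a sketch of the standard speculative
    sampling rule, not the paper's exact code.
    """
    rng = rng or np.random.default_rng()
    out = []
    for i, x in enumerate(draft_tokens):
        q_x, p_x = q_probs[i][x], p_probs[i][x]
        # Accept with probability min(1, q(x) / p(x)): tokens the strong model
        # likes at least as much as the draft did are always kept.
        if rng.random() < min(1.0, q_x / p_x):
            out.append(x)
        else:
            # On rejection, resample from the renormalized residual (q - p)+,
            # which keeps the overall output distributed exactly like q.
            residual = np.maximum(q_probs[i] - p_probs[i], 0.0)
            out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            break  # everything after the first rejection is thrown away
    return out
```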
What is this Chain of Thought everyone is talking about?
Just asking the model to explain the answer, that's it.
It is a big deal because sometimes explaining before answering improves the answer
But not here (in the fig: rationale = explaining before the answer, explanation = after)
To get better results in the multimodal task, they...
Wait for it...
Use both modalities
Inserting the image representation too improves the rationale
but also the answers
(more on how in fig, actual arch in paper)
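A rough sketch of that two-stage idea, feeding the image representation into both stages; `model`, encode_image() and generate() are hypothetical stand-ins, not the paper's architecture (which fuses the vision features inside the network):

```python
# Two stages: generate a rationale first, then answer with the rationale in
# context, with the image representation available in both stages.
def multimodal_cot(model, question, image):
    image_features = model.encode_image(image)  # hypothetical vision encoder
    rationale = model.generate(
        text=f"Question: {question}\nRationale:",
        image=image_features,
    )
    answer = model.generate(
        text=f"Question: {question}\nRationale: {rationale}\nAnswer:",
        image=image_features,
    )
    return rationale, answer
```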
How to do multitask learning simply by uploading models?
Collaborative Descent (ColD) Fusion is simple:
Start from a pretrained model
Let contributors finetune on it, and share their models
Fuse the models to get a new better model
Take the improved model as the new best model
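A minimal sketch of one such round, assuming the fuse step is plain parameter averaging; the finetune() call and task list are hypothetical, see the paper for the exact recipe:

```python
import copy
import torch

def fuse(models):
    """Average the contributors' finetuned weights into a single fused model.

    Assumes all models share the same architecture; plain parameter averaging
    is used as the fuse step here (a sketch, not necessarily the exact recipe).
    """
    fused = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in fused.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
            param.copy_(stacked.mean(dim=0))
    return fused

# One round of the loop described above (finetune() and tasks are hypothetical):
# best = pretrained_model
# for _ in range(num_rounds):
#     contributions = [finetune(copy.deepcopy(best), task) for task in tasks]
#     best = fuse(contributions)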
Main finding: Generalization in pretraining follows a single dimension
Different networks, architectures, seeds, sizes but:
Similar performance → similar linguistic capabilities
It all started at ICLR 2015(!) @goodfellow_ian @OriolVinyalsML @SaxeLab
They checked points on the straight line between the random initialization and the converged model.
They found that, along this line, the loss decreases monotonically.
Why shouldn't it?
Well... The real question is why should it.
If the loss terrain were anything but a slope, we would expect bumps. Maybe there are different sinks (local minima), or maybe you need to pass through a bad model before you reach the best one (topologically, you are in a ditch).
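A minimal sketch of that experiment: interpolate linearly between the two weight vectors and evaluate the loss at each point (loss_fn and the flattened parameter vectors theta_init / theta_final are placeholders):

```python
import numpy as np

def loss_along_line(loss_fn, theta_init, theta_final, num_points=25):
    """Evaluate the loss on the straight line from the random initialization
    to the converged weights (loss_fn and the flattened parameter vectors
    theta_init / theta_final are placeholders)."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = []
    for alpha in alphas:
        theta = (1 - alpha) * theta_init + alpha * theta_final  # interpolated weights
        losses.append(loss_fn(theta))
    return alphas, np.array(losses)

# The surprise: plotting losses against alphas typically gives a smooth,
# monotonically decreasing curve - no bumps or barriers along the way.
```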