Leshem Choshen

Mar 21 • 22 tweets • 8 min read

During training, your loss goes up and down up and down up and down.

But how would it go if you magically went in a straight line
from init to learnt position?

Apparently smoothly down!

On the surprising Linear Interpolation:
#scientivism #deepRead #MachineLearning

It all started at ICLR 2015(!)
@goodfellow_ian @OriolVinyalsML @SaxeLab
checked points along the straight line between the random initialization and the converged model.
They found that the loss along this line decreases monotonically.
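
In symbols, a sketch with notation assumed here for illustration (theta_0 for the init, theta_T for the converged weights, L for the training loss; none of these symbols come from the thread), the interpolation and the monotonicity claim read:

```latex
\[
\theta(\alpha) = (1-\alpha)\,\theta_{0} + \alpha\,\theta_{T}, \qquad \alpha \in [0, 1]
\]
\[
\mathcal{L}\bigl(\theta(\alpha_{1})\bigr) \ge \mathcal{L}\bigl(\theta(\alpha_{2})\bigr)
\quad \text{for all } 0 \le \alpha_{1} < \alpha_{2} \le 1
\]
```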

Why shouldn't it?
Well... The real question is why should it.

If the loss terrain is anything but a slope, we would expect bumps. Maybe there are different sinks (local minima), or you need to pass through a worse model before you reach the best model (topologically, you are in a ditch).

One way to look at this interpolation is to think of the graph as a slice of the loss space.
A straight line between init and the converged model looks as above (going down a hill).
If you go up a hill before getting down, you hit a barrier.
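
A minimal sketch of how such a slice can be computed, assuming the model's parameters are flattened into one NumPy vector. The helper names (`interpolate`, `loss_slice`, `barrier_height`) and the toy linear-regression loss are made up for illustration and do not come from any of the papers. Measuring a barrier as the largest climb above the lowest loss seen so far is also just one possible choice.

```python
import numpy as np

def interpolate(theta_init, theta_final, alpha):
    """Point on the straight line: alpha=0 is init, alpha=1 is the converged model."""
    return (1.0 - alpha) * theta_init + alpha * theta_final

def loss_slice(loss_fn, theta_init, theta_final, num_points=21):
    """Evaluate the loss at evenly spaced points between init and the converged model."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn(interpolate(theta_init, theta_final, a)) for a in alphas])
    return alphas, losses

def barrier_height(losses):
    """Largest climb above the lowest loss seen so far along the path (0 means no barrier)."""
    running_min = np.minimum.accumulate(losses)
    return float(np.max(losses - running_min))

# Toy stand-in for a network (an assumption): linear regression with a mean-squared-error loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)

def mse(theta):
    return float(np.mean((X @ theta - y) ** 2))

theta_init = rng.normal(size=5)                       # "random init"
theta_final, *_ = np.linalg.lstsq(X, y, rcond=None)   # "converged" weights

alphas, losses = loss_slice(mse, theta_init, theta_final)
print("barrier height:", barrier_height(losses))
```

This toy loss is convex and the endpoint is its minimizer, so the slice has no barrier by construction. The surprise discussed in this thread is that real, non-convex networks were reported to look similar.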

The original conclusion was:
"The reason for the success of SGD on a wide variety of tasks is now clear: these tasks are relatively easy to optimize."

You probably feel this itch too... Don't worry, later literature contested this.

@jefrankle continued this work with modern settings (2020) arxiv.org/pdf/2012.06898…
He replicated the main results. But:

1⃣ On larger datasets the loss improved only close to the endpoint, so optimization is more like looking for a hole in a flat field than going down a (convex) mountain.

2⃣ The loss space is not simple.
The straight line from init to other points in space (intermediate training iterations) goes through barriers (see the hills in the figure).
So it is not that all optimization is convex and simple. The smooth descent is somehow a trait of the converged point.
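
To look at intermediate points rather than only the endpoint, the same measurement can be repeated from init to each saved checkpoint. A rough sketch under loud assumptions: the Rosenbrock function stands in for a non-convex loss, plain gradient descent stands in for SGD, and the names, step size and checkpoint spacing are all hypothetical. It only illustrates the measurement and does not reproduce the paper's setup or results.

```python
import numpy as np

def loss(theta):
    # Rosenbrock function: a curved valley, used only as a hypothetical non-convex toy loss.
    x, y = theta
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

def grad(theta):
    # Analytic gradient of the Rosenbrock function.
    x, y = theta
    return np.array([
        -2.0 * (1.0 - x) - 400.0 * x * (y - x ** 2),
        200.0 * (y - x ** 2),
    ])

def barrier_height(losses):
    """Largest climb above the lowest loss seen so far along the path."""
    running_min = np.minimum.accumulate(losses)
    return float(np.max(losses - running_min))

def slice_barrier(theta_a, theta_b, num_points=51):
    """Barrier height along the straight line from theta_a to theta_b."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss((1.0 - a) * theta_a + a * theta_b) for a in alphas])
    return barrier_height(losses)

theta_init = np.array([-1.5, 2.0])
theta = theta_init.copy()
checkpoints = []
for step in range(1, 20001):
    theta = theta - 5e-4 * grad(theta)   # plain gradient descent step
    if step % 2000 == 0:
        checkpoints.append(theta.copy())  # save an intermediate "iteration"

# One straight-line slice per checkpoint, always starting from the same init.
barriers = [slice_barrier(theta_init, ckpt) for ckpt in checkpoints]
for i, b in enumerate(barriers, start=1):
    print(f"init -> checkpoint {i}: barrier {b:.3f}")
```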

@james_r_lucas @juhan_bae @michaelrzhang @stanislavfort Richard Zemel @RogerGrosse replicated it

But they also found when it does NOT work:
large learning rate, batch-norm, Adam

At least two of these may result from just going far away before converging
proceedings.mlr.press/v139/lucas21a.…

This graph is so pretty, I couldn't resist adding it too... (speaks for itself)

The latest word on this topic just came from Tiffany Vlaar and @jefrankle.
They share the mixed feelings about the phenomenon.
Specifically, they show that test results and the loss slice from init to the converged point are not correlated.
arxiv.org/pdf/2106.16004…

First, they show that pretraining reduces the barriers (intuitive, true, but someone needed to show it) and

bad init adds such barriers:
Intuitively, to converge one needs to move out of bad local minima, not only to descend into a minimum

Making the training data more complex adds barriers and (obviously) improves test score.
(my) possible explanation: complex training data forces exiting the initial local minima (which are not minima for every subset of the data = batch).

Weight decay (in addition to the learning rate and Adam mentioned above) also controls the loss path seen in interpolation.

The authors deduce that this path should not be read as telling you what happened to the model, as here again more barriers mean higher test scores.

I am probably missing something, but IIUC it means that they are connected. If you passed barriers before convergence, it might just be that you got out of some local minima and reached a lower point.
(yes, one can devise a bad optimizer that just moves randomly, and it will pass barriers)

Of course, if you start from a point that leads to fewer barriers with the same optimization, that IS a good thing (it didn't bother going away, you are already near a minimum)

Last, the authors say their results are negative (as far as I can see they are just analysis work, which in this uncharted territory is great) and that they share them because of the "widespread use". What do people use this for? Is it actually well known?

To sum up, I feel there is a lot to say about the meaning of all of this, fascinating! What do pretraining, bad init, decay, momentum and so on have to do with the loss path? I believe those results might click with different points of view, so I will leave mine out of it.

Oh, they may hint at it at one point near the end: "a more difficult optimization trajectory can lead to improved generalization"

Oh, a last note, none of it was really tested on #NLProc ... So we assume it is all just the same.

If you got this far, you might like this as well

https://twitter.com/LChoshen/status/1491375423120613378
