Leshem Choshen
Mar 22 · 27 tweets · 11 min read
On the generalization of different networks

Main finding: generalization in pretraining follows a single dimension.
Different networks, with different architectures, seeds, and sizes, but:
similar performance → similar linguistic capabilities

Accepted to @aclmeeting (#NLProc)

Summary & story 🧵
It all began in a discussion of the fascinating work by
C. Zhang, S. Bengio, @mrtz, @beenwrekt, @OriolVinyalsML:
arxiv.org/abs/1611.03530

More about their work
We wondered:
Why do networks learn but hardly overfit?
Why doesn't overfitting the training set hurt test performance?

It means VC dimension is a really bad way to think about learning
During the discussion I asked:
What if networks learn ~gradually?
They pick rules that explain as many of the seen examples as possible,

then explain the unexplained,
and so on.
In that case, at some point
the rules would turn into memorization,

but there would be little overfitting, as only what can't be explained would be memorized.

@GHacohen took the challenge and we went to investigate.
In VC terms:
all hypotheses are learnable by networks, but when two hypotheses explain all the data,
one is preferred, i.e., considered to explain it better.
Before we continue:
another related answer that has come up since is the long-tail phenomenon:
Memorization is generalization.

@vitalyFM, Chiyuan Zhang
arxiv.org/abs/2008.03703
arxiv.org/abs/1906.05271
Networks memorize a lot; they guess many rules. If a rule is correct, it is a generalization; otherwise, we have memorized one training example, which is not too bad, since it won't reappear at test time.
And to generalize from few examples, this is the only way to go.
This has various practical implications too, e.g., for hallucinations.
So @GHacohen and I (with our supervisors Daphna Weinshall and @AbendOmri) went to check.
Do all networks learn the same things?

They definitely have different weights.
But do they compute the same functions?

If so,
is it done in the same ORDER?
Apparently, classifiers learn in the same order.
They make the same mistakes.
They are correct on the same examples.

That is,
if they have the same accuracy.

@GHacohen, me, and Prof. Weinshall
proceedings.mlr.press/v119/hacohen20…
If a network has learnt more than another, then it is mostly correct on the SAME things,
but on more too.

So learning is gradual!
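To make the containment claim concrete, here is a toy sketch (my own illustration, not the paper's code or data) that checks how much of a weaker model's correct set a stronger model also covers:

```python
import numpy as np

def containment(correct_weak: np.ndarray, correct_strong: np.ndarray) -> float:
    """Fraction of the weaker model's correct examples that the stronger
    model also gets right (1.0 = it knows the same things, and more)."""
    weak, strong = correct_weak.astype(bool), correct_strong.astype(bool)
    return float((weak & strong).sum() / max(weak.sum(), 1))

# Hypothetical 0/1 correctness masks over 10 test examples.
weak_model = np.array([1, 1, 0, 0, 1, 0, 0, 1, 0, 0])
strong_model = np.array([1, 1, 1, 0, 1, 1, 0, 1, 0, 1])

print(containment(weak_model, strong_model))  # 1.0 -> the gradual-learning pattern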
That line of work also shows that what is learnt first can be predicted (specifically, by principal components):
arxiv.org/pdf/2105.05553…
In our current work, we change two things:
1: language models rather than classifiers (#MachineLearning -> #NLProc)
2: linguistic generalizations, not specific examples
For that, we used the BLiMP dataset by
@a_stadt @AliciaVParrish @liu_haokun @anhadmy @WeiPengPITT @shengfu_wang @sleepinyourhat
arxiv.org/abs/1912.00582
Its idea is simple and brilliant:
take minimal pairs of sentences (figure),
change only one phenomenon between them,
and check which sentence gets a higher probability from the language model.
With BLiMP we get 67 groups of sentence pairs.
Each group shares the grammatical phenomenon that is changed (e.g., the subject "she" requires a singular verb, and the pair switches it from singular to plural).

If a model consistently prefers the right sentence, it has learnt this generalization.
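For concreteness, here is a minimal sketch of that probability comparison, assuming a Hugging Face causal LM (GPT-2 as a stand-in) and an invented agreement pair rather than an actual BLiMP item:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability the LM assigns to the sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns the mean cross-entropy over the
        # predicted tokens; undo the averaging to get a summed log-prob.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

good = "The cats that she feeds are hungry."  # grammatical
bad = "The cats that she feeds is hungry."    # agreement violation

# The model "passes" the pair if the grammatical sentence is more probable.
print(sentence_logprob(good) > sentence_logprob(bad))
```

Averaging this binary decision over all pairs in a group gives a per-phenomenon accuracy, which is the kind of signal tracked next.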
We find that LMs not only find the same generalizations hard,
but they also acquire them in the same order (during pretraining).

Changing the data (yellow) changes the generalizations only at the beginning
(thus data might affect fine-tuning generalizations, but pretraining is long...)
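One way to quantify "same order" (a sketch under my own assumptions, not necessarily the paper's exact metric): record the first checkpoint at which each phenomenon crosses an accuracy threshold, then rank-correlate those learning times across runs.

```python
import numpy as np
from scipy.stats import spearmanr

def first_learned(acc, threshold=0.8):
    """acc: (num_phenomena, num_checkpoints) accuracy matrix.
    Returns the first checkpoint index above threshold for each phenomenon
    (num_checkpoints if it is never learned)."""
    learned = acc >= threshold
    steps = learned.argmax(axis=1)
    steps[~learned.any(axis=1)] = acc.shape[1]
    return steps

# Toy curves: 4 phenomena over 5 checkpoints, for two pretraining runs.
run_a = np.array([[.5, .9, .9, .9, .9],
                  [.5, .5, .9, .9, .9],
                  [.5, .5, .5, .85, .9],
                  [.5, .5, .5, .5, .6]])
run_b = np.array([[.6, .85, .9, .9, .9],
                  [.5, .6, .9, .9, .9],
                  [.5, .5, .6, .9, .9],
                  [.5, .5, .5, .5, .55]])

rho, _ = spearmanr(first_learned(run_a), first_learned(run_b))
print(rho)  # ~1.0 when the runs acquire the phenomena in the same order
```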
The order is the same across different architectures too.

The performance of a network determines its generalizations. In other words,
better networks exist, but they are bound to learn along the same dimension, only more of it.
We also analyze some of the things that are learnt.
For one, they do NOT follow known linguistic categories (other than morphology).
So the model doesn't acquire quantifiers or syntax as coherent groups, but something else we don't really understand.
As with humans, we need some theory to predict what would be learnt, to be able to research it effectively.
But this is future work.
So what practical implications does it already have?
One is that we can study what LMs learn during training.
Before this, we didn't know whether understanding BERT would be relevant to understanding an LSTM or even RoBERTa.

Now we do: it is.
Go understand learning trajectories!
And such a work was indeed just done:
@ZEYULIU10 @yizhongwyz @wittgen_ball @HannaHajishirzi @nlpnoah show that

early in training, information for linguistic classification is already present in the model parameters.

arxiv.org/abs/2104.07885
P.S. Meet me at ACL!
Looking for a post in 2023...
And I forgot the link to our paper, embarrassing:
arxiv.org/abs/2109.06096
