Main finding: Generalization in pretraining follows a single dimension
Different networks, architectures, seeds, and sizes, but:
similar performance → similar linguistic capabilities.
Networks memorize a lot: they guess many rules. If a rule is correct, it is a generalization; otherwise, we just memorized one training example, which is not too bad, as it won't reappear at test time.
And to generalize from few examples, this is the only way to go.
This has various practical implications too, e.g. for hallucinations.
If one network has learnt more than another, it is mostly correct on the SAME things,
just on more of them too.
So learning is gradual!
They also show that what is learnt first can be predicted (specifically from principal components) arxiv.org/pdf/2105.05553…
In our current work, we change two things
1: language models rather than classifiers (#MachineLearning -> #NLProc)
2: linguistic generalizations, not specific examples
Take minimal pairs of sentences (figure):
change only one phenomenon between them,
and check which one the language model assigns a higher probability.
With BLiMP we get 67 groups of sentence pairs.
Each group shares the grammatical phenomenon that is changed (e.g. "she" requires a singular verb, so the pair contrasts singular vs. plural agreement).
If a model consistently prefers the right one, it has learnt this generalization.
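A minimal sketch of this kind of minimal-pair scoring, assuming a causal LM from Hugging Face transformers (GPT-2 used here only as a stand-in; the sentences are made-up examples):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def sentence_logprob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    # The returned loss is the mean negative log-likelihood per predicted token;
    # multiply by the number of predicted tokens to get a total log-probability.
    loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

good = "She walks to school."
bad = "She walk to school."
# The model "prefers" the grammatical sentence if it assigns it a higher log-prob.
print(sentence_logprob(good) > sentence_logprob(bad))
```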
We find that LMs not only find the same generalizations hard,
but they also learn them in the same order (during pretraining).
Changing the data (yellow) changes the generalizations only at the beginning
(thus data might affect fine-tuning generalizations, but pretraining is long...).
The order is the same across different architectures too.
The performance of a network determines its generalizations. In other words,
there exist better networks, but they are bound to learn along the same dimension, only more of it.
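One way to quantify "same order" is to rank the phenomena by when each model first gets them right and correlate the rankings. A sketch under assumed inputs (the accuracy arrays and the 0.75 threshold are made up for illustration):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-phenomenon accuracies for two independently trained LMs,
# shape: (num_checkpoints, num_phenomena). Random numbers stand in for real evals.
acc_model_a = np.random.rand(20, 67)
acc_model_b = np.random.rand(20, 67)

def acquisition_step(acc, threshold=0.75):
    """First checkpoint at which each phenomenon crosses the accuracy threshold."""
    learnt = acc >= threshold
    step = learnt.argmax(axis=0)           # index of the first True per phenomenon
    step[~learnt.any(axis=0)] = acc.shape[0]  # never-learnt phenomena go last
    return step

rho, p = spearmanr(acquisition_step(acc_model_a), acquisition_step(acc_model_b))
print(f"rank correlation of learning order: {rho:.2f} (p={p:.3f})")
```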
We also analyze some of the things that are learnt.
For one, they do NOT follow known linguistic categories (other than morphology).
So a model doesn't learn quantifiers or syntax as such, but something else we don't really understand yet.
As in humans, we need a theory that predicts what will be learnt in order to research this effectively.
But that is future work.
So what practical implications does it already have?
One is that we can study what LMs learn during training.
Before, we didn't know whether understanding BERT would be relevant to understanding an LSTM, or even RoBERTa.
Now we do: it is.
Go understand learning trajectories!
It all started at ICLR 2015(!) @goodfellow_ian @OriolVinyalsML @SaxeLab
They checked points on the line between the random initialization and the converged model.
They found that the loss along it decreases monotonically.
Why shouldn't it?
Well... the real question is why it should.
If the loss terrain is anything but a slope, we would expect bumps. Maybe there are different sinks (local minima), or you need to pass through a bad model before reaching the best one (topologically, you are in a ditch).
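A minimal sketch of that interpolation experiment in PyTorch, assuming a classifier with (x, y) batches and a standard loss function (names and the number of steps are my own choices):

```python
import copy
import torch

@torch.no_grad()
def interpolation_losses(model_init, model_final, loss_fn, data_loader, steps=21):
    """Evaluate the loss at evenly spaced points on the straight line
    between the initial weights and the converged weights."""
    init_params = dict(model_init.named_parameters())
    final_params = dict(model_final.named_parameters())
    probe = copy.deepcopy(model_final)
    losses = []
    for i in range(steps):
        alpha = i / (steps - 1)
        # Set the probe's weights to the interpolated point (1-alpha)*init + alpha*final.
        for name, p in probe.named_parameters():
            p.copy_((1 - alpha) * init_params[name] + alpha * final_params[name])
        total, n = 0.0, 0
        for x, y in data_loader:
            total += loss_fn(probe(x), y).item() * len(y)
            n += len(y)
        losses.append(total / n)
    return losses  # monotone decrease = no bumps along the line
```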
Model combination/ensembling:
Average ensembling is practical, but naive.
Combining with each network's strengths in mind works much better!
Moreover, let's make the networks diverse so that they have different strengths.
The basic idea is quite simple:
given some models, why would we want the average? We want to rely on each one (or group) when it is more likely to be the correct one (see the sketch below).
This was actually introduced in our previous work (as the authors acknowledge): aclanthology.org/W19-4414.pdf
The paper's additions: 1. Given a set of black-box models, we can train at least one of them to be different from the rest with RL. 2. We can use more sophisticated NNs to combine the outputs. 3. We can ignore domain knowledge for the combination (I am not sure this is a bonus).
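To make the contrast concrete, here is a sketch (my own simplification, not the paper's architecture) of a per-example gating combiner next to plain averaging: a small network looks at the input features and decides how much to trust each black-box model.

```python
import torch
import torch.nn as nn

class GatedCombiner(nn.Module):
    """Weight each model's output distribution per example, instead of averaging."""
    def __init__(self, feature_dim: int, num_models: int):
        super().__init__()
        self.gate = nn.Linear(feature_dim, num_models)

    def forward(self, features, model_probs):
        # features: (batch, feature_dim); model_probs: (batch, num_models, num_classes)
        weights = torch.softmax(self.gate(features), dim=-1)    # per-example trust in each model
        return (weights.unsqueeze(-1) * model_probs).sum(dim=1)  # weighted mixture of distributions

def average_ensemble(model_probs):
    """The naive baseline: a uniform average over models."""
    return model_probs.mean(dim=1)
```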
Ever since MAEGE (aclanthology.org/P18-1127/) I have had a soft spot for evaluation of evaluation = EoE (especially when it is automatic, but without that it's still ok).
Capturing formality: XLM-R with regression, not classification.
Preservation: chrF, not BLEU (see the snippet below).
Fluency: XLM-R, but there is room for improvement.
System ranking: XLM-R and chrF.
Crosslingual transfer: rely on zero-shot, not machine translation.
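For the preservation point, scoring with chrF instead of BLEU is a one-liner with sacrebleu; a minimal sketch with made-up sentences:

```python
from sacrebleu.metrics import BLEU, CHRF

# Hypothetical system output and reference for a formality-transfer example.
hyps = ["Could you please send me the report?"]
refs = [["Please send me the report."]]

print("chrF:", CHRF().corpus_score(hyps, refs).score)
print("BLEU:", BLEU().corpus_score(hyps, refs).score)
```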