Main finding: Generalization in pretraining follows a single dimension
Different networks, architectures, seeds, and sizes, but:
similar performance → similar linguistic capabilities.
Networks memorize a lot: they guess many rules. If a rule is correct, it is a generalization; otherwise, we just memorized one training example, which is not too bad, as it won't reappear at test time.
And to generalize from few examples, this is the only way to go.
This has various practical implications too, e.g. for hallucinations.
If one network has learnt more than another, it is mostly correct on the SAME things,
just on more of them too.
So learning is gradual!
They also show that what is learnt first can be predicted (specifically from principal components) arxiv.org/pdf/2105.05553…
In our current work, we change two things
1: language models rather than classifiers (#MachineLearning -> #NLProc)
2: linguistic generalizations, not specific examples
Take minimal pairs of sentences (figure):
change only one phenomenon between them,
and check which one the language model assigns a higher probability.
With BLiMP we get 67 groups of sentence pairs.
Each group shares the grammatical phenomenon that is changed (e.g. "she" requires a singular verb, so the pair contrasts singular vs. plural agreement).
If a model consistently prefers the right one, it has learnt this generalization.
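A minimal sketch of this kind of minimal-pair scoring, assuming a causal LM from Hugging Face transformers (GPT-2 used here only as a stand-in; the sentences are made-up examples):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def sentence_logprob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    # The returned loss is the mean negative log-likelihood per predicted token;
    # multiply by the number of predicted tokens to get a total log-probability.
    loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

good = "She walks to school."
bad = "She walk to school."
# The model "prefers" the grammatical sentence if it assigns it a higher log-prob.
print(sentence_logprob(good) > sentence_logprob(bad))
```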
We find that LMs not only find the same generalizations hard,
but they also learn them in the same order (during pretraining).
Changing the data (yellow) changes the generalizations only at the beginning
(thus data might affect fine-tuning generalizations, but pretraining is long...).
The order is the same across different architectures too.
The performance of a network determines its generalizations. In other words,
there exist better networks, but they are bound to learn along the same dimension, only more of it.
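One way to quantify "same order" is to rank the phenomena by when each model first gets them right and correlate the rankings. A sketch under assumed inputs (the accuracy arrays and the 0.75 threshold are made up for illustration):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-phenomenon accuracies for two independently trained LMs,
# shape: (num_checkpoints, num_phenomena). Random numbers stand in for real evals.
acc_model_a = np.random.rand(20, 67)
acc_model_b = np.random.rand(20, 67)

def acquisition_step(acc, threshold=0.75):
    """First checkpoint at which each phenomenon crosses the accuracy threshold."""
    learnt = acc >= threshold
    step = learnt.argmax(axis=0)           # index of the first True per phenomenon
    step[~learnt.any(axis=0)] = acc.shape[0]  # never-learnt phenomena go last
    return step

rho, p = spearmanr(acquisition_step(acc_model_a), acquisition_step(acc_model_b))
print(f"rank correlation of learning order: {rho:.2f} (p={p:.3f})")
```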
We also analyze some of the things that are learnt.
For one, they do NOT follow known linguistic categories (other than morphology).
So a model doesn't learn quantifiers or syntax as such, but something else we don't really understand yet.
As in humans, we need a theory that predicts what will be learnt in order to research this effectively.
But that is future work.
So what practical implications does it already have?
One is that we can study what LMs learn during training.
Before, we didn't know whether understanding BERT would be relevant to understanding an LSTM, or even RoBERTa.
Now we do: it is.
Go understand learning trajectories!
It all started at ICLR 2015(!) @goodfellow_ian @OriolVinyalsML @SaxeLab
They checked points on the line between the random initialization and the converged model.
They found that the loss along it decreases monotonically.
Why shouldn't it?
Well... the real question is why it should.
If the loss terrain is anything but a slope, we would expect bumps. Maybe there are different sinks (local minima), or you need to pass through a bad model before reaching the best one (topologically, you are in a ditch).
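A minimal sketch of that interpolation experiment in PyTorch, assuming a classifier with (x, y) batches and a standard loss function (names and the number of steps are my own choices):

```python
import copy
import torch

@torch.no_grad()
def interpolation_losses(model_init, model_final, loss_fn, data_loader, steps=21):
    """Evaluate the loss at evenly spaced points on the straight line
    between the initial weights and the converged weights."""
    init_params = dict(model_init.named_parameters())
    final_params = dict(model_final.named_parameters())
    probe = copy.deepcopy(model_final)
    losses = []
    for i in range(steps):
        alpha = i / (steps - 1)
        # Set the probe's weights to the interpolated point (1-alpha)*init + alpha*final.
        for name, p in probe.named_parameters():
            p.copy_((1 - alpha) * init_params[name] + alpha * final_params[name])
        total, n = 0.0, 0
        for x, y in data_loader:
            total += loss_fn(probe(x), y).item() * len(y)
            n += len(y)
        losses.append(total / n)
    return losses  # monotone decrease = no bumps along the line
```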
Model combination/ensembling:
Average ensembling is practical, but naive.
Combining with each network's strengths in mind works much better!
Moreover, let's make the networks diverse so that they have different strengths.
The basic idea is quite simple:
given some models, why would we want the average? We want to rely on each one (or group) when it is more likely to be the correct one (see the sketch below).
This was actually introduced in our previous work (as the authors acknowledge): aclanthology.org/W19-4414.pdf
The paper's additions: 1. Given a set of black-box models, we can train at least one of them to be different from the rest with RL. 2. We can use more sophisticated NNs to combine the outputs. 3. We can ignore domain knowledge for the combination (I am not sure this is a bonus).
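To make the contrast concrete, here is a sketch (my own simplification, not the paper's architecture) of a per-example gating combiner next to plain averaging: a small network looks at the input features and decides how much to trust each black-box model.

```python
import torch
import torch.nn as nn

class GatedCombiner(nn.Module):
    """Weight each model's output distribution per example, instead of averaging."""
    def __init__(self, feature_dim: int, num_models: int):
        super().__init__()
        self.gate = nn.Linear(feature_dim, num_models)

    def forward(self, features, model_probs):
        # features: (batch, feature_dim); model_probs: (batch, num_models, num_classes)
        weights = torch.softmax(self.gate(features), dim=-1)    # per-example trust in each model
        return (weights.unsqueeze(-1) * model_probs).sum(dim=1)  # weighted mixture of distributions

def average_ensemble(model_probs):
    """The naive baseline: a uniform average over models."""
    return model_probs.mean(dim=1)
```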
Ever since MAEGE (aclanthology.org/P18-1127/) I have had a soft spot for evaluation of evaluation = EoE (especially when it is automatic, but without that it's still ok).
Capturing formality: XLM-R with regression, not classification.
Preservation: chrF, not BLEU (see the snippet below).
Fluency: XLM-R, but there is room for improvement.
System ranking: XLM-R and chrF.
Crosslingual transfer: rely on zero-shot, not machine translation.
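For the preservation point, scoring with chrF instead of BLEU is a one-liner with sacrebleu; a minimal sketch with made-up sentences:

```python
from sacrebleu.metrics import BLEU, CHRF

# Hypothetical system output and reference for a formality-transfer example.
hyps = ["Could you please send me the report?"]
refs = [["Please send me the report."]]

print("chrF:", CHRF().corpus_score(hyps, refs).score)
print("BLEU:", BLEU().corpus_score(hyps, refs).score)
```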