The setting is quite simple:
Take a smaller bilingual LM (8B, vs. 100-500B in some baselines)
trained on two languages (one of which might be low resource, see fig)
Show it a few translation examples in the prompt
Say abra kadabra 🪄
and you get a very good translation system
They reproduce known results and detail how to do it (especially for low-resource languages)
e.g., continue from the previous training for speed/stability, run many epochs on the monolingual data (fig)
etc.
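To make the setup concrete, here is a minimal sketch of what such a few-shot translation prompt could look like; the language pair, example sentences, and the bilingual_lm.generate() call are my own placeholders, not the paper's exact format:

```python
# A sketch of a few-shot translation prompt; the language pair, example
# sentences, and bilingual_lm.generate() are illustrative placeholders,
# not the paper's exact format.
examples = [
    ("English: The cat sleeps.", "German: Die Katze schläft."),
    ("English: It is raining today.", "German: Heute regnet es."),
]

def build_prompt(source_sentence, shots):
    """Concatenate a few translation pairs, then the sentence to translate."""
    lines = []
    for src, tgt in shots:
        lines.append(src)
        lines.append(tgt)
    lines.append(f"English: {source_sentence}")
    lines.append("German:")  # the model continues from here with its translation
    return "\n".join(lines)

prompt = build_prompt("Where is the train station?", examples)
# translation = bilingual_lm.generate(prompt)  # hypothetical model call
```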
Also, they replicate the result that the quality of the few-shot examples matters, and add that
matching their style to the required output matters too (matched vs. mismatched, bottom row in fig)
I was left wondering: if this is so good, wouldn't adding supervised training, back-translation and everything else make it much better still? If not, then this is really a trade-off statement about how hard it is to improve further. If it would, then maybe the extra effort is worth it for some.
• • •
The story is very simple
Autoregressive models predict the next word given the previous ones, which is annoying and - with a strong model - slow
Instead, they propose to use a fast model to predict the next few words
Then check, for all of those words at once, whether the strong model agrees with them
For a bit more detail:
q - strong model
p - poor model
p generated x1 .. xn words
q then calculates their probabilities (did I say in parallel?)
We accept each of them if q gives it high enough probability (eq in fig)
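A minimal sketch of that accept/reject step in the thread's notation (q = strong model, p = poor/draft model). The acceptance rule below is the standard speculative-sampling one; the exact equation is in the paper's figure, and the arrays here are placeholders:

```python
import numpy as np

def speculative_accept(q_probs, p_probs, draft_tokens, rng=None):
    """Accept/reject the draft tokens x1..xn (thread's notation: q strong, p poor).

    q_probs[i] / p_probs[i] are each model's full next-token distributions at
    position i (computed by q in one parallel pass); draft_tokens[i] is the
    token p sampled there. This is a sketch of the standard speculative
    sampling rule, not the paper's exact code.
    """
    rng = rng or np.random.default_rng()
    out = []
    for i, x in enumerate(draft_tokens):
        q_x, p_x = q_probs[i][x], p_probs[i][x]
        # Accept with probability min(1, q(x) / p(x)): tokens the strong model
        # likes at least as much as the draft did are always kept.
        if rng.random() < min(1.0, q_x / p_x):
            out.append(x)
        else:
            # On rejection, resample from the renormalized residual (q - p)+,
            # which keeps the overall output distributed exactly like q.
            residual = np.maximum(q_probs[i] - p_probs[i], 0.0)
            out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            break  # everything after the first rejection is thrown away
    return out
```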
What is this Chain of Thought everyone is talking about?
Just asking the model to explain the answer, that's it.
It is a big deal because sometimes explaining before answering improves the answer
But not here (in the fig: rationale = explaining before the answer, explanation = after)
To get better results in the multimodal task, they...
Wait for it...
Use both modalities
Inserting the image representation too improves the rationale
but also the answers
(more on how in fig, actual arch in paper)
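A rough sketch of that two-stage idea, feeding the image representation into both stages; `model`, encode_image() and generate() are hypothetical stand-ins, not the paper's architecture (which fuses the vision features inside the network):

```python
# Two stages: generate a rationale first, then answer with the rationale in
# context, with the image representation available in both stages.
def multimodal_cot(model, question, image):
    image_features = model.encode_image(image)  # hypothetical vision encoder
    rationale = model.generate(
        text=f"Question: {question}\nRationale:",
        image=image_features,
    )
    answer = model.generate(
        text=f"Question: {question}\nRationale: {rationale}\nAnswer:",
        image=image_features,
    )
    return rationale, answer
```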
How to do multitask learning simply by uploading models?
Collaborative Descent (ColD) Fusion is simple:
Start from a pretrained model
Let contributors finetune on it, and share their models
Fuse the models to get a new better model
Take the improved model as the new best model
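A minimal sketch of one such round, assuming the fuse step is plain parameter averaging; the finetune() call and task list are hypothetical, see the paper for the exact recipe:

```python
import copy
import torch

def fuse(models):
    """Average the contributors' finetuned weights into a single fused model.

    Assumes all models share the same architecture; plain parameter averaging
    is used as the fuse step here (a sketch, not necessarily the exact recipe).
    """
    fused = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in fused.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
            param.copy_(stacked.mean(dim=0))
    return fused

# One round of the loop described above (finetune() and tasks are hypothetical):
# best = pretrained_model
# for _ in range(num_rounds):
#     contributions = [finetune(copy.deepcopy(best), task) for task in tasks]
#     best = fuse(contributions)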
Main finding: Generalization in pretraining follows a single dimension
Different networks, architectures, seeds, sizes but:
Similar performance → similar linguistic capabilities
It all started at ICLR 2015(!) @goodfellow_ian @OriolVinyalsML @SaxeLab
They checked points on the straight line between the random initialization and the converged model.
They found that, along this line, the loss decreases monotonically.
Why shouldn't it?
Well... The real question is why should it.
If the loss terrain were anything but a slope, we would expect bumps. Maybe there are different sinks (local minima), or maybe you need to pass through a bad model before you reach the best one (topologically, you are in a ditch).
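A minimal sketch of that experiment: interpolate linearly between the two weight vectors and evaluate the loss at each point (loss_fn and the flattened parameter vectors theta_init / theta_final are placeholders):

```python
import numpy as np

def loss_along_line(loss_fn, theta_init, theta_final, num_points=25):
    """Evaluate the loss on the straight line from the random initialization
    to the converged weights (loss_fn and the flattened parameter vectors
    theta_init / theta_final are placeholders)."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = []
    for alpha in alphas:
        theta = (1 - alpha) * theta_init + alpha * theta_final  # interpolated weights
        losses.append(loss_fn(theta))
    return alphas, np.array(losses)

# The surprise: plotting losses against alphas typically gives a smooth,
# monotonically decreasing curve - no bumps or barriers along the way.
```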