Few-shot learning almost reaches traditional machine translation

Xavier Garcia @whybansal @ColinCherry George Foster, Maxim Krikun @fengfangxiaoyu @melvinjohnsonp @orf_bnw
arxiv.org/abs/2302.01398
#enough2skim #NLProc #neuralEmpty
The setting is quite simple:
Take a smaller (8B vs. 100-500B in some baselines)
bilingual LM in two languages (one may be low-resource, see fig)
Show it a few translation examples in the prompt
Say abra kadabra 🪄
and you get a very good translation system
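To make the setup concrete, here is a minimal sketch of what such a few-shot prompt could look like. The language pair, the example sentences, and the `generate` call are made up for illustration; the paper's exact format may differ.

```python
# Hypothetical few-shot translation prompt for a bilingual LM.
# Example pairs and the generate() call are illustrative, not the paper's exact setup.

few_shot_pairs = [
    ("The cat sleeps on the sofa.", "Le chat dort sur le canapé."),
    ("I would like a cup of coffee.", "Je voudrais une tasse de café."),
]

def build_prompt(source_sentence: str) -> str:
    """Concatenate a few English->French examples, then append the new source sentence."""
    blocks = [f"English: {src}\nFrench: {tgt}" for src, tgt in few_shot_pairs]
    blocks.append(f"English: {source_sentence}\nFrench:")
    return "\n\n".join(blocks)

prompt = build_prompt("Where is the train station?")
# translation = bilingual_lm.generate(prompt)  # assumed LM interface, continues the prompt
```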
They reproduce known results and detail how to do it (especially for low-resource languages)
e.g., continue from previous training for speed/stability, and run many epochs on the monolingual data (fig),
etc.
Also, they replicate the result that the quality of the few-shot examples matters, and state that
matching their style to the required output matters as well (matched vs. mismatched, bottom row in fig)
I was left wondering: if this is so good, wouldn't adding supervised training, back-translation, and everything else make it much better still? If not, then this is just a trade-off showing how hard it is to improve further; if it would, then maybe the extra effort is still worth it for some.


More from @LChoshen

Feb 7
Parallel generation from autoregressive LMs
Para ll el!
Well, not exactly: use a fast LM to propose the next words first
arxiv.org/abs/2302.01318
@sebastiangoldt @advani_madhu @SaxeLab @KrzakalaF @zdeborova
@DeepMind
The story is very simple
Autoregressive models predict the next word given the previous ones, which is annoying and, with a strong model, slow
Instead, they propose to use a fast model to predict the next words
Then check, for all of those words at once, whether the strong model agrees with them
For a bit more detail:
q - strong model
p - poor (fast) model
p generates x1 .. xn words
q then calculates their probabilities (did I say in parallel?)
We accept each word if q gives it high enough probability (eq in fig)
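For a sense of the mechanism, here is a sketch of that acceptance step in the thread's notation (q = strong, p = fast). The accept-with-probability min(1, q/p) rule is the general speculative-sampling recipe; everything else here is a simplified stand-in, not the paper's code.

```python
import numpy as np

def accept_draft_tokens(draft_tokens, p_probs, q_probs, rng=None):
    """Sketch of the acceptance step (thread's notation: q = strong model, p = fast model).

    draft_tokens: tokens x1..xn proposed by the fast model
    p_probs[i], q_probs[i]: each model's probability for draft_tokens[i]
    Returns the prefix of draft tokens that the strong model "agrees" with.
    """
    rng = rng or np.random.default_rng()
    accepted = []
    for tok, p_x, q_x in zip(draft_tokens, p_probs, q_probs):
        # Accept with probability min(1, q(x)/p(x)): tokens the strong model likes
        # at least as much as the fast model are always kept.
        if rng.uniform() < min(1.0, q_x / p_x):
            accepted.append(tok)
        else:
            # On rejection, the method resamples the next token from a corrected
            # distribution proportional to max(0, q - p) (omitted in this sketch).
            break
    return accepted

# Example: the strong model keeps the first two draft tokens, almost surely rejects the third.
print(accept_draft_tokens(["the", "cat", "sat"], [0.5, 0.4, 0.9], [0.6, 0.5, 0.01]))
```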
Feb 7
Chain of Thought for vision
beating GPT-3 by 16%, and supposedly humans

Text and captions are not enough, but
with vision CoT does really well

@zhangzhuosheng @astonzhangAZ @mli65 Hai Zhao @karypis @smolix
arxiv.org/abs/2302.00923
What is this Chain of Thought everyone is talking about?
Just asking the model to explain the answer, that's it.
It is a big deal because sometimes explaining before answering improves the answer
But not here (fig: rationale = before and explain = after)
To get better results on the multimodal task, they...
Wait for it...
Use both modalities
Inserting the image representation too improves the rationale
and also the answers
(more on how in the fig, actual architecture in the paper)
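Schematically, the two-stage rationale-then-answer flow looks roughly like the sketch below. The model objects and their `generate(text=..., vision=...)` interface are hypothetical placeholders of mine; the real architecture, which fuses the image features inside the model, is in the paper.

```python
# Schematic two-stage rationale-then-answer flow. rationale_model / answer_model
# and their generate() signature are assumed placeholders, not the paper's code.

def multimodal_cot(question: str, context: str, image_features,
                   rationale_model, answer_model) -> tuple[str, str]:
    # Stage 1: generate a rationale conditioned on the text AND the image features.
    rationale = rationale_model.generate(
        text=f"{context}\n{question}", vision=image_features)
    # Stage 2: generate the answer conditioned on text, image features, and the rationale.
    answer = answer_model.generate(
        text=f"{context}\n{question}\nRationale: {rationale}", vision=image_features)
    return rationale, answer
```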
Dec 5, 2022
We want to pretrain🤞
Instead we finetune🚮😔
Could we collaborate?🤗

ColD Fusion:
🔄Recycle finetuning to multitask
➡️evolve pretrained models forever

On 35 datasets
+2% improvement over RoBERTa
+7% in few-shot settings
🧵

#NLProc #MachineLearning #NLP #ML #modelRecycling
We all wish to improve pretraining
If only we had unlimited compute and data...
Together, we do!

We propose a way to recycle finetuning
and transform it into multitask learning!

arxiv.org/abs/2212.01378

@Shachar_Don @VenezianElad @colinraffel @noamslonim @YoavKatz73 me
How can you get multitask learning simply by uploading models?

Collaborative Descent (ColD) Fusion is simple:
Start from a pretrained model
Let contributors finetune on it, and share their models
Fuse the models to get a new, better model
Take the improved model as the new best model, and repeat
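A minimal sketch of one such round, with fusion written as a plain weight average (an assumption on my part; the `finetune` helper and `tasks` list are hypothetical):

```python
import copy
import torch

def fuse_models(models):
    """One 'fusion' step, sketched as a simple parameter average of the
    contributed finetuned models (the paper's recipe may be more refined)."""
    fused = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in fused.named_parameters():
            stacked = torch.stack(
                [dict(m.named_parameters())[name].detach() for m in models])
            param.copy_(stacked.mean(dim=0))
    return fused

# Iterated ColD-Fusion-style loop (pseudocode; finetune() and tasks are assumed):
# base = pretrained_model
# for _ in range(num_rounds):
#     contributed = [finetune(copy.deepcopy(base), task) for task in tasks]
#     base = fuse_models(contributed)   # the fused model becomes the new base
```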
Mar 22, 2022
About generalization of different networks

Main finding: Generalization in pretraining follows a single dimension
Different networks, architectures, seeds, sizes but:
Similar performance → similar linguistic capabilities

@aclmeeting accepted (#NLProc)

Summary & story 🧵
It all began with a discussion of
C. Zhang, S. Bengio, @mrtz, @beenwrekt, @OriolVinyalsML
Fascinating work.
arxiv.org/abs/1611.03530

More about their work
We wondered:
Why do networks learn but hardly overfit?
Why doesn't overfitting on the training set hurt test performance?

It means VC dimension is a really bad way to think about learning
Mar 21, 2022
During training, your loss goes up and down up and down up and down.

But how would it go if you magically went in a straight line
from init to learnt position?

Apparently smoothly down!

On the surprising Linear Interpolation:
#scientivism #deepRead #MachineLearning
It all started at ICLR 2015(!)
@goodfellow_ian @OriolVinyalsML @SaxeLab
Checked points between the converged model and the random initialization.
They found that the loss between them is monotonically decreasing.
Why shouldn't it be?
Well... the real question is why it should be.

If the loss terrain is anything but a slope, we would expect bumps. Maybe there are different sinks (local minima), or you need to pass through a bad model before you reach the best one (topologically, you are in a ditch).
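In code, the experiment is essentially the sketch below: evaluate the loss at evenly spaced points on the straight line between the random init and the trained weights (loss_fn and data_loader are assumed to exist, and this is a sketch, not the original papers' code):

```python
import copy
import torch

@torch.no_grad()
def interpolation_losses(init_model, trained_model, loss_fn, data_loader, steps=11):
    """Loss along the straight line in weight space from the random init (alpha=0)
    to the trained model (alpha=1). Sketch: loss_fn and data_loader are assumed."""
    probe = copy.deepcopy(trained_model)
    init_p = dict(init_model.named_parameters())
    trained_p = dict(trained_model.named_parameters())
    losses = []
    for i in range(steps):
        alpha = i / (steps - 1)
        for name, p in probe.named_parameters():
            p.copy_((1 - alpha) * init_p[name] + alpha * trained_p[name])
        total, n = 0.0, 0
        for x, y in data_loader:
            total += loss_fn(probe(x), y).item() * len(y)
            n += len(y)
        losses.append(total / n)
    return losses  # the surprising finding: this tends to decrease monotonically
```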
Feb 9, 2022
I have just come across a new phenomenon:
Linear mode connectivity

What is the loss of the mid-model?
A model somewhere between converged models with different seeds?

#MachineLearning
Take two models, put them in the loss space
The path between them is the mode connectivity.

If the models converge into different solutions / loss pits (blue), then there is a barrier between them, called an "energy barrier" (yellow).
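Concretely, the mid-model question can be sketched like this: build the model halfway (in weight space) between the two converged models and measure its loss; if that loss is much higher than at the endpoints, the gap is the energy barrier. As before, loss_fn and data_loader are assumed placeholders.

```python
import copy
import torch

@torch.no_grad()
def midpoint_loss(model_a, model_b, loss_fn, data_loader):
    """Sketch: loss of the model exactly halfway (in weight space) between two
    models trained from different seeds; loss_fn and data_loader are assumed."""
    mid = copy.deepcopy(model_a)
    params_b = dict(model_b.named_parameters())
    for name, p in mid.named_parameters():
        p.copy_(0.5 * p + 0.5 * params_b[name])
    total, n = 0.0, 0
    for x, y in data_loader:
        total += loss_fn(mid(x), y).item() * len(y)
        n += len(y)
    return total / n  # much higher than the endpoint losses => an energy barrier
```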
Apparently, there is 𝘴𝘰𝘮𝘦 path (connectivity) where the loss stays almost the same. It is also relatively simple.
arxiv.org/pdf/1802.10026…
@tim_garipov @Pavel_Izmailov @FullStackML @andrewgwils
arxiv.org/pdf/1803.00885…
@FelixDrRelax @FredHamprecht Image
