♻️ Leshem Choshen ♻️
🥇 #NLProc researcher 🥈 Opinionatedly Summarizing #ML & #NLP papers 🥉 Good science #scientivism Let's pretrain together @IBMResearch & @MIT_CSAIL
Feb 15 10 tweets 3 min read
DoRA decomposes weights into magnitude and direction and
surpasses LoRA quite significantly

This is done with an empirical finding that I can't wrap my head around
@NVIDIAAI

@nbasyl_tw @chienyi_wang @yin_hongxu @PavloMolchanov @CMHungSteven arxiv.org/abs/2402.09353
LoRA, as you probably know, learns AB in addition to W (some dense matrix, e.g. a fully connected layer): two matrices with a low-dimensional bottleneck between them
So you learn
W + AB
DoRA adds another learned parameter, a magnitude m, and keeps only the direction of W + AB:
m · (W + AB) / ||W + AB||   (with the norm taken per column)
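Here is a minimal PyTorch sketch of how I read the DoRA update (not the authors' code; shapes and init are my assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    """Sketch of a DoRA-style linear layer: frozen W, LoRA pair (A, B), learned magnitude m."""
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        # frozen pretrained weight W (random here, only for the sketch)
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # the usual LoRA low-rank pair: delta = B @ A
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        # DoRA's extra piece: a learned per-column magnitude, initialized from W's column norms
        self.m = nn.Parameter(self.weight.norm(dim=0, keepdim=True))

    def forward(self, x):
        merged = self.weight + self.B @ self.A                  # W + AB
        direction = merged / merged.norm(dim=0, keepdim=True)   # keep only the direction
        return F.linear(x, self.m * direction)                  # rescale by the learned magnitude

x = torch.randn(2, 16)
layer = DoRALinear(16, 32)
print(layer(x).shape)  # torch.Size([2, 32])
```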
May 22, 2023 7 tweets 3 min read
In-Context Learning == gradient descent, or does it disregard labels completely?!
Why not both?

Models recognize the task but also learn it
& The benefits of actual learning grow with # examples and model size
Jane Pan @gaotianyu1350 @__howardchen @danqi_chen
arxiv.org/abs/2305.09731
So we have seen papers showing that models gain a lot from in-context examples (ICL) even with random labels
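To make the setup concrete, here is a hypothetical sketch of the gold-vs-random-labels comparison (my own toy example, not the paper's code):

```python
import random

def build_prompt(examples, query, shuffle_labels=False):
    """examples: list of (text, label) demonstration pairs."""
    labels = [label for _, label in examples]
    if shuffle_labels:
        random.shuffle(labels)            # the format stays, the input-label mapping becomes random
    lines = [f"Review: {text}\nSentiment: {label}"
             for (text, _), label in zip(examples, labels)]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

demos = [("Great movie, loved every minute.", "positive"),
         ("Terrible plot and worse acting.", "negative")]
print(build_prompt(demos, "Surprisingly fun.", shuffle_labels=True))
# If accuracy barely drops with shuffled labels, the model mostly *recognizes* the task;
# if gold labels help (and help more with scale and more demos), it also *learns* from them.
```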
Feb 7, 2023 6 tweets 4 min read
Parallel generation from autoregressive LMs
Para ll el!
Well, not exactly: use a fast LM to propose the next words first
arxiv.org/abs/2302.01318
@sebastiangoldt @advani_madhu @SaxeLab @KrzakalaF @zdeborova
@DeepMind

The story is very simple:
Autoregressive models predict the next word given the previous ones; annoying and, with a strong model, slow.
Instead, they propose to use a fast model to predict the next few words,
then check all of those words in parallel with the strong model and keep the ones it agrees with (sketch below).
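A simplified sketch of the idea (greedy version only; the paper uses a rejection-sampling scheme that keeps the strong model's output distribution exact, and `strong_lm` / `draft_lm` here are hypothetical interfaces):

```python
def speculative_step(strong_lm, draft_lm, prefix, k=4):
    """Propose k tokens with the cheap draft model, verify them with a single
    parallel pass of the strong model, and keep the prefix they agree on."""
    context = list(prefix)
    proposed = []
    for _ in range(k):
        tok = draft_lm.argmax_next(context)      # cheap, sequential proposals
        proposed.append(tok)
        context.append(tok)
    # One strong-model forward pass scores all k positions at once: this is the speedup.
    verified = strong_lm.argmax_batch(prefix, proposed)
    accepted = []
    for prop, ver in zip(proposed, verified):
        if prop != ver:
            accepted.append(ver)                 # strong model disagrees: take its token and stop
            break
        accepted.append(prop)
    return list(prefix) + accepted
```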
Feb 7, 2023 5 tweets 4 min read
Chain of Thought for vision
beating GPT-3 by 16% and supposedly even humans

Text and captions are not enough, but
with vision CoT does really well

@zhangzhuosheng @astonzhangAZ @mli65 Hai Zhao @karypis @smolix
arxiv.org/abs/2302.00923

What is this Chain of Thought everyone is talking about?
Just asking the model to explain the answer, that's it.
It is a big deal because sometimes explaining before answering improves the answer
But not here (in the figure: rationale = explain before the answer, explanation = explain after)
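To make "before vs. after" concrete, here are hypothetical prompt templates (my illustration, not the paper's prompts):

```python
# Hypothetical templates, only to make the distinction concrete.
question = "A bat and a ball cost $1.10; the bat costs $1.00 more than the ball. How much is the ball?"

direct          = f"Q: {question}\nA:"                               # answer only
rationale_first = f"Q: {question}\nLet's think step by step."        # explain before answering (CoT)
explain_after   = f"Q: {question}\nA: <model answer>\nExplain why:"  # explain after answering
```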
Feb 6, 2023 5 tweets 4 min read
Few-shot learning almost reaches traditional machine translation

Xavier Garcia @whybansal @ColinCherry George Foster, Maxim Krikun @fengfangxiaoyu @melvinjohnsonp @orf_bnw
arxiv.org/abs/2302.01398
#enough2skim #NLProc #neuralEmpty

The setting is quite simple:
Take a smaller LM (8B vs. 100-500B in some baselines),
bilingual in two languages (one might be low resource, see fig),
show it a few translation examples in the prompt (sketch below),
say abra kadabra 🪄
and you get a very good translation system
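A toy sketch of the few-shot prompt format (my own example pairs, not the paper's data):

```python
examples = [
    ("The cat sleeps on the sofa.", "Le chat dort sur le canapé."),
    ("I would like a coffee, please.", "Je voudrais un café, s'il vous plaît."),
]
source = "Where is the train station?"

prompt = "\n".join(f"English: {en}\nFrench: {fr}" for en, fr in examples)
prompt += f"\nEnglish: {source}\nFrench:"
# Feed `prompt` to the bilingual LM and take its continuation as the translation.
print(prompt)
```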
Dec 5, 2022 9 tweets 5 min read
We want to pretrain🤞
Instead we finetune🚮😔
Could we collaborate?🤗

ColD Fusion:
🔄Recycle finetuning to multitask
➡️evolve pretrained models forever

On 35 datasets
+2% improvement over RoBERTa
+7% in few shot settings
🧵

#NLProc #MachineLearning #NLP #ML #modelRecycling

We all wish to improve pretraining
If only we had unlimited compute and data...
Together we do!

We propose a way to recycle finetuning
and transform it into multitask learning!

arxiv.org/abs/2212.01378

@Shachar_Don @VenezianElad @colinraffel @noamslonim @YoavKatz73 me
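Roughly, the loop as I understand it (a hedged sketch, not the official implementation; `finetune` and `contributor_datasets` are hypothetical):

```python
import copy
import torch

def fuse(shared_model, finetuned_models):
    """Average the contributors' weights into the next shared model."""
    fused = copy.deepcopy(shared_model)
    with torch.no_grad():
        for name, param in fused.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name].detach()
                                   for m in finetuned_models])
            param.copy_(stacked.mean(dim=0))
    return fused

# One round of the loop:
# finetuned = [finetune(copy.deepcopy(shared), data) for data in contributor_datasets]
# shared = fuse(shared, finetuned)   # the fused model becomes everyone's next starting point
```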
Oct 26, 2022 31 tweets 9 min read
Is data really important for pretraining?
Could we just pretrain on 1 picture? Only synthetic text? Fractals?
A 🧵 summing up the image and text papers that do just that.
and they all reach a similar conclusion🤔

The main idea behind pretraining is that, given a hard enough loss, we can train on a lot of data and learn how the world works so well that we can easily transfer this knowledge to the tasks we really care about.
Mar 22, 2022 27 tweets 11 min read
About generalization of different networks

Main finding: Generalization in pretraining follows a single dimension
Different networks, architectures, seeds, sizes but:
Similar performance → similar linguistic capabilities

@aclmeeting accepted (#NLProc)

Summary & story 🧵

It all began in a discussion of
C. Zhang, S. Bengio, @mrtz, @beenwrekt, @OriolVinyalsML
Fascinating work.
arxiv.org/abs/1611.03530

More about their work
Mar 21, 2022 22 tweets 8 min read
During training, your loss goes up and down up and down up and down.

But how would it go if you magically went in a straight line
from init to learnt position?

Apparently smoothly down!

On the surprising linear interpolation:
#scientivism #deepRead #MachineLearning

It all started at ICLR 2015(!)
@goodfellow_ian @OriolVinyalsML @SaxeLab
They checked points between the converged model and the random initialization
and found that the loss along that line decreases monotonically (sketch below).
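A minimal sketch of that interpolation experiment in generic PyTorch (my own code, not theirs):

```python
import copy
import torch

def evaluate(model, loss_fn, data_loader):
    """Average loss of `model` over the data."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in data_loader:
            total += loss_fn(model(x), y).item() * len(y)
            n += len(y)
    return total / n

def interpolation_curve(init_model, final_model, loss_fn, data_loader, alphas):
    """Loss along the straight line theta(a) = (1 - a) * theta_init + a * theta_final."""
    probe = copy.deepcopy(init_model)
    init_p = dict(init_model.named_parameters())
    final_p = dict(final_model.named_parameters())
    losses = []
    for a in alphas:
        with torch.no_grad():
            for name, p in probe.named_parameters():
                p.copy_((1 - a) * init_p[name] + a * final_p[name])
        losses.append(evaluate(probe, loss_fn, data_loader))
    return losses
```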
Feb 9, 2022 15 tweets 7 min read
I have just come across a new phenomenon:
Linear mode connectivity

What is the loss of the mid-model?
A model somewhere between converged models with different seeds?

#MachineLearning

Take two models, put them in the loss space.
The path between them is the mode connectivity.

If the models converge to different solutions / loss pits (blue), then there is a barrier between them, called an "energy barrier" (yellow); see the sketch below.
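Given the losses along that line (e.g. from the interpolation sketch in the previous thread), the barrier is just the bump above the endpoints; a toy sketch in my own phrasing:

```python
def energy_barrier(path_losses):
    """path_losses: losses at evenly spaced points on the line between model A and model B."""
    endpoint_ref = (path_losses[0] + path_losses[-1]) / 2
    return max(path_losses) - endpoint_ref   # ~0 means the two models are linearly mode-connected

print(energy_barrier([0.30, 0.31, 0.95, 0.32, 0.29]))   # big bump: different loss pits
print(energy_barrier([0.30, 0.31, 0.30, 0.31, 0.29]))   # flat: linearly connected
```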
Oct 31, 2021 10 tweets 3 min read
Model combination / ensembling:
Average ensembling is practical - but naive.
Combine considering each network's strengths, much better!
Moreover, let's make the networks diverse so they will have different strengths.

Wenjuan Han & Hwee Tou Ng (no twitters?)
#enough2skim #NLProc

The basic idea is quite simple:
Given some models, why would we want the plain average? We want to rely on each one (or group) when it is more likely to be the correct one (toy sketch below).
This was actually introduced in our previous work (as the authors acknowledge) in
aclanthology.org/W19-4414.pdf
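A toy numeric illustration of the difference (my own sketch, not the paper's method):

```python
import numpy as np

probs = np.array([                 # per-model predicted distributions for one example
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.4, 0.4, 0.2],
])
gate = np.array([0.1, 0.8, 0.1])   # input-dependent weights, e.g. from a small gating network

average_ensemble = probs.mean(axis=0)   # naive: every model counts the same everywhere
weighted_ensemble = gate @ probs        # lean on the model most likely to be right for *this* input
print(average_ensemble, weighted_ensemble)
```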
Oct 24, 2021 6 tweets 3 min read
Ever since MAEGE (aclanthology.org/P18-1127/) I have had a soft spot for evaluation of evaluation = EoE (especially when it is automatic, but manual is still ok). This one: EoE for style transfer in multiple languages.
@ebriakou, @swetaagrawal20, @Tetreault_NLP, @MarineCarpuat
arxiv.org/pdf/2110.10668…
They end up with the following best practices:
Oct 20, 2021 15 tweets 4 min read
Are all language orders equally hard?
Supposedly, for RNNs yes, for Transformers no

@JenniferCWhite @sleepinyourhat
aclanthology.org/2021.acl-long.…

github.com/rycolab/artifi… (currently empty)
#NLProc
Really cool, with a caveat.

The paper creates synthetic languages (using a PCFG) with various ordering rules, which makes it possible to compare each order directly (toy sketch below).
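A toy sketch of that idea (hypothetical and far simpler than the paper's grammars): the same PCFG content with a switch that flips one ordering rule.

```python
import random

def expand(symbol, rules, verb_object_order="VO"):
    """Sample a sentence from a tiny PCFG; the switch flips verb-object order in VPs."""
    if symbol not in rules:              # terminal
        return [symbol]
    productions, weights = zip(*rules[symbol])
    rhs = random.choices(productions, weights=weights)[0]
    if symbol == "VP" and verb_object_order == "OV":
        rhs = tuple(reversed(rhs))       # flip the order of verb and object
    return [tok for child in rhs for tok in expand(child, rules, verb_object_order)]

rules = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("the", "dog"), 0.5), (("the", "cat"), 0.5)],
    "VP": [(("chases", "NP"), 1.0)],
}
print(" ".join(expand("S", rules, "VO")))  # e.g. "the dog chases the cat"
print(" ".join(expand("S", rules, "OV")))  # e.g. "the dog the cat chases"
```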
Aug 25, 2021 11 tweets 3 min read
A product of an unlikely collaboration, which I am thankful for:

When NLP and code researchers meet:
Huge 𝙘𝙤𝙢𝙢𝙞𝙩 𝙨𝙪𝙢𝙢𝙖𝙧𝙞𝙯𝙖𝙩𝙞𝙤𝙣 dataset

@HujiIdan and myself
arxiv.org/abs/2108.10763