♻️ Leshem Choshen ♻️
🥇 #NLProc researcher 🥈 Opinionatedly Summarizing #ML & #NLP papers 🥉 Good science #scientivism Let's pretrain together @IBMResearch & @MIT_CSAIL
Feb 15 10 tweets 3 min read
DoRA decomposes weights into magnitude and direction and
surpasses LoRA quite significantly

This is done with an empirical finding that I can't wrap my head around
@NVIDIAAI

@nbasyl_tw @chienyi_wang @yin_hongxu @PavloMolchanov @CMHungSteven arxiv.org/abs/2402.09353
LoRA, as you probably know, learns AB in addition to W (some dense matrix, e.g. a fully connected layer): two matrices with a low-dimensional bottleneck between them
So you learn
W + AB
DoRA adds another learned parameter, a magnitude m, and keeps only the direction of W + AB:
m · (W + AB) / ||W + AB||   (with the norm taken per column)
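Here is a minimal PyTorch sketch of how I read the DoRA update (not the authors' code; shapes and init are my assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    """Sketch of a DoRA-style linear layer: frozen W, LoRA pair (A, B), learned magnitude m."""
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        # frozen pretrained weight W (random here, only for the sketch)
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # the usual LoRA low-rank pair: delta = B @ A
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        # DoRA's extra piece: a learned per-column magnitude, initialized from W's column norms
        self.m = nn.Parameter(self.weight.norm(dim=0, keepdim=True))

    def forward(self, x):
        merged = self.weight + self.B @ self.A                  # W + AB
        direction = merged / merged.norm(dim=0, keepdim=True)   # keep only the direction
        return F.linear(x, self.m * direction)                  # rescale by the learned magnitude

x = torch.randn(2, 16)
layer = DoRALinear(16, 32)
print(layer(x).shape)  # torch.Size([2, 32])
```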
May 22, 2023 7 tweets 3 min read
In-Context Learning == gradient descent, or does it disregard labels completely?!
Why not both?

Models recognize the task but also learn it
& The benefits of actual learning grow with # examples and model size
Jane Pan @gaotianyu1350 @__howardchen @danqi_chen
arxiv.org/abs/2305.09731
So we have seen papers showing that models gain a lot from in-context examples (ICL) even with random labels
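To make the setup concrete, here is a hypothetical sketch of the gold-vs-random-labels comparison (my own toy example, not the paper's code):

```python
import random

def build_prompt(examples, query, shuffle_labels=False):
    """examples: list of (text, label) demonstration pairs."""
    labels = [label for _, label in examples]
    if shuffle_labels:
        random.shuffle(labels)            # the format stays, the input-label mapping becomes random
    lines = [f"Review: {text}\nSentiment: {label}"
             for (text, _), label in zip(examples, labels)]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

demos = [("Great movie, loved every minute.", "positive"),
         ("Terrible plot and worse acting.", "negative")]
print(build_prompt(demos, "Surprisingly fun.", shuffle_labels=True))
# If accuracy barely drops with shuffled labels, the model mostly *recognizes* the task;
# if gold labels help (and help more with scale and more demos), it also *learns* from them.
```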
Feb 7, 2023 6 tweets 4 min read
Parallel generation from autoregressive LMs
Para ll el!
Well, not exactly: use a fast LM to propose the next words first
arxiv.org/abs/2302.01318
@sebastiangoldt @advani_madhu @SaxeLab @KrzakalaF @zdeborova
@DeepMind

The story is very simple:
Autoregressive models predict the next word given the previous ones; annoying and, with a strong model, slow.
Instead, they propose to use a fast model to predict the next few words,
then check all of those words in parallel with the strong model and keep the ones it agrees with (sketch below).
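A simplified sketch of the idea (greedy version only; the paper uses a rejection-sampling scheme that keeps the strong model's output distribution exact, and `strong_lm` / `draft_lm` here are hypothetical interfaces):

```python
def speculative_step(strong_lm, draft_lm, prefix, k=4):
    """Propose k tokens with the cheap draft model, verify them with a single
    parallel pass of the strong model, and keep the prefix they agree on."""
    context = list(prefix)
    proposed = []
    for _ in range(k):
        tok = draft_lm.argmax_next(context)      # cheap, sequential proposals
        proposed.append(tok)
        context.append(tok)
    # One strong-model forward pass scores all k positions at once: this is the speedup.
    verified = strong_lm.argmax_batch(prefix, proposed)
    accepted = []
    for prop, ver in zip(proposed, verified):
        if prop != ver:
            accepted.append(ver)                 # strong model disagrees: take its token and stop
            break
        accepted.append(prop)
    return list(prefix) + accepted
```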
Feb 7, 2023 5 tweets 4 min read
Chain of Thought for vision
beating GPT-3 by 16% and supposedly even humans

Text and captions are not enough, but
with vision CoT does really well

@zhangzhuosheng @astonzhangAZ @mli65 Hai Zhao @karypis @smolix
arxiv.org/abs/2302.00923

What is this Chain of Thought everyone is talking about?
Just asking the model to explain the answer, that's it.
It is a big deal because sometimes explaining before answering improves the answer
But not here (in the figure: rationale = explain before the answer, explanation = explain after)
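To make "before vs. after" concrete, here are hypothetical prompt templates (my illustration, not the paper's prompts):

```python
# Hypothetical templates, only to make the distinction concrete.
question = "A bat and a ball cost $1.10; the bat costs $1.00 more than the ball. How much is the ball?"

direct          = f"Q: {question}\nA:"                               # answer only
rationale_first = f"Q: {question}\nLet's think step by step."        # explain before answering (CoT)
explain_after   = f"Q: {question}\nA: <model answer>\nExplain why:"  # explain after answering
```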
Feb 6, 2023 5 tweets 4 min read
Few-shot learning almost reaches traditional machine translation

Xavier Garcia @whybansal @ColinCherry George Foster, Maxim Krikun @fengfangxiaoyu @melvinjohnsonp @orf_bnw
arxiv.org/abs/2302.01398
#enough2skim #NLProc #neuralEmpty

The setting is quite simple:
Take a smaller LM (8B vs. 100-500B in some baselines),
bilingual in two languages (one might be low resource, see fig),
show it a few translation examples in the prompt (sketch below),
say abra kadabra 🪄
and you get a very good translation system
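A toy sketch of the few-shot prompt format (my own example pairs, not the paper's data):

```python
examples = [
    ("The cat sleeps on the sofa.", "Le chat dort sur le canapé."),
    ("I would like a coffee, please.", "Je voudrais un café, s'il vous plaît."),
]
source = "Where is the train station?"

prompt = "\n".join(f"English: {en}\nFrench: {fr}" for en, fr in examples)
prompt += f"\nEnglish: {source}\nFrench:"
# Feed `prompt` to the bilingual LM and take its continuation as the translation.
print(prompt)
```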
Dec 5, 2022 9 tweets 5 min read
We want to pretrain🤞
Instead we finetune🚮😔
Could we collaborate?🤗

ColD Fusion:
🔄Recycle finetuning to multitask
➡️evolve pretrained models forever

On 35 datasets
+2% improvement over RoBERTa
+7% in few shot settings
🧵

#NLProc #MachineLearning #NLP #ML #modelRecycling

We all wish to improve pretraining
If only we had unlimited compute and data...
Together we do!

We propose a way to recycle finetuning
and transform it into multitask learning!

arxiv.org/abs/2212.01378

@Shachar_Don @VenezianElad @colinraffel @noamslonim @YoavKatz73 me
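Roughly, the loop as I understand it (a hedged sketch, not the official implementation; `finetune` and `contributor_datasets` are hypothetical):

```python
import copy
import torch

def fuse(shared_model, finetuned_models):
    """Average the contributors' weights into the next shared model."""
    fused = copy.deepcopy(shared_model)
    with torch.no_grad():
        for name, param in fused.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name].detach()
                                   for m in finetuned_models])
            param.copy_(stacked.mean(dim=0))
    return fused

# One round of the loop:
# finetuned = [finetune(copy.deepcopy(shared), data) for data in contributor_datasets]
# shared = fuse(shared, finetuned)   # the fused model becomes everyone's next starting point
```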
Oct 26, 2022 31 tweets 9 min read
Is data really important for pretraining?
Could we just pretrain on 1 picture? Only synthetic text? Fractals?
A 🧵 summing up the image and text papers that do just that.
and they all reach a similar conclusion🤔

The main idea behind pretraining is that, given a hard enough loss, we can train on a lot of data and learn how the world works so well that we can easily transfer this knowledge to the tasks we really care about.
Mar 22, 2022 27 tweets 11 min read
About generalization of different networks

Main finding: Generalization in pretraining follows a single dimension
Different networks, architectures, seeds, sizes but:
Similar performance → similar linguistic capabilities

@aclmeeting accepted (#NLProc)

Summary & story 🧵

It all began in a discussion of
C. Zhang, S. Bengio, @mrtz, @beenwrekt, @OriolVinyalsML
Fascinating work.
arxiv.org/abs/1611.03530

More about their work
Mar 21, 2022 22 tweets 8 min read
During training, your loss goes up and down up and down up and down.

But how would it go if you magically went in a straight line
from init to learnt position?

Apparently smoothly down!

On the surprising linear interpolation:
#scientivism #deepRead #MachineLearning

It all started at ICLR 2015(!)
@goodfellow_ian @OriolVinyalsML @SaxeLab
They checked points between the converged model and the random initialization
and found that the loss along that line decreases monotonically (sketch below).
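A minimal sketch of that interpolation experiment in generic PyTorch (my own code, not theirs):

```python
import copy
import torch

def evaluate(model, loss_fn, data_loader):
    """Average loss of `model` over the data."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in data_loader:
            total += loss_fn(model(x), y).item() * len(y)
            n += len(y)
    return total / n

def interpolation_curve(init_model, final_model, loss_fn, data_loader, alphas):
    """Loss along the straight line theta(a) = (1 - a) * theta_init + a * theta_final."""
    probe = copy.deepcopy(init_model)
    init_p = dict(init_model.named_parameters())
    final_p = dict(final_model.named_parameters())
    losses = []
    for a in alphas:
        with torch.no_grad():
            for name, p in probe.named_parameters():
                p.copy_((1 - a) * init_p[name] + a * final_p[name])
        losses.append(evaluate(probe, loss_fn, data_loader))
    return losses
```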
Feb 9, 2022 15 tweets 7 min read
I have just come across a new phenomenon:
Linear mode connectivity

What is the loss of the mid-model?
A model somewhere between converged models with different seeds?

#MachineLearning

Take two models, put them in the loss space.
The path between them is the mode connectivity.

If the models converge to different solutions / loss pits (blue), then there is a barrier between them, called an "energy barrier" (yellow); see the sketch below.
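Given the losses along that line (e.g. from the interpolation sketch in the previous thread), the barrier is just the bump above the endpoints; a toy sketch in my own phrasing:

```python
def energy_barrier(path_losses):
    """path_losses: losses at evenly spaced points on the line between model A and model B."""
    endpoint_ref = (path_losses[0] + path_losses[-1]) / 2
    return max(path_losses) - endpoint_ref   # ~0 means the two models are linearly mode-connected

print(energy_barrier([0.30, 0.31, 0.95, 0.32, 0.29]))   # big bump: different loss pits
print(energy_barrier([0.30, 0.31, 0.30, 0.31, 0.29]))   # flat: linearly connected
```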
Oct 31, 2021 10 tweets 3 min read
Model combination / ensembling:
Average ensembling is practical - but naive.
Combine considering each network's strengths, much better!
Moreover, let's make the networks diverse so they will have different strengths.

Wenjuan Han & Hwee Tou Ng (no twitters?)
#enough2skim #NLProc

The basic idea is quite simple:
Given some models, why would we want the plain average? We want to rely on each one (or group) when it is more likely to be the correct one (toy sketch below).
This was actually introduced in our previous work (as the authors acknowledge) in
aclanthology.org/W19-4414.pdf
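A toy numeric illustration of the difference (my own sketch, not the paper's method):

```python
import numpy as np

probs = np.array([                 # per-model predicted distributions for one example
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.4, 0.4, 0.2],
])
gate = np.array([0.1, 0.8, 0.1])   # input-dependent weights, e.g. from a small gating network

average_ensemble = probs.mean(axis=0)   # naive: every model counts the same everywhere
weighted_ensemble = gate @ probs        # lean on the model most likely to be right for *this* input
print(average_ensemble, weighted_ensemble)
```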
Oct 24, 2021 6 tweets 3 min read
Ever since MAEGE (aclanthology.org/P18-1127/) I have had a soft spot for evaluation of evaluation = EoE (especially when it is automatic, but manual is still ok). This one: EoE for style transfer in multiple languages.
@ebriakou, @swetaagrawal20, @Tetreault_NLP, @MarineCarpuat
arxiv.org/pdf/2110.10668…
They end up with the following best practices:
Oct 20, 2021 15 tweets 4 min read
Are all language orders equally hard?
Supposedly, for RNNs yes, for Transformers no

@JenniferCWhite @sleepinyourhat
aclanthology.org/2021.acl-long.…

github.com/rycolab/artifi… (currently empty)
#NLProc
Really cool, with a caveat.

The paper creates synthetic languages (using a PCFG) with various ordering rules, which makes it possible to compare each order directly (toy sketch below).
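A toy sketch of that idea (hypothetical and far simpler than the paper's grammars): the same PCFG content with a switch that flips one ordering rule.

```python
import random

def expand(symbol, rules, verb_object_order="VO"):
    """Sample a sentence from a tiny PCFG; the switch flips verb-object order in VPs."""
    if symbol not in rules:              # terminal
        return [symbol]
    productions, weights = zip(*rules[symbol])
    rhs = random.choices(productions, weights=weights)[0]
    if symbol == "VP" and verb_object_order == "OV":
        rhs = tuple(reversed(rhs))       # flip the order of verb and object
    return [tok for child in rhs for tok in expand(child, rules, verb_object_order)]

rules = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("the", "dog"), 0.5), (("the", "cat"), 0.5)],
    "VP": [(("chases", "NP"), 1.0)],
}
print(" ".join(expand("S", rules, "VO")))  # e.g. "the dog chases the cat"
print(" ".join(expand("S", rules, "OV")))  # e.g. "the dog the cat chases"
```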
Aug 25, 2021 11 tweets 3 min read
A product of an unlikely collaboration, which I am thankful for:

When NLP and code researchers meet:
Huge 𝙘𝙤𝙢𝙢𝙞𝙩 𝙨𝙪𝙢𝙢𝙖𝙧𝙞𝙯𝙖𝙩𝙞𝙤𝙣 dataset

@HujiIdan and myself
arxiv.org/abs/2108.10763