🥇 #NLProc researcher
🥈 Opinionatedly Summarizing #ML & #NLP papers
🥉 Good science #scientivism
Let's pretrain together
@IBMResearch & @MIT_CSAIL
Feb 15 • 10 tweets • 3 min read
DoRA decomposes weights into magnitude and direction and surpasses LoRA quite significantly
This is done with an empirical finding that I can't wrap my head around
@NVIDIAAI
@nbasyl_tw @chienyi_wang @yin_hongxu @PavloMolchanov @CMHungSteven arxiv.org/abs/2402.09353
LoRA, as you probably know, learns two low-rank matrices A and B in addition to W (some dense matrix, e.g. a fully connected layer)
So you learn
W+AB
DoRA suggests another learned parameter, a magnitude m:
m * (W+AB) / ||W+AB||
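Roughly, in numpy (a toy sketch with shapes and inits I made up; the norm is taken per column, as in the paper's formula):

```python
import numpy as np

d, k, r = 512, 512, 8              # weight shape and LoRA rank
W = np.random.randn(d, k) * 0.02   # frozen pretrained weight
A = np.random.randn(d, r) * 0.01   # trainable low-rank factors
B = np.zeros((r, k))               # B starts at zero, so AB = 0 at init

# LoRA: only A and B are trained
W_lora = W + A @ B

# DoRA: additionally learn a per-column magnitude m and keep
# only the *direction* of W + AB
m = np.linalg.norm(W, axis=0, keepdims=True)   # init m to the column norms of W
direction = (W + A @ B) / np.linalg.norm(W + A @ B, axis=0, keepdims=True)
W_dora = m * direction
```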
May 22, 2023 • 7 tweets • 3 min read
In-Context-Learning == gradient descent or disregards labels completely?!
Why not both?
Models recognize the task but also learn it
& The benefits of actual learning grow with # examples and model size
Jane Pan @gaotianyu1350 @__howardchen @danqi_chen arxiv.org/abs/2305.09731
So we have seen papers showing that models gain a lot from seeing examples (ICL) even with random labels
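To make the setup concrete, here is a toy way to build such a random-label prompt (the template and the sentiment labels are my own illustration, not the paper's exact format):

```python
import random

demos = [
    ("the movie was a delight", "positive"),
    ("a complete waste of time", "negative"),
    ("charming and sharply written", "positive"),
    ("the plot never comes together", "negative"),
]
labels = ["positive", "negative"]

def build_prompt(demos, query, randomize_labels=False):
    lines = []
    for text, gold in demos:
        # with randomize_labels=True the demonstrations still show the format
        # and the label space, but the input-label mapping is destroyed
        label = random.choice(labels) if randomize_labels else gold
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

print(build_prompt(demos, "a tedious, overlong mess", randomize_labels=True))
```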
Parallel generation from autoregressive LMs
Para ll el!
Well, not exactly: use a fast LM to propose the next words first arxiv.org/abs/2302.01318 @sebastiangoldt @advani_madhu @SaxeLab @KrzakalaF @zdeborova @DeepMind
The story is very simple
Autoregressive models predict the next word given the previous ones, which is annoying and, with a strong model, slow
Instead, they propose to use a fast model to predict the next words
Then check on all of those words whether the strong model agrees about them
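Roughly, the draft-then-verify loop looks like this (a simplified greedy-agreement sketch; draft_lm and strong_lm are placeholders, and the real method uses rejection sampling to keep the sampling distribution exact):

```python
def speculative_step(prefix, draft_lm, strong_lm, k=4):
    """One round of draft-then-verify decoding (greedy-agreement version).

    draft_lm(tokens)          -> one next-token id (cheap, called k times)
    strong_lm(prefix, draft)  -> the strong model's own next-token choice at each
                                 of the k draft positions, from a single batched
                                 forward pass (this is where the speed-up comes from)
    """
    # 1) the fast model guesses the next k tokens one by one
    draft = []
    for _ in range(k):
        draft.append(draft_lm(prefix + draft))

    # 2) the strong model checks all k guesses at once
    verified = strong_lm(prefix, draft)

    # 3) keep draft tokens while the strong model agrees;
    #    at the first disagreement, take the strong model's token and stop
    accepted = []
    for guess, truth in zip(draft, verified):
        if guess == truth:
            accepted.append(guess)
        else:
            accepted.append(truth)
            break
    return prefix + accepted
```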
Feb 7, 2023 • 5 tweets • 4 min read
Chain of Thought for vision
beating GPT-3 by 16% and supposedly humans
Text and captions are not enough, but
with vision CoT does really well
@zhangzhuosheng @astonzhangAZ @mli65 Hai Zhao @karypis @smolix arxiv.org/abs/2302.00923
What is this Chain of Thought everyone is talking about?
Just asking the model to explain the answer, that's it.
It is a big deal because sometimes explaining before answering improves the answer
But not here (fig: rationale = explaining before the answer, explanation = after)
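A toy sketch of the two orders the figure contrasts (the prompt wording is mine, and generate() stands for whatever LM call you use):

```python
def explain_then_answer(question, generate):
    # "rationale": reason first, answer afterwards (the chain-of-thought order)
    rationale = generate(f"Question: {question}\nLet's think step by step:")
    answer = generate(
        f"Question: {question}\nReasoning: {rationale}\nTherefore, the answer is:"
    )
    return rationale, answer

def answer_then_explain(question, generate):
    # "explanation": commit to an answer first, justify it afterwards
    answer = generate(f"Question: {question}\nAnswer:")
    explanation = generate(f"Question: {question}\nAnswer: {answer}\nExplain why:")
    return answer, explanation
```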
Feb 6, 2023 • 5 tweets • 4 min read
Few-shot learning almost reaches traditional machine translation
Is data really important for pretraining?
Could we just pretrain on 1 picture? Only synthetic text? Fractals?
A 🧵 summarizing the image and text papers that do just that,
and they all reach a similar conclusion 🤔
The main idea behind pretraining is that, given a hard enough loss, we can train on a lot of data and learn how the world works so well that we can easily transfer this knowledge to the tasks we really care about.
Mar 22, 2022 • 27 tweets • 11 min read
About generalization of different networks
Main finding: Generalization in pretraining follows a single dimension
Different networks, architectures, seeds, sizes but:
Similar performance → similar linguistic capabilities
I have just found a new phenomenon:
Linear mode connectivity
What is the loss of the mid-model?
A model somewhere between converged models with different seeds?
#MachineLearning
Take two models and put them in the loss space.
The points on the line between them are what mode connectivity looks at.
If the models converge to different solutions / loss basins (blue), there is a barrier between them, called an "energy barrier" (yellow).
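A sketch of how you would probe that barrier, assuming two checkpoints with the same architecture and an eval_loss() you supply:

```python
import numpy as np

def interpolate_weights(theta_a, theta_b, alpha):
    """Linearly interpolate two models' parameters (dicts of numpy arrays)."""
    return {name: (1 - alpha) * theta_a[name] + alpha * theta_b[name]
            for name in theta_a}

def connectivity_curve(theta_a, theta_b, eval_loss, steps=11):
    """Loss along the straight line between two solutions.

    A flat curve means linear mode connectivity;
    a bump in the middle is the energy barrier.
    """
    alphas = np.linspace(0.0, 1.0, steps)
    return [(a, eval_loss(interpolate_weights(theta_a, theta_b, a)))
            for a in alphas]
```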
Oct 31, 2021 • 10 tweets • 3 min read
Model combination / ensembling:
Average ensembling is practical - but naive.
Combining while considering each network's strengths is much better!
Moreover, let's make the networks diverse so they will have different strengths.
Wenjuan Han & Hwee Tou Ng (no twitters?) #enough2skim #NLProc
The basic idea is quite simple:
Given some models, why would we want the plain average? We want to rely on each one (or group) when it is more likely to be correct.
This was actually introduced in our previous work (as admitted by the authors) in aclanthology.org/W19-4414.pdf
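A minimal sketch of the difference, with made-up per-example weights standing in for however the method estimates each model's strength:

```python
import numpy as np

# predicted class probabilities from 3 models on one example, shape (models, classes)
probs = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.4, 0.4, 0.2],
])

# naive average ensembling: every model counts the same
avg = probs.mean(axis=0)

# weighted combination: trust each model according to some estimate of its
# strength on this kind of input (the weights here are arbitrary placeholders)
weights = np.array([0.2, 0.7, 0.1])
weighted = weights @ probs

print(avg.argmax(), weighted.argmax())   # the two ensembles can disagree
```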
github.com/rycolab/artifi… (currently empty) #NLProc
Really cool, with a caveat
The paper creates synthetic languages (using a PCFG) with various word-ordering rules, making it possible to compare the orders.
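A hand-rolled toy of the idea (grammar and probabilities made up), where two languages differ only in verb/object order:

```python
import random

def sample(grammar, symbol="S"):
    """Sample a sentence from a PCFG given as {symbol: [(probability, rhs), ...]}."""
    if symbol not in grammar:          # terminal symbol
        return [symbol]
    probs, rules = zip(*grammar[symbol])
    rhs = random.choices(rules, weights=probs, k=1)[0]
    return [tok for sym in rhs for tok in sample(grammar, sym)]

# two synthetic languages that differ only in the VP ordering rule
vo_grammar = {
    "S":  [(1.0, ["NP", "VP"])],
    "VP": [(1.0, ["V", "NP"])],                 # verb before object
    "NP": [(0.5, ["dog"]), (0.5, ["cat"])],
    "V":  [(1.0, ["chases"])],
}
ov_grammar = {**vo_grammar, "VP": [(1.0, ["NP", "V"])]}   # object before verb

print(" ".join(sample(vo_grammar)))   # e.g. "dog chases cat"
print(" ".join(sample(ov_grammar)))   # e.g. "dog cat chases"
```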
Aug 25, 2021 • 11 tweets • 3 min read
A product of an unlikely collaboration, which I am thankful for:
When NLP and code researchers meet
Huge 𝙘𝙤𝙢𝙢𝙞𝙩 𝙨𝙪𝙢𝙢𝙖𝙧𝙞𝙯𝙖𝙩𝙞𝙤𝙣 dataset