Greg Yang
make america grok again
Oct 4, 2023 25 tweets 7 min read
Nontrivial ∞width neural nets are either kernel machines or feature learners. Latter's scaling makes optimal hyperparams invariant to width

What if depth → ∞ as well?

🆕 Feature diversity is key; maxed out by abs (not relu); gives invariance to depth!

But GPT is flawed 🧵

What “feature diversity” means here is that neighboring layers should NOT do almost the same thing – that’s a waste of capacity!

New paper:
I cannot show enough appreciation to my coauthors @dingli_yu @chenzhucs @hayou_soufiane !!!

More explanations 👇 arxiv.org/abs/2310.02244
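Not from the thread: one crude way to make the “neighboring layers shouldn’t do almost the same thing” idea concrete is to compare the updates that consecutive residual blocks add to the features. The 1/sqrt(depth) branch scaling, the cosine-similarity proxy, and the random-init setting below are all my assumptions for illustration; this is not the paper’s metric and it does not reproduce the abs-vs-relu result.

```python
import numpy as np

# Toy probe (my assumptions, not the paper's metric): in a randomly initialized
# residual net x_{l+1} = x_l + phi(W_l x_l)/sqrt(L), measure how aligned the
# contributions of consecutive blocks are. High alignment = neighboring layers
# "doing almost the same thing", i.e. low feature diversity.

rng = np.random.default_rng(0)

def mean_neighbor_alignment(phi, width=512, depth=128, n_inputs=32):
    x = rng.standard_normal((width, n_inputs)) / np.sqrt(width)
    prev, sims = None, []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        update = phi(W @ x) / np.sqrt(depth)  # 1/sqrt(depth) branch scaling (assumed)
        if prev is not None:
            sims.append(np.sum(prev * update)
                        / (np.linalg.norm(prev) * np.linalg.norm(update)))
        x, prev = x + update, update
    return float(np.mean(sims))

print("relu:", mean_neighbor_alignment(lambda z: np.maximum(z, 0)))
print("abs :", mean_neighbor_alignment(np.abs))
```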
Aug 9, 2023 17 tweets 4 min read
1/ How to scale hyperparams (eg learning rate) as neural network gets wider? Esp w/ adaptive optimizers like Adam?

I derived the answer (μP) in 2020 & verified it on GPT3

This required some beautiful new math that’s just been completely written down w/ @EtaiLittwin
🧵👇

2/ 📜

I tweeted this a few days ago but due to a twitter bug and my oversight, some stuff didn't get posted. So here's a do-over!

RT even if you've seen this already! arxiv.org/abs/2308.01814
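Not in the thread, but as a rough sketch of what “scale the learning rate with width” looks like for Adam: per my reading of μP, weight matrices whose fan-in grows with width get a learning rate that shrinks like 1/width, while input-layer weights and biases keep a width-independent one (μP also rescales the output layer’s multiplier/initialization, which I omit here). The helper and its shape-based heuristic are hypothetical; the mup repo in the μTransfer thread below has the authoritative rules.

```python
import torch
import torch.nn as nn

# Hedged paraphrase of the muP rule for Adam (see the paper / microsoft/mup
# for the authoritative table): as hidden width grows, Adam's learning rate
# for weight matrices whose fan-in scales with width shrinks like 1/width,
# while input weights and biases keep a width-independent learning rate.
# Note: muP also rescales the output layer's multiplier/init (omitted here).

def mup_adam_param_groups(model, base_lr, base_width, width):
    """Build Adam param groups with width-scaled learning rates (sketch only)."""
    scaled, other = [], []
    for _, p in model.named_parameters():
        # Heuristic: 2D weights whose fan-in equals the hidden width get 1/width.
        if p.ndim == 2 and p.shape[1] == width:
            scaled.append(p)
        else:
            other.append(p)
    return [
        {"params": scaled, "lr": base_lr * base_width / width},
        {"params": other, "lr": base_lr},
    ]

width = 1024
model = nn.Sequential(nn.Linear(32, width), nn.ReLU(),
                      nn.Linear(width, width), nn.ReLU(),
                      nn.Linear(width, 10))
opt = torch.optim.Adam(mup_adam_param_groups(model, base_lr=1e-3,
                                             base_width=256, width=width))
```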
Mar 8, 2022 15 tweets 9 min read
1/ You can't train GPT-3 on a single GPU, much less tune its hyperparameters (HPs).

But what if I tell you…

…you *can* tune its HPs on a single GPU thanks to new theoretical advances?

paper arxiv.org/abs/2203.03466
code github.com/microsoft/mup
blog microsoft.com/en-us/research…

2/ The idea is actually really simple: in a special parametrization introduced in arxiv.org/abs/2011.14522 called µP, narrow and wide neural networks share the same set of optimal hyperparameters. This works even as width -> ∞.
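For completeness, a minimal sketch of how the released mup package (code link above) is typically wired up, written from memory of its README; treat the exact function names and signatures as assumptions and defer to github.com/microsoft/mup.

```python
import torch.nn as nn
from mup import MuReadout, MuAdam, set_base_shapes

# Sketch of the usual mup workflow (check github.com/microsoft/mup for the
# authoritative API): build the model at a small "base" width, a slightly
# different "delta" width, and the target width, then let set_base_shapes
# record how each parameter scales with width. MuAdam then applies the
# muP per-parameter learning-rate rules. (The repo also provides width-aware
# init helpers for exact muP initialization; omitted here.)

def make_mlp(width, d_in=32, d_out=10):
    return nn.Sequential(
        nn.Linear(d_in, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        MuReadout(width, d_out),  # muP-aware output layer
    )

model = make_mlp(width=1024)
set_base_shapes(model, make_mlp(width=64), delta=make_mlp(width=128))
opt = MuAdam(model.parameters(), lr=1e-3)  # lr tuned at small width transfers
```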
Aug 6, 2020 10 tweets 2 min read
1/ The histogram of eigenvals in a large random symmetric matrix ≈ a semicircle!! So sick! This "Semicircle Law" is essentially "Central Limit" for rand symmetric mats (even more elegant bc u knew what a semicircle is by 1st grade, but wtf was a Gaussian?). Let me tell ya why

2/ Recall the Fourier transform way of showing the central limit theorem: for iid X1, ..., Xk ~ distribution P, the characteristic function of
(X1 + ... + Xk)/sqrt(k)
is
F(t/sqrt(k))^k,
where F is the characteristic function of P.
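A quick numerical check (my addition, not from the thread): sample a large symmetric matrix whose entries have variance 1/n and compare the eigenvalue histogram to the semicircle density sqrt(4 − x²)/(2π) on [−2, 2].

```python
import numpy as np
import matplotlib.pyplot as plt

# Empirical check of the semicircle law: eigenvalues of a large random
# symmetric matrix with entries of variance 1/n have a histogram approaching
# rho(x) = sqrt(4 - x^2) / (2*pi) on [-2, 2].

n = 2000
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
S = (A + A.T) / np.sqrt(2 * n)          # symmetric, off-diag entries ~ N(0, 1/n)
eigs = np.linalg.eigvalsh(S)

x = np.linspace(-2, 2, 400)
plt.hist(eigs, bins=60, density=True, alpha=0.5, label="eigenvalue histogram")
plt.plot(x, np.sqrt(4 - x**2) / (2 * np.pi), label="semicircle density")
plt.legend()
plt.show()
```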
May 13, 2020 6 tweets 5 min read
1/4 WTF guys I think I broke ML: loss & acc 🡅 together! reproduced here github.com/thegregyang/Lo…. Somehow good accuracy is achieved *in spite of* classic generalizn theory (wrt the loss) - What's goin on? @roydanroy @prfsanjeevarora @ShamKakade6 @BachFrancis @SebastienBubeck

2/4 More precisely, classic theory goes like this "when we train using xent loss, we get good pop loss by early stopping b4 valid loss 🡅. B/c xent is a good proxy for 0-1 loss, we expect good pop accuracy from this procedure." But here we got good acc w/o getting good pop loss.
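A toy numerical illustration of the mechanism (my example, not the repo's): late in training the logits become very confident, so a borderline test point flipping to correct nudges accuracy up, while the few remaining mistakes are confidently wrong and blow up the average cross-entropy.

```python
import numpy as np

# Toy illustration (not from the thread) of loss and accuracy rising together.
# Binary case: margin = logit of correct class minus logit of the other class.

def xent_and_acc(margins):
    loss = np.mean(np.log1p(np.exp(-margins)))   # logistic / cross-entropy loss
    acc = np.mean(margins > 0)
    return loss, acc

# Early in training: modest confidence, 3 of 10 test points wrong by a little.
early = np.array([1.0] * 7 + [-0.5] * 3)
# Later: confidence blows up; one wrong point flips to correct (acc up),
# but the two remaining errors are confidently wrong (loss up too).
late = np.array([8.0] * 8 + [-6.0] * 2)

print("early:", xent_and_acc(early))   # ~ (0.51, 0.70)
print("late :", xent_and_acc(late))    # ~ (1.20, 0.80)
```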
Dec 5, 2019 26 tweets 10 min read
1/ Why do wide, random neural networks form Gaussian processes, *regardless of architecture*? Let me give an overview in case you are too lazy to check out the paper arxiv.org/abs/1910.12478 or the code github.com/thegregyang/GP…. The proof has two parts…

2/ Part 1 shows that any architecture can be expressed as a principled combination of matrix multiplication and nonlinearity application; such a combination is called a *tensor program*. The image shows an example.
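A quick empirical sanity check of the claim (my sketch, not part of the proof): draw many wide random MLPs, written as alternating matrix multiplications and coordinatewise nonlinearities in the spirit of a tensor program, and test that a fixed output at a fixed input is approximately Gaussian across weight draws.

```python
import numpy as np
from scipy import stats

# Sanity check (my sketch): over random weight draws, the scalar output of a
# wide random MLP at a fixed input looks Gaussian. The forward pass alternates
# matrix multiplication and coordinatewise nonlinearity.

rng = np.random.default_rng(0)
width, d_in, n_nets = 500, 16, 1000
x = rng.standard_normal(d_in)

outputs = []
for _ in range(n_nets):
    W1 = rng.standard_normal((width, d_in)) / np.sqrt(d_in)    # MatMul
    h1 = np.tanh(W1 @ x)                                        # Nonlin
    W2 = rng.standard_normal((width, width)) / np.sqrt(width)   # MatMul
    h2 = np.tanh(W2 @ h1)                                       # Nonlin
    v = rng.standard_normal(width) / np.sqrt(width)             # MatMul (readout)
    outputs.append(v @ h2)

print("skew/kurtosis (≈0 for Gaussian):",
      stats.skew(outputs), stats.kurtosis(outputs))
print("normality test p-value:", stats.normaltest(outputs).pvalue)
```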
May 2, 2019 9 tweets 5 min read
1/ Does batchnorm make the optimization landscape more smooth? arxiv.org/abs/1805.11604 says yes, but our new @iclr2019 paper arxiv.org/abs/1902.08129 shows BN causes gradient explosion in randomly initialized deep BN nets. Contradiction? We clarify below

2/ During a visit by @aleks_madry's students @ShibaniSan @tsiprasd @andrew_ilyas to MSR Redmond a few weeks ago, we figured out the apparent paradox.
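A small experiment in the spirit of the gradient-explosion claim (my sketch, not the paper's setup): stack Linear → BatchNorm → ReLU blocks at random initialization and watch the gradient norm at the first layer as depth grows; the architecture, widths, and scalar objective below are arbitrary choices.

```python
import torch
import torch.nn as nn

# Illustration (my sketch): gradient norm at the first layer of a randomly
# initialized fully-connected BatchNorm network, as a function of depth.

torch.manual_seed(0)

def grad_norm_at_input_layer(depth, width=256, batch=128):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU()]
    net = nn.Sequential(*layers)
    x = torch.randn(batch, width)
    loss = net(x).square().mean()        # arbitrary scalar objective
    loss.backward()
    return net[0].weight.grad.norm().item()

for depth in [2, 8, 32, 64]:
    print(depth, grad_norm_at_input_layer(depth))
```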