The books I mentioned on @xai spaces are "Linear Algebra Done Right" by Axler and "Naive Set Theory" by Halmos. Other math books that I really enjoyed over the years:
"Introduction to Algorithms" by Thomas H. Cormen & Charles E. Leiserson & Ronald L.… twitter.com/i/web/status/1…
I have 400+ books in my collection that these come from. I can dump them here if people are really interested.
2/ The idea is actually really simple: in a special parametrization introduced in arxiv.org/abs/2011.14522 called µP, narrow and wide neural networks share the same set of optimal hyperparameters. This works even as width -> ∞.
3/ The hyperparameters can include learning rate, learning rate schedule, initialization, parameter multipliers, and more, even individually for each parameter tensor. We empirically verified this on Transformers up to width 4096.
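A minimal sketch of the tune-narrow, transfer-wide workflow this thread describes. Everything below (the toy MLP, the synthetic data, the widths and learning rates) is illustrative and assumed, not from the thread; in a real experiment the model would be built in the µP parametrization of arxiv.org/abs/2011.14522, which is what makes the best learning rate found at small width carry over to large width.

```python
import torch
import torch.nn as nn

def make_model(width: int) -> nn.Sequential:
    # Placeholder architecture; a real transfer experiment would construct this
    # model in the µP parametrization (init scales and multipliers from the paper).
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(), nn.Linear(width, 1))

def val_loss_after_training(width: int, lr: float, steps: int = 200) -> float:
    # Train briefly on synthetic stand-in data and report the final loss.
    torch.manual_seed(0)
    model = make_model(width)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x, y = torch.randn(512, 32), torch.randn(512, 1)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    return nn.functional.mse_loss(model(x), y).item()

# 1) Sweep the learning rate on a narrow (cheap) proxy model.
lrs = [1e-3, 3e-3, 1e-2]
best_lr = min(lrs, key=lambda lr: val_loss_after_training(width=256, lr=lr))

# 2) Reuse that learning rate on the wide model; under µP it should stay near-optimal.
print(best_lr, val_loss_after_training(width=4096, lr=best_lr))
```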
1/ The histogram of eigenvals in a large random symmetric matrix ≈ a semicircle!! So sick! This "Semicircle Law" is essentially "Central Limit" for rand symmetric mats (even more elegant bc u knew what a semicircle is by 1st grade, but wtf was a Gaussian?). Let me tell ya why
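A quick numerical check (my own sketch, not from the thread): sample a large random symmetric (Wigner) matrix, compute its eigenvalues, and compare the histogram to the semicircle density sqrt(4 - x^2)/(2π) on [-2, 2].

```python
import numpy as np
import matplotlib.pyplot as plt

N = 2000
A = np.random.randn(N, N)
W = (A + A.T) / np.sqrt(2 * N)          # symmetric, off-diagonal entries with variance 1/N
eigs = np.linalg.eigvalsh(W)

x = np.linspace(-2, 2, 400)
plt.hist(eigs, bins=60, density=True, alpha=0.5, label="eigenvalues")
plt.plot(x, np.sqrt(4 - x**2) / (2 * np.pi), label="semicircle density")
plt.legend()
plt.show()
```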
2/ Recall the Fourier transform way of showing central limit theorem: For iid X1, ..., Xk ~ distribution P, the characteristic function of
(X1 + ... + Xk)/sqrt(k)
is
F(t/sqrt(k))^k,
where F is the characteristic function of P.
3/ By the properties of the characteristic function, we have
F(0) = 1, F'(0) = i*mean(P), F''(0) = -E[X^2] (= -var(P) when the mean is 0).
So if P has mean 0 and variance 1, then for large k, we can Taylor expand
F(t/sqrt(k)) ≈ 1 - t^2/(2k) + ...,
so F(t/sqrt(k))^k -> e^(-t^2/2), the characteristic function of a standard Gaussian.
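A small numeric check of that limit (my own sketch; the uniform distribution and the grid of t values are arbitrary choices): the empirical characteristic function of (X1 + ... + Xk)/sqrt(k) approaches e^(-t^2/2).

```python
import numpy as np

k, n = 200, 100_000
# iid samples with mean 0 and variance 1: uniform on [-sqrt(3), sqrt(3)]
X = np.random.uniform(-np.sqrt(3), np.sqrt(3), size=(n, k))
S = X.sum(axis=1) / np.sqrt(k)

t = np.linspace(-3, 3, 13)
emp_cf = np.exp(1j * np.outer(t, S)).mean(axis=1)   # empirical char. fn of S
gauss_cf = np.exp(-t**2 / 2)                        # char. fn of N(0, 1)
print(np.abs(emp_cf - gauss_cf).max())              # small for large k and n
```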
2/4 More precisely, classic theory goes like this: "when we train using xent loss, we get good pop loss by early stopping b4 valid loss 🡅. B/c xent is a good proxy for 0-1 loss, we expect good pop accuracy from this procedure." But here we got good acc w/o getting good pop loss.
3/4 Practically, this is no biggie if we can track some quality metric like accuracy. But what about e.g. language modeling, which only tracks loss/ppl? How do we know the NN doesn't learn great language long after val loss blows up? @srush_nlp @ilyasut @nlpnoah @colinraffel @kchonyc
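A toy numerical illustration of the gap (my own made-up numbers, not from the thread): a classifier can keep high accuracy while its cross-entropy blows up, because a handful of confidently wrong predictions dominate the loss but not the accuracy.

```python
import numpy as np

# Predicted probability assigned to the true label for 100 validation examples.
p_true = np.full(100, 0.9)
p_true[:5] = 1e-30           # 5 predictions that are wrong and extremely confident

accuracy = np.mean(p_true > 0.5)    # 0.95: still looks great
xent = np.mean(-np.log(p_true))     # ≈ 3.6: dominated by the 5 bad examples
print(accuracy, xent)
```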
1/ Why do wide, random neural networks form Gaussian processes, *regardless of architecture*? Let me give an overview in case you are too lazy to check out the paper arxiv.org/abs/1910.12478 or the code github.com/thegregyang/GP…. The proof has two parts…
2/ Part 1 shows that any architecture can be expressed as a principled combination of matrix multiplication and nonlinearity application; such a combination is called a *tensor program*. The image shows an example. Thread 👉
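A toy rendition of the flavor (my own sketch; the paper's formal definition has more instruction types and bookkeeping): an MLP forward pass written as a list of MatMul / Nonlin instructions over named vectors.

```python
import numpy as np

width, d_in = 1024, 10
rng = np.random.default_rng(0)

# Each instruction: (op, output name, matrix or elementwise fn, input name).
program = [
    ("MatMul", "h1", rng.standard_normal((width, d_in)) / np.sqrt(d_in), "x"),
    ("Nonlin", "z1", np.tanh, "h1"),
    ("MatMul", "h2", rng.standard_normal((width, width)) / np.sqrt(width), "z1"),
    ("Nonlin", "z2", np.tanh, "h2"),
]

env = {"x": rng.standard_normal(d_in)}
for op, out, arg, src in program:
    env[out] = arg @ env[src] if op == "MatMul" else arg(env[src])
print(env["z2"].shape)   # (1024,): the MLP forward pass, expressed as a tensor program
```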
3/ Part 2 shows that any such tensor program has a “mean field theory” or an “infinite width limit.” Using this, we can show that for any neural network, the kernel of last layer embeddings of inputs converges to a deterministic kernel. Thread 👉
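A quick check of that kernel convergence in the simplest case (my own sketch, using an arbitrary 2-hidden-layer ReLU MLP with 1/sqrt(fan_in) weight scaling): the normalized inner products of last-layer embeddings come out essentially the same for every random initialization once the width is large.

```python
import numpy as np

def embed(X, width, rng):
    # Last-layer embeddings of a random 2-hidden-layer ReLU MLP,
    # with 1/sqrt(fan_in) weight scaling. X has shape (d_in, num_inputs).
    d_in = X.shape[0]
    W1 = rng.standard_normal((width, d_in)) / np.sqrt(d_in)
    W2 = rng.standard_normal((width, width)) / np.sqrt(width)
    return np.maximum(W2 @ np.maximum(W1 @ X, 0), 0)

width = 4096
X = np.stack([np.ones(3), np.array([1.0, -2.0, 0.5])], axis=1)   # two inputs
for seed in range(3):
    H = embed(X, width, np.random.default_rng(seed))
    # Empirical kernel of the two embeddings: nearly identical across seeds,
    # i.e. it converges to a deterministic 2x2 kernel as width -> ∞.
    print(H.T @ H / width)
```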
1/ Does batchnorm make the optimization landscape more smooth? arxiv.org/abs/1805.11604 says yes, but our new @iclr2019 paper arxiv.org/abs/1902.08129 shows BN causes grad explosion in randomly initialized deep BN nets. Contradiction? We clarify below.
3/ The former shows grad wrt weight & preactivation *right before* BN is smaller than without BN, if the batch variance is large (which they empirically find to be true in training). But this result says nothing about gradients of lower layers, or stacking of BNs.
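A quick way to see the gradient explosion at initialization (my own sketch; the depth, width, batch size, and loss are arbitrary choices): stack Linear → BatchNorm → ReLU blocks, backprop a simple loss through random data, and look at how the weight-gradient norms grow toward the input side.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width, batch = 50, 256, 128

layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU()]
net = nn.Sequential(*layers)

x = torch.randn(batch, width)
net(x).pow(2).mean().backward()

# Gradient norms of the Linear weights, from input side to output side:
# at random init they grow roughly geometrically toward the early layers.
for i, m in enumerate(net):
    if isinstance(m, nn.Linear):
        print(i // 3, m.weight.grad.norm().item())
```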