Greg Yang
Jul 14 · 2 tweets · 1 min read
Since folks are asking:

The books I mentioned on @xai spaces are "Linear Algebra Done Right" by Axler and "Naive Set Theory" by Halmos. Other math books that I really enjoyed over the years:

"Introduction to Algorithms" by Thomas H. Cormen & Charles E. Leiserson & Ronald L.… twitter.com/i/web/status/1…
I have 400+ books in my collection that these come from. I can dump them here if people are really interested.

More from @TheGregYang

Mar 8, 2022
1/ You can't train GPT-3 on a single GPU, much less tune its hyperparameters (HPs).

But what if I told you…

…you *can* tune its HPs on a single GPU thanks to new theoretical advances?

paper arxiv.org/abs/2203.03466
code github.com/microsoft/mup
blog microsoft.com/en-us/research…
2/ The idea is actually really simple: in a special parametrization called µP, introduced in arxiv.org/abs/2011.14522, narrow and wide neural networks share the same set of optimal hyperparameters. This holds even as width → ∞.
3/ The hyperparameters can include the learning rate, learning rate schedule, initialization, parameter multipliers, and more, even individually for each parameter tensor. We empirically verified this on Transformers up to width 4096.
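A minimal sketch of how the µTransfer workflow might look with the github.com/microsoft/mup package; the names MuReadout, set_base_shapes, and MuAdam are taken from the repo's README as I recall it, and the widths below are arbitrary illustrative choices, not the paper's settings.

```python
# Hedged sketch of the muTransfer workflow (API names per the mup README;
# details may differ from the current version of the package).
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

def make_mlp(width, d_in=32, d_out=10):
    # The output layer is replaced by MuReadout so muP can rescale it with width.
    return nn.Sequential(
        nn.Linear(d_in, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        MuReadout(width, d_out),
    )

# The base and delta models only record how tensor shapes scale with width;
# they are never trained.
base, delta, model = make_mlp(width=64), make_mlp(width=128), make_mlp(width=4096)
set_base_shapes(model, base, delta=delta)

# Hyperparameters tuned on a narrow proxy model can then be reused for the
# wide model, using mup's optimizer wrappers.
opt = MuAdam(model.parameters(), lr=1e-3)
```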
Read 15 tweets
Aug 6, 2020
1/ The histogram of eigenvalues of a large random symmetric matrix ≈ a semicircle!! So sick! This "Semicircle Law" is essentially the "Central Limit Theorem" for random symmetric matrices (even more elegant because you knew what a semicircle is by 1st grade, but wtf was a Gaussian?). Let me tell ya why
2/ Recall the Fourier transform way of showing central limit theorem: For iid X1, ..., Xk ~ distribution P, the characteristic function of
(X1 + ... + Xk)/sqrt(k)
is
F(t/sqrt(k))^k,
where F is the characteristic function of P.
3/ By the properties of the characteristic function, we have
F(0) = 1, F'(0) = i·mean(P), F''(0) = -var(P) (for mean-0 P).
So if P has mean 0 and variance 1, then for large k we can Taylor expand
F(t/sqrt(k)) ≈ 1 - t^2/(2k) + ...,
so F(t/sqrt(k))^k → e^(-t^2/2), the characteristic function of a standard Gaussian.
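A quick numerical check of tweet 1/ (mine, not from the thread; the size n = 2000 and the GOE-style scaling are illustrative): histogram the eigenvalues of a scaled random symmetric Gaussian matrix and compare against the semicircle density (1/(2π))·sqrt(4 − x²) on [−2, 2].

```python
# Numerical check of the semicircle law (illustrative, not from the thread).
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Symmetric Gaussian matrix scaled so its spectrum converges to [-2, 2].
a = rng.standard_normal((n, n))
w = (a + a.T) / np.sqrt(2 * n)
eigs = np.linalg.eigvalsh(w)

# Compare the empirical eigenvalue histogram to the semicircle density.
bins = np.linspace(-2.2, 2.2, 45)
hist, edges = np.histogram(eigs, bins=bins, density=True)
centers = (edges[:-1] + edges[1:]) / 2
semicircle = np.sqrt(np.maximum(4 - centers**2, 0)) / (2 * np.pi)
print(np.abs(hist - semicircle).max())  # approaches 0 as n grows
```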
Read 10 tweets
May 13, 2020
1/4 WTF guys I think I broke ML: loss & acc 🡅 together! reproduced here github.com/thegregyang/Lo…. Somehow good accuracy is achieved *in spite of* classic generalization theory (wrt the loss) - What's goin on? @roydanroy @prfsanjeevarora @ShamKakade6 @BachFrancis @SebastienBubeck
2/4 More precisely, classic theory goes like this: "when we train using xent loss, we get good population loss by early stopping before the validation loss 🡅. Because xent is a good proxy for 0-1 loss, we expect good population accuracy from this procedure." But here we got good acc w/o getting good pop loss
3/4 Practically, this is no biggie if we can track some quality metric like accuracy. But what about e.g. language modeling, where we only track loss/ppl? How do we know the NN doesn't learn great language long after val loss blows up? @srush_nlp @ilyasut @nlpnoah @colinraffel @kchonyc
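A toy numerical illustration of how the two metrics can decouple (mine, not the linked repo): accuracy only looks at the argmax, while cross-entropy is an average that a few confidently wrong examples can dominate, so the two can rise together.

```python
# Toy example: accuracy improves while average cross-entropy gets worse,
# because a handful of mistakes become extremely confident.
import numpy as np

def xent(logits, labels):
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def acc(logits, labels):
    return (logits.argmax(axis=1) == labels).mean()

labels = np.zeros(100, dtype=int)  # true class is 0 for every example

# "Earlier" model: modest margins, 90% accurate.
early = np.zeros((100, 2)); early[:90, 0] = 1.0; early[90:, 1] = 1.0

# "Later" model: 95% accurate, but the 5 mistakes are wildly overconfident.
late = np.zeros((100, 2)); late[:95, 0] = 5.0; late[95:, 1] = 20.0

print(acc(early, labels), xent(early, labels))  # 0.90, ~0.41
print(acc(late, labels), xent(late, labels))    # 0.95, ~1.01
```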
Read 6 tweets
Dec 5, 2019
1/ Why do wide, random neural networks form Gaussian processes, *regardless of architecture*? Let me give an overview in case you are too lazy to check out the paper arxiv.org/abs/1910.12478 or the code github.com/thegregyang/GP…. The proof has two parts…
2/ Part 1 shows that any architecture can be expressed as a principled combination of matrix multiplication and nonlinearity application; such a combination is called a *tensor program*. The image shows an example.
3/ Part 2 shows that any such tensor program has a “mean field theory” or an “infinite width limit.” Using this, we can show that for any neural network, the kernel of last-layer embeddings of inputs converges to a deterministic kernel.
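A toy illustration of the Part 2 claim (my sketch, not the paper's code): for a one-hidden-layer ReLU net with random Gaussian weights, the kernel between the hidden embeddings of two fixed inputs fluctuates less and less across random initializations as the width grows.

```python
# Toy check: the hidden-embedding kernel of a random 1-hidden-layer ReLU net
# concentrates around a deterministic value as the width grows.
import numpy as np

rng = np.random.default_rng(0)
d = 10
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)

def kernel_std(width, trials=200):
    """Std over random inits of <relu(W x1), relu(W x2)> / width."""
    vals = []
    for _ in range(trials):
        w = rng.standard_normal((width, d)) / np.sqrt(d)  # variance-1/d entries
        h1, h2 = np.maximum(w @ x1, 0.0), np.maximum(w @ x2, 0.0)
        vals.append(h1 @ h2 / width)
    return np.std(vals)

for width in [64, 256, 1024, 4096]:
    print(width, kernel_std(width))  # shrinks roughly like 1/sqrt(width)
```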
Read 26 tweets
May 2, 2019
1/ Does batchnorm make the optimization landscape more smooth? arxiv.org/abs/1805.11604 says yes, but our new @iclr2019 paper arxiv.org/abs/1902.08129 shows that BN causes gradient explosion in randomly initialized deep BN nets. Contradiction? We clarify below
2/ During a visit by @aleks_madry's students @ShibaniSan @tsiprasd @andrew_ilyas to MSR Redmond few weeks ago, we figured out the apparent paradox.
3/ The former shows that the gradient wrt the weights & preactivations *right before* BN is smaller than without BN when the batch variance is large (which they empirically find to be true during training). But that result says nothing about the gradients of lower layers, or about stacking BNs.
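A rough sketch of the phenomenon the second paper describes (mine, not the paper's experiments): measure the input-gradient norm of a freshly initialized Linear + BatchNorm + ReLU stack as the depth grows.

```python
# Rough sketch (not the paper's code): gradient norm at the input of a randomly
# initialized Linear+BatchNorm+ReLU stack, as a function of depth.
import torch
import torch.nn as nn

def input_grad_norm(depth, width=256, batch=64):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU()]
    net = nn.Sequential(*layers)  # BN runs in training mode, using batch stats
    x = torch.randn(batch, width, requires_grad=True)
    net(x).sum().backward()
    return x.grad.norm().item()

torch.manual_seed(0)
for depth in [2, 10, 50, 100]:
    print(depth, input_grad_norm(depth))  # tends to grow rapidly with depth
```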
Read 9 tweets
