The books I mentioned on @xai spaces are "Linear Algebra Done Right" by Axler and "Naive Set Theory" by Halmos. Other math books that I really enjoyed over the years:
"Introduction to Algorithms" by Thomas H. Cormen & Charles E. Leiserson & Ronald L.… twitter.com/i/web/status/1…
I have 400+ books in my collection that these come from. I can dump them here if people are really interested.
2/ The idea is actually really simple: in a special parametrization introduced in arxiv.org/abs/2011.14522 called µP, narrow and wide neural networks share the same set of optimal hyperparameters. This works even as width -> ∞.
3/ The hyperparameters can include learning rate, learning rate schedule, initialization, parameter multipliers, and more, even individually for each parameter tensor. We empirically verified this on Transformers up to width 4096.
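A minimal sketch of the tune-narrow, transfer-wide workflow this thread describes. Everything below (the toy MLP, the synthetic data, the widths and learning rates) is illustrative and assumed, not from the thread; in a real experiment the model would be built in the µP parametrization of arxiv.org/abs/2011.14522, which is what makes the best learning rate found at small width carry over to large width.

```python
import torch
import torch.nn as nn

def make_model(width: int) -> nn.Sequential:
    # Placeholder architecture; a real transfer experiment would construct this
    # model in the µP parametrization (init scales and multipliers from the paper).
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(), nn.Linear(width, 1))

def val_loss_after_training(width: int, lr: float, steps: int = 200) -> float:
    # Train briefly on synthetic stand-in data and report the final loss.
    torch.manual_seed(0)
    model = make_model(width)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x, y = torch.randn(512, 32), torch.randn(512, 1)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    return nn.functional.mse_loss(model(x), y).item()

# 1) Sweep the learning rate on a narrow (cheap) proxy model.
lrs = [1e-3, 3e-3, 1e-2]
best_lr = min(lrs, key=lambda lr: val_loss_after_training(width=256, lr=lr))

# 2) Reuse that learning rate on the wide model; under µP it should stay near-optimal.
print(best_lr, val_loss_after_training(width=4096, lr=best_lr))
```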
1/ The histogram of eigenvals in a large random symmetric matrix ≈ a semicircle!! So sick! This "Semicircle Law" is essentially "Central Limit" for rand symmetric mats (even more elegant bc u knew what a semicircle is by 1st grade, but wtf was a Gaussian?). Let me tell ya why
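A quick numerical check (my own sketch, not from the thread): sample a large random symmetric (Wigner) matrix, compute its eigenvalues, and compare the histogram to the semicircle density sqrt(4 - x^2)/(2π) on [-2, 2].

```python
import numpy as np
import matplotlib.pyplot as plt

N = 2000
A = np.random.randn(N, N)
W = (A + A.T) / np.sqrt(2 * N)          # symmetric, off-diagonal entries with variance 1/N
eigs = np.linalg.eigvalsh(W)

x = np.linspace(-2, 2, 400)
plt.hist(eigs, bins=60, density=True, alpha=0.5, label="eigenvalues")
plt.plot(x, np.sqrt(4 - x**2) / (2 * np.pi), label="semicircle density")
plt.legend()
plt.show()
```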
2/ Recall the Fourier transform way of showing central limit theorem: For iid X1, ..., Xk ~ distribution P, the characteristic function of
(X1 + ... + Xk)/sqrt(k)
is
F(t/sqrt(k))^k,
where F is the characteristic function of P.
3/ By the properties of the characteristic function, we have
F(0) = 1, F'(0) = i*mean(P), F''(0) = -E[X^2] (= -var(P) when the mean is 0).
So if P has mean 0 and variance 1, then for large k, we can Taylor expand
F(t/sqrt(k)) ≈ 1 - t^2/(2k) + ...,
so F(t/sqrt(k))^k -> e^(-t^2/2), the characteristic function of a standard Gaussian.
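A small numeric check of that limit (my own sketch; the uniform distribution and the grid of t values are arbitrary choices): the empirical characteristic function of (X1 + ... + Xk)/sqrt(k) approaches e^(-t^2/2).

```python
import numpy as np

k, n = 200, 100_000
# iid samples with mean 0 and variance 1: uniform on [-sqrt(3), sqrt(3)]
X = np.random.uniform(-np.sqrt(3), np.sqrt(3), size=(n, k))
S = X.sum(axis=1) / np.sqrt(k)

t = np.linspace(-3, 3, 13)
emp_cf = np.exp(1j * np.outer(t, S)).mean(axis=1)   # empirical char. fn of S
gauss_cf = np.exp(-t**2 / 2)                        # char. fn of N(0, 1)
print(np.abs(emp_cf - gauss_cf).max())              # small for large k and n
```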
2/4 More precisely, classic theory goes like this: "when we train using xent loss, we get good pop loss by early stopping b4 valid loss 🡅. B/c xent is a good proxy for 0-1 loss, we expect good pop accuracy from this procedure." But here we got good acc w/o getting good pop loss.
3/4 Practically, this is no biggie if we can track some quality metric like accuracy. But what about e.g. language modeling, which only tracks loss/ppl? How do we know the NN doesn't learn great language long after val loss blows up? @srush_nlp @ilyasut @nlpnoah @colinraffel @kchonyc
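A toy numerical illustration of the gap (my own made-up numbers, not from the thread): a classifier can keep high accuracy while its cross-entropy blows up, because a handful of confidently wrong predictions dominate the loss but not the accuracy.

```python
import numpy as np

# Predicted probability assigned to the true label for 100 validation examples.
p_true = np.full(100, 0.9)
p_true[:5] = 1e-30           # 5 predictions that are wrong and extremely confident

accuracy = np.mean(p_true > 0.5)    # 0.95: still looks great
xent = np.mean(-np.log(p_true))     # ≈ 3.6: dominated by the 5 bad examples
print(accuracy, xent)
```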
1/ Why do wide, random neural networks form Gaussian processes, *regardless of architecture*? Let me give an overview in case you are too lazy to check out the paper arxiv.org/abs/1910.12478 or the code github.com/thegregyang/GP…. The proof has two parts…
2/ Part 1 shows that any architecture can be expressed as a principled combination of matrix multiplication and nonlinearity application; such a combination is called a *tensor program*. The image shows an example. Thread 👉
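A toy rendition of the flavor (my own sketch; the paper's formal definition has more instruction types and bookkeeping): an MLP forward pass written as a list of MatMul / Nonlin instructions over named vectors.

```python
import numpy as np

width, d_in = 1024, 10
rng = np.random.default_rng(0)

# Each instruction: (op, output name, matrix or elementwise fn, input name).
program = [
    ("MatMul", "h1", rng.standard_normal((width, d_in)) / np.sqrt(d_in), "x"),
    ("Nonlin", "z1", np.tanh, "h1"),
    ("MatMul", "h2", rng.standard_normal((width, width)) / np.sqrt(width), "z1"),
    ("Nonlin", "z2", np.tanh, "h2"),
]

env = {"x": rng.standard_normal(d_in)}
for op, out, arg, src in program:
    env[out] = arg @ env[src] if op == "MatMul" else arg(env[src])
print(env["z2"].shape)   # (1024,): the MLP forward pass, expressed as a tensor program
```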
3/ Part 2 shows that any such tensor program has a “mean field theory” or an “infinite width limit.” Using this, we can show that for any neural network, the kernel of last layer embeddings of inputs converges to a deterministic kernel. Thread 👉
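A quick check of that kernel convergence in the simplest case (my own sketch, using an arbitrary 2-hidden-layer ReLU MLP with 1/sqrt(fan_in) weight scaling): the normalized inner products of last-layer embeddings come out essentially the same for every random initialization once the width is large.

```python
import numpy as np

def embed(X, width, rng):
    # Last-layer embeddings of a random 2-hidden-layer ReLU MLP,
    # with 1/sqrt(fan_in) weight scaling. X has shape (d_in, num_inputs).
    d_in = X.shape[0]
    W1 = rng.standard_normal((width, d_in)) / np.sqrt(d_in)
    W2 = rng.standard_normal((width, width)) / np.sqrt(width)
    return np.maximum(W2 @ np.maximum(W1 @ X, 0), 0)

width = 4096
X = np.stack([np.ones(3), np.array([1.0, -2.0, 0.5])], axis=1)   # two inputs
for seed in range(3):
    H = embed(X, width, np.random.default_rng(seed))
    # Empirical kernel of the two embeddings: nearly identical across seeds,
    # i.e. it converges to a deterministic 2x2 kernel as width -> ∞.
    print(H.T @ H / width)
```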
1/ Does batchnorm make the optimization landscape more smooth? arxiv.org/abs/1805.11604 says yes, but our new @iclr2019 paper arxiv.org/abs/1902.08129 shows BN causes grad explosion in randomly initialized deep BN nets. Contradiction? We clarify below.
3/ The former shows grad wrt weight & preactivation *right before* BN is smaller than without BN, if the batch variance is large (which they empirically find to be true in training). But this result says nothing about gradients of lower layers, or stacking of BNs.
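A quick way to see the gradient explosion at initialization (my own sketch; the depth, width, batch size, and loss are arbitrary choices): stack Linear → BatchNorm → ReLU blocks, backprop a simple loss through random data, and look at how the weight-gradient norms grow toward the input side.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width, batch = 50, 256, 128

layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU()]
net = nn.Sequential(*layers)

x = torch.randn(batch, width)
net(x).pow(2).mean().backward()

# Gradient norms of the Linear weights, from input side to output side:
# at random init they grow roughly geometrically toward the early layers.
for i, m in enumerate(net):
    if isinstance(m, nn.Linear):
        print(i // 3, m.weight.grad.norm().item())
```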