Cofounder https://t.co/SpHbO7FZNV. Morgan Prize Honorable Mention 2018. Developing the theory of #TensorPrograms and the practice of scaling #neuralnetworks.
Oct 4, 2023 • 25 tweets • 7 min read
Nontrivial ∞-width neural nets are either kernel machines or feature learners. The latter's scaling makes optimal hyperparams invariant to width.
What if depth → ∞ as well?
🆕 Feature diversity is key; maxed out by abs (not relu); gives invariance to depth!
But GPT's architecture is flawed in this respect 🧵
What “feature diversity” means here is that neighboring layers should NOT do almost the same thing – that’s a waste of capacity! (A crude probe of this is sketched below.)
New paper:
I cannot show enough appreciation to my coauthors @dingli_yu @chenzhucs @hayou_soufiane !!!
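One crude way to operationalize the “feature diversity” notion above – a minimal sketch under assumptions that are mine, not the paper's: a plain residual MLP at random init, 1/√depth branch scaling, and a correlation probe between successive residual-branch outputs.

```python
# Crude, illustrative probe of "feature diversity" at random init (assumption:
# NOT the paper's metric, which concerns trained nets in the infinite-depth limit).
# Idea: in a residual MLP, if neighboring layers "do almost the same thing",
# successive residual-branch outputs should be highly correlated.
import numpy as np

def neighbor_correlation(depth=64, width=256, phi=np.abs, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)                       # random input features
    corrs, prev = [], None
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)  # 1/sqrt(fan_in) init
        branch = phi(W @ h) / np.sqrt(depth)              # 1/sqrt(depth) branch scaling (assumed)
        if prev is not None:
            # center first: relu/abs outputs are nonnegative, so raw cosine
            # similarity would be dominated by the shared positive mean
            a, b = prev - prev.mean(), branch - branch.mean()
            corrs.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        h, prev = h + branch, branch
    return float(np.mean(corrs))

print("abs :", neighbor_correlation(phi=np.abs))
print("relu:", neighbor_correlation(phi=lambda z: np.maximum(z, 0.0)))
```

This only makes the definition concrete; the paper's claim is about what happens during training in the infinite-depth limit, so the numbers at random init shouldn't be over-read.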
1/ The histogram of eigenvals in a large random symmetric matrix ≈ a semicircle!! So sick! This "Semicircle Law" is essentially "Central Limit" for rand symmetric mats (even more elegant bc u knew what a semicircle is by 1st grade, but wtf was a Gaussian?). Let me tell ya why
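A minimal numerical check of this claim, assuming iid standard-normal entries, symmetrized and scaled so each entry has variance 1/n:

```python
# Empirical check: eigenvalue histogram of a large random symmetric matrix
# vs. the semicircle density rho(x) = sqrt(4 - x^2) / (2*pi) on [-2, 2].
import numpy as np
import matplotlib.pyplot as plt

n = 2000
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
S = (A + A.T) / np.sqrt(2 * n)          # symmetric, off-diagonal entries ~ N(0, 1/n)
eigs = np.linalg.eigvalsh(S)

xs = np.linspace(-2, 2, 400)
plt.hist(eigs, bins=80, density=True, alpha=0.6, label="eigenvalues")
plt.plot(xs, np.sqrt(4 - xs**2) / (2 * np.pi), label="semicircle law")
plt.legend()
plt.show()
```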
2/ Recall the Fourier-transform way of proving the central limit theorem: for iid X1, ..., Xk ~ distribution P, the characteristic function of
(X1 + ... + Xk)/sqrt(k)
is
F(t/sqrt(k))^k,
where F is the characteristic function of P.
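Completing the sketch (standard calculation, assuming each Xi has mean 0 and variance 1): Taylor-expand F around 0 and take k → ∞:

```latex
% CLT via characteristic functions, assuming E[X] = 0 and E[X^2] = 1
F(s) = \mathbb{E}\left[e^{isX}\right] = 1 - \frac{s^2}{2} + o(s^2)
\quad\Longrightarrow\quad
F\!\left(\frac{t}{\sqrt{k}}\right)^{\!k}
  = \left(1 - \frac{t^2}{2k} + o\!\left(\frac{1}{k}\right)\right)^{\!k}
  \;\longrightarrow\; e^{-t^2/2} \quad (k \to \infty)
```

and e^{-t^2/2} is the characteristic function of N(0,1), so the normalized sum converges in distribution to a standard Gaussian.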
May 13, 2020 • 6 tweets • 5 min read
1/4 WTF guys I think I broke ML: loss & acc 🡅 together! Reproduced here github.com/thegregyang/Lo…. Somehow good accuracy is achieved *in spite of* classic generalizn theory (wrt the loss) - what's goin on? @roydanroy @prfsanjeevarora @ShamKakade6 @BachFrancis @SebastienBubeck

2/4 More precisely, classic theory goes like this: "when we train using xent loss, we get good pop loss by early stopping b4 valid loss 🡅. B/c xent is a good proxy for 0-1 loss, we expect good pop accuracy from this procedure." But here we got good acc w/o getting good pop loss
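A toy numerical illustration of how accuracy and cross-entropy can rise together (made-up numbers, nothing taken from the repo above): the handful of remaining mistakes become so overconfident that they dominate the average xent even as the error rate keeps dropping.

```python
# Toy binary-classification example (assumed numbers, purely illustrative):
# accuracy improves while average cross-entropy gets worse, because the few
# mistakes that remain become extremely overconfident.
import numpy as np

def xent_and_acc(p_true_on_right, p_true_on_wrong, n_right, n_wrong):
    """p_* are the model's probabilities on the TRUE class (p < 0.5 means misclassified)."""
    probs = np.concatenate([
        np.full(n_right, p_true_on_right),   # correctly classified points
        np.full(n_wrong, p_true_on_wrong),   # misclassified points
    ])
    acc = n_right / (n_right + n_wrong)
    xent = -np.mean(np.log(probs))
    return acc, xent

# "early" model: 90% accuracy, mildly confident everywhere
print(xent_and_acc(0.80, 0.40, n_right=90, n_wrong=10))   # acc = 0.90, xent ≈ 0.29
# "late" model: 95% accuracy, but the 5 remaining mistakes are very overconfident
print(xent_and_acc(0.99, 1e-4, n_right=95, n_wrong=5))    # acc = 0.95, xent ≈ 0.47
```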
Dec 5, 2019 • 26 tweets • 10 min read
1/ Why do wide, random neural networks form Gaussian processes, *regardless of architecture*? Let me give an overview in case you are too lazy to check out the paper arxiv.org/abs/1910.12478 or the code github.com/thegregyang/GP…. The proof has two parts…

2/ Part 1 shows that any architecture can be expressed as a principled combination of matrix multiplication and nonlinearity application; such a combination is called a *tensor program*. The image shows an example. Thread 👉
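A quick empirical illustration of the claim in 1/, as a minimal sketch assuming a toy 1-hidden-layer MLP with 1/√fan_in initialization (the paper's point is that the same conclusion holds for any architecture expressible as a tensor program):

```python
# Empirical illustration: over many random inits, a wide net's output at a fixed
# input is approximately Gaussian. Toy 1-hidden-layer MLP (assumed setup).
import numpy as np

def random_mlp_output(x, width, rng):
    d = x.shape[0]
    W1 = rng.standard_normal((width, d)) / np.sqrt(d)     # 1/sqrt(fan_in) init
    w2 = rng.standard_normal(width) / np.sqrt(width)
    return w2 @ np.tanh(W1 @ x)                            # scalar readout

rng = np.random.default_rng(0)
x = rng.standard_normal(10)                                # one fixed input
outs = np.array([random_mlp_output(x, width=2048, rng=rng) for _ in range(10000)])

# For a Gaussian, skewness ≈ 0 and excess kurtosis ≈ 0.
z = (outs - outs.mean()) / outs.std()
print("skewness       :", np.mean(z ** 3))
print("excess kurtosis:", np.mean(z ** 4) - 3)
```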