Sr Principal Research Manager at Microsoft Research. Broadly interested in decision making under uncertainty (ML, convex optimization, online algorithms).
9 Jun
We may have found a solid hypothesis to explain why extreme overparametrization is so helpful in #DeepLearning, especially if one is concerned about adversarial robustness. arxiv.org/abs/2105.12806 1/7
With my student extraordinaire Mark Sellke @geoishard, we prove a vast generalization of our conjectured law of robustness from last summer, that there is an inherent tradeoff between # neurons and smoothness of the network (see *pre-solution* video). 2/7
If you squint hard enough (e.g., like a physicist), our new universal law of robustness even makes concrete predictions for real data. For example, we predict that on ImageNet you need at least 100 billion parameters (i.e., GPT-3-like scale) to possibly attain good robustness guarantees. 3/7
26 Jan
Interesting thread! To me the "reason" for CLT is simply high-dim geometry. Consider the unit ball in dim n+1 & slice it at distance x from the origin to get a dim n ball of radius (1-x^2)^{1/2}. The volume of the slice is proportional to (1-x^2)^{n/2} ~ exp(-(1/2) n x^2). Tada, the Gaussian!!
In other words, for a random point in the ball, the marginal in any direction (rescaled by sqrt(n)) will converge to a Gaussian (one line calc!). Maybe this doesn't look like your usual CLT. But consider Bernoulli CLT: 1/sqrt(n) sum_i X_i = <X, u>, with X uniformly random in {-1,1}^n & u=1/sqrt(n)*(1,..,1).
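The ball claim can be checked numerically. The sketch below is only an illustration, not part of the original thread: the function name `ball_marginal`, the dimension, the sample count, and the choice of kurtosis as a Gaussianity check are all assumptions. It samples uniform points in the unit ball of dim n+1, takes one coordinate, rescales it by sqrt(n), and looks at the variance and kurtosis, which should be close to 1 and 3 for a standard Gaussian.

```python
import numpy as np

def ball_marginal(n, samples, rng):
    """One coordinate of uniform points in the unit ball of dim n+1, rescaled by sqrt(n).

    Hypothetical helper for illustration only.
    """
    d = n + 1
    # A normalized Gaussian vector is uniform on the unit sphere.
    g = rng.standard_normal((samples, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    # The radius of a uniform point in the ball has density proportional to r^(d-1).
    r = rng.random(samples) ** (1.0 / d)
    # Marginal in the first coordinate direction, rescaled by sqrt(n).
    return np.sqrt(n) * r * g[:, 0]

rng = np.random.default_rng(0)
z = ball_marginal(n=200, samples=20_000, rng=rng)

var = z.var()
kurt = np.mean((z - z.mean()) ** 4) / var ** 2  # equals 3 for an exact Gaussian
print(f"variance = {var:.3f}, kurtosis = {kurt:.3f}")
```

By rotation invariance of the ball, the first coordinate stands in for the marginal in any direction.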
That is, the Bernoulli CLT is just about the marginal in the direction u of a random point in the hypercube! So instead of the geometry of the ball as in the first tweet, we need to consider the geometry of the cube. But it turns out that all geometries are roughly the same!
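The same kind of check works for the cube. Again a minimal sketch under assumed names and sizes (`cube_marginal`, n=256, 10,000 samples are illustrative choices): draw X uniformly from {-1,1}^n, project onto u = 1/sqrt(n)*(1,..,1), and compare the variance and kurtosis of <X, u> with the Gaussian values 1 and 3.

```python
import numpy as np

def cube_marginal(n, samples, rng):
    """Marginal <X, u> of uniform points X in {-1,1}^n along u = (1,..,1)/sqrt(n).

    Hypothetical helper for illustration only.
    """
    # Each coordinate is an independent fair +/-1 sign.
    x = rng.choice([-1.0, 1.0], size=(samples, n))
    u = np.ones(n) / np.sqrt(n)
    # <X, u> is exactly 1/sqrt(n) * sum_i X_i.
    return x @ u

rng = np.random.default_rng(0)
s = cube_marginal(n=256, samples=10_000, rng=rng)

var = s.var()
kurt = np.mean((s - s.mean()) ** 4) / var ** 2  # equals 3 for an exact Gaussian
print(f"variance = {var:.3f}, kurtosis = {kurt:.3f}")
```

Note that, unlike the ball, the cube is not rotation invariant, so this only probes the specific direction u from the Bernoulli CLT.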