We may have found a solid hypothesis to explain why extreme overparametrization is so helpful in #DeepLearning, especially if one is concerned about adversarial robustness. arxiv.org/abs/2105.12806 1/7
With my student extraordinaire Mark Sellke @geoishard, we prove a vast generalization of the law of robustness we conjectured last summer: there is an inherent tradeoff between the number of neurons and the smoothness of the network (see the *pre-solution* video). 2/7
If you squint hard enough (e.g., like a physicist), our new universal law of robustness even makes concrete predictions for real data. For example, we predict that on ImageNet you need at least 100 billion parameters (i.e., GPT-3-like scale) to have a chance at good robustness guarantees. 3/7
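To see where a figure like 10^11 could come from, here is a back-of-envelope sketch; the values of n and d below are illustrative round numbers I am assuming, not the paper's exact accounting:

```python
# Back-of-envelope check of the overparametrization prediction.
# Law of robustness (informally): smooth interpolation needs p on the
# order of n*d, rather than the classical p > n.
# Illustrative round numbers -- assumptions, not the paper's figures:
n = 10**7   # assumed order of the ImageNet training set size
d = 10**4   # assumed effective data dimension

p_min = n * d  # predicted minimal parameter count for smooth interpolation
print(f"p should be at least ~{p_min:.0e} parameters")  # ~1e+11, GPT-3-like scale
```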
So what does the law actually say? Classically, interpolating n points with a p-parameter function class merely requires p > n. But what if you want to interpolate *smoothly*? We show that this simple extra robustness constraint forces overparametrization, by a factor of d (the data dimension)! 4/7
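The tradeoff above can be written as a single inequality; this is my informal paraphrase of the law from arxiv.org/abs/2105.12806, with constants and logarithmic factors suppressed:

```latex
% Universal law of robustness (informal paraphrase):
% any f in a smoothly parametrized p-parameter class that fits
% n noisy d-dimensional data points below the noise level satisfies
\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{nd}{p}},
% so achieving an O(1) Lipschitz constant forces p \gtrsim nd.
```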
This result holds for a broad class of data: 1) the covariate x should be a mixture of "isoperimetric" measures (e.g., Gaussians), but the key here is that we can allow *many* mixture components (up to n/log(n)); 2) the target labels y should have some independent noise. Reasonable?! 5/7
The real surprise to me is how general the phenomenon is. Back last summer we struggled to prove it for 2-layer neural nets, but in the end the law applies to *any* (smoothly parametrized) function class. The key to unlocking the problem was adopting a probabilist's perspective! 6/7
As always, we welcome comments! While the law itself is a rather simple mathematical statement, its interpretation is of course fairly speculative. In fact, you can check out the video by @EldanRonen explaining the paper & giving us a hard time on the speculative part 🤣 7/7
Interesting thread! To me the "reason" for the CLT is simply high-dimensional geometry. Consider the unit ball in dimension n+1 and slice it at distance x from the origin to get an n-dimensional ball of radius (1-x^2)^{1/2}. The volume of the slice is proportional to (1-x^2)^{n/2} ~ exp(-(1/2) n x^2). Tada, the Gaussian!!
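A quick numerical sanity check of the approximation (1-x^2)^{n/2} ~ exp(-n x^2/2); the dimension and slice location below are arbitrary illustrative choices:

```python
import math

# Slice the unit ball in dimension n+1 at distance x from the origin:
# the slice volume is proportional to (1 - x^2)^(n/2),
# which for large n is close to the Gaussian profile exp(-n*x^2/2).
n, x = 1000, 0.05  # arbitrary illustrative values

slice_term = (1 - x**2) ** (n / 2)
gaussian_term = math.exp(-n * x**2 / 2)
print(slice_term, gaussian_term)  # the two agree to within a fraction of a percent
```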
In other words, for a random point in the ball, the marginal in any direction converges to a Gaussian (a one-line calculation!). Maybe this doesn't look like your usual CLT. But consider the Bernoulli CLT: 1/sqrt(n) sum_i X_i = <X, u>, with X uniformly random in {-1,1}^n and u = 1/sqrt(n)*(1,...,1).
That is, the Bernoulli CLT is just about the marginal in the direction u of a random point in the hypercube! So instead of the geometry of the ball as in the first tweet, we need to consider the geometry of the cube. But it turns out that all these geometries are roughly the same!
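The cube version of the argument is easy to simulate: draw uniform sign vectors, project them onto u = (1,...,1)/sqrt(n), and check that the marginal looks standard Gaussian. A minimal sketch (the dimension and sample count are arbitrary choices of mine):

```python
import random
import statistics

random.seed(0)
n, samples = 400, 2000  # arbitrary illustrative sizes

# Marginal of a uniform point of {-1,1}^n in the direction
# u = (1,...,1)/sqrt(n): this is exactly 1/sqrt(n) * sum_i X_i,
# i.e. the classical Bernoulli CLT.
marginals = [
    sum(random.choice((-1, 1)) for _ in range(n)) / n**0.5
    for _ in range(samples)
]

print(statistics.mean(marginals), statistics.stdev(marginals))
# mean close to 0 and stdev close to 1, as the Gaussian limit predicts
```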