We may have found a solid hypothesis to explain why extreme overparametrization is so helpful in #DeepLearning, especially if one is concerned about adversarial robustness. arxiv.org/abs/2105.12806 1/7
With my student extraordinaire Mark Sellke @geoishard, we prove a vast generalization of the law of robustness we conjectured last summer: there is an inherent tradeoff between the number of neurons and the smoothness of the network (see the *pre-solution* video). 2/7
If you squint hard enough (e.g., like a physicist), our new universal law of robustness even makes concrete predictions for real data. For example, we predict that on ImageNet you need at least 100 billion parameters (i.e., GPT-3-like scale) to have a chance at good robustness guarantees. 3/7
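To see where a figure like 10^11 could come from, here is a back-of-envelope sketch; the values of n and d below are illustrative round numbers I am assuming, not the paper's exact accounting:

```python
# Back-of-envelope check of the overparametrization prediction.
# Law of robustness (informally): smooth interpolation needs p on the
# order of n*d, rather than the classical p > n.
# Illustrative round numbers -- assumptions, not the paper's figures:
n = 10**7   # assumed order of the ImageNet training set size
d = 10**4   # assumed effective data dimension

p_min = n * d  # predicted minimal parameter count for smooth interpolation
print(f"p should be at least ~{p_min:.0e} parameters")  # ~1e+11, GPT-3-like scale
```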
So what does the law actually say? Classically, interpolating n points with a p-parameter function class merely requires p > n. But what if you want to interpolate *smoothly*? We show that this simple extra robustness constraint forces overparametrization, by a factor of d (the data dimension)! 4/7
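The tradeoff above can be written as a single inequality; this is my informal paraphrase of the law from arxiv.org/abs/2105.12806, with constants and logarithmic factors suppressed:

```latex
% Universal law of robustness (informal paraphrase):
% any f in a smoothly parametrized p-parameter class that fits
% n noisy d-dimensional data points below the noise level satisfies
\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{nd}{p}},
% so achieving an O(1) Lipschitz constant forces p \gtrsim nd.
```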
This result holds for a broad class of data: 1) the covariate x should be a mixture of "isoperimetric" measures (e.g., Gaussians), but the key here is that we can allow *many* mixture components (up to n/log(n)); 2) the target labels y should have some independent noise. Reasonable?! 5/7
The real surprise to me is how general the phenomenon is. Back last summer we struggled to prove it for 2-layer neural nets, but in the end the law applies to *any* (smoothly parametrized) function class. The key to unlocking the problem was adopting a probabilist's perspective! 6/7
As always, we welcome comments! While the law itself is a rather simple mathematical statement, its interpretation is of course fairly speculative. In fact, you can check out the video by @EldanRonen explaining the paper & giving us a hard time on the speculative part 🤣 7/7
Interesting thread! To me the "reason" for the CLT is simply high-dimensional geometry. Consider the unit ball in dimension n+1 and slice it at distance x from the origin to get an n-dimensional ball of radius (1-x^2)^{1/2}. The volume of the slice is proportional to (1-x^2)^{n/2} ~ exp(-(1/2) n x^2). Tada, the Gaussian!!
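A quick numerical sanity check of the approximation (1-x^2)^{n/2} ~ exp(-n x^2/2); the dimension and slice location below are arbitrary illustrative choices:

```python
import math

# Slice the unit ball in dimension n+1 at distance x from the origin:
# the slice volume is proportional to (1 - x^2)^(n/2),
# which for large n is close to the Gaussian profile exp(-n*x^2/2).
n, x = 1000, 0.05  # arbitrary illustrative values

slice_term = (1 - x**2) ** (n / 2)
gaussian_term = math.exp(-n * x**2 / 2)
print(slice_term, gaussian_term)  # the two agree to within a fraction of a percent
```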
In other words, for a random point in the ball, the marginal in any direction converges to a Gaussian (a one-line calculation!). Maybe this doesn't look like your usual CLT. But consider the Bernoulli CLT: 1/sqrt(n) sum_i X_i = <X, u>, with X uniformly random in {-1,1}^n and u = 1/sqrt(n)*(1,...,1).
That is, the Bernoulli CLT is just about the marginal in the direction u of a random point in the hypercube! So instead of the geometry of the ball as in the first tweet, we need to consider the geometry of the cube. But it turns out that all these geometries are roughly the same!
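The cube version of the argument is easy to simulate: draw uniform sign vectors, project them onto u = (1,...,1)/sqrt(n), and check that the marginal looks standard Gaussian. A minimal sketch (the dimension and sample count are arbitrary choices of mine):

```python
import random
import statistics

random.seed(0)
n, samples = 400, 2000  # arbitrary illustrative sizes

# Marginal of a uniform point of {-1,1}^n in the direction
# u = (1,...,1)/sqrt(n): this is exactly 1/sqrt(n) * sum_i X_i,
# i.e. the classical Bernoulli CLT.
marginals = [
    sum(random.choice((-1, 1)) for _ in range(n)) / n**0.5
    for _ in range(samples)
]

print(statistics.mean(marginals), statistics.stdev(marginals))
# mean close to 0 and stdev close to 1, as the Gaussian limit predicts
```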