We may have found a solid hypothesis to explain why extreme overparametrization is so helpful in #DeepLearning, especially if one is concerned about adversarial robustness. arxiv.org/abs/2105.12806
With my student extraordinaire Mark Sellke @geoishard, we prove a vast generalization of our conjectured law of robustness from last summer: there is an inherent tradeoff between the number of neurons and the smoothness of the network (see the *pre-solution* video). 2/7
If you squint hard enough (e.g., like a physicist), our new universal law of robustness even makes concrete predictions for real data. For example, we predict that on ImageNet you need at least 100 billion parameters (i.e., GPT-3-like scale) to possibly attain good robustness guarantees. 3/7
So what does the law actually say? Classically, interpolating n points with a p-parameter function class only requires p > n. Now what if you want to interpolate *smoothly*? We show that this simple extra robustness constraint forces overparametrization, by a factor of d (the data dimension)! 4/7
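The ImageNet prediction above follows from back-of-the-envelope arithmetic with the bound p ≳ n·d. The exact numbers below are my own illustrative assumptions (order-of-magnitude only), not figures taken from the paper:

```python
# Hypothetical sanity check of the law-of-robustness bound p >~ n * d.
# Both constants are rough, assumed orders of magnitude for ImageNet.
n_samples = 10**7   # ~order of the ImageNet training set size (assumption)
d_eff = 10**4       # assumed effective data dimension (assumption)

p_needed = n_samples * d_eff
print(f"parameters needed for smooth interpolation: ~{p_needed:.0e}")
# -> ~1e+11, i.e. on the order of 100 billion parameters
```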
This result holds for a broad class of data:
1) the covariate x should be a mixture of "isoperimetric measures" (e.g., Gaussians), but the key here is that we allow *many* mixture components (up to n/log(n));
2) the target label y should have some independent noise. Reasonable?! 5/7
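To make the two conditions concrete, here is a tiny sketch of a dataset satisfying both. All constants (dimension, component count, noise levels) and the target function are my own illustrative choices, not from the paper:

```python
import random

random.seed(0)
d, n, k = 50, 200, 10  # data dimension, sample count, mixture components (illustrative)

# Condition 1: covariates drawn from a mixture of Gaussians,
# a standard example of a mixture of isoperimetric measures.
centers = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]

def sample_point():
    c = random.choice(centers)
    x = [ci + random.gauss(0, 0.5) for ci in c]
    # Condition 2: label = deterministic target plus independent noise
    # (the target here is an arbitrary stand-in).
    clean_y = 1.0 if x[0] > 0 else -1.0
    y = clean_y + random.gauss(0, 0.1)
    return x, y

data = [sample_point() for _ in range(n)]
print(len(data), len(data[0][0]))  # 200 50
```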
The real surprise to me is how general the phenomenon is. Back last summer we struggled to prove it for two-layer neural nets, but in the end the law applies to *any* (smoothly parametrized) function class. The key to unlocking the problem was to adopt a probabilist's perspective! 6/7
As always, we welcome comments! While the law itself is a rather simple mathematical statement, its interpretation is of course fairly speculative. In fact, you can check out a video by @EldanRonen explaining the paper & giving us a hard time on the speculative part 🤣 7/7

More from @SebastienBubeck

26 Jan
Interesting thread! To me the "reason" for the CLT is simply high-dimensional geometry. Consider the unit ball in dimension n+1 and slice it at distance x from the origin to get a dimension-n ball of radius (1-x^2)^{1/2}. The volume of the slice is proportional to (1-x^2)^{n/2} ≈ exp(-(1/2) n x^2). Tada, the Gaussian!!
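The Gaussian approximation of the slice volume is easy to check numerically. The particular values of n and x below are arbitrary illustrations:

```python
import math

# Slice volume (up to a dimension-dependent constant) vs. its Gaussian approximation:
# (1 - x^2)^(n/2)  ≈  exp(-(1/2) * n * x^2)  for small x and large n.
n, x = 1000, 0.05
slice_vol = (1 - x**2) ** (n / 2)
gaussian = math.exp(-0.5 * n * x**2)
print(slice_vol, gaussian)  # the two agree to well within 1% here
```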
In other words, for a random point in the ball, the marginal in any direction will converge to a Gaussian (a one-line calculation!). Maybe this doesn't look like your usual CLT. But consider the Bernoulli CLT: (1/sqrt(n)) sum_i X_i = &lt;X, u&gt;, with X random in {-1,1}^n and u = (1/sqrt(n))·(1,...,1).
That is, the Bernoulli CLT is just about the marginal in the direction u of a random point in the hypercube! So instead of the geometry of the ball as in the first tweet, we need to consider the geometry of the cube. But it turns out that all these geometries are roughly the same!
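The hypercube picture can be simulated directly: sample a uniform point of {-1,1}^n and look at its marginal along u = (1,...,1)/sqrt(n). The sample sizes below are my own choices for a quick Monte Carlo check:

```python
import math
import random
import statistics

random.seed(0)
n, trials = 400, 5000  # cube dimension and number of sampled points (illustrative)
scale = 1 / math.sqrt(n)

# Marginal <X, u> of a uniform random point X in {-1,1}^n along u = (1,...,1)/sqrt(n)
samples = [scale * sum(random.choice((-1, 1)) for _ in range(n)) for _ in range(trials)]

# A standard Gaussian has mean 0, standard deviation 1,
# and puts ~68.3% of its mass within one standard deviation.
print(statistics.mean(samples), statistics.stdev(samples))  # should be close to 0 and 1
frac_within_one = sum(abs(s) <= 1 for s in samples) / trials
print(frac_within_one)  # should be close to 0.683
```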