Paras Chopra @paraschopra, 19 tweets
1/ Was reading and thinking about why deep neural networks work so well on "natural" learning tasks (such as image classification).

Here are my notes.

(If some parts are vague, or if they assume familiarity with a concept you're not aware of, please let me know)
2/ The first major insight was that using minibatches of data for gradient descent actually helps generalization on unseen data.

Gradient components that are specific to a particular minibatch cancel out over many updates; what remains are the gradient directions that apply to the data in general.
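
Here's a minimal numpy sketch of the cancellation half of that claim (the data and linear model are made up for illustration, not from the thread): per-example gradients are noisy, but their minibatch average moves closer to the full-data gradient as the batch grows.

```python
# Sketch with made-up data: example-specific gradient noise averages out
# over a minibatch, leaving the generally applicable direction.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.5 * rng.normal(size=n)      # noisy labels

w = np.zeros(d)                                 # current parameters

def per_example_grads(w, X, y):
    # gradient of squared error for each example: 2 * (x.w - y) * x
    residual = X @ w - y
    return 2 * residual[:, None] * X

full_grad = per_example_grads(w, X, y).mean(axis=0)   # "true" direction

for batch_size in (1, 10, 100, 1000):
    idx = rng.choice(n, size=batch_size, replace=False)
    mb_grad = per_example_grads(w, X[idx], y[idx]).mean(axis=0)
    deviation = np.linalg.norm(mb_grad - full_grad)
    print(f"batch={batch_size:5d}  deviation from full gradient: {deviation:.3f}")
```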
3/ It is known that neural networks are universal function approximators: given enough units, they can approximate a given function with arbitrary accuracy.

But I now think that's not the interesting result. What's interesting is that they give good answers on unseen data.
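
A tiny sketch of that distinction (my own illustration, using scikit-learn and a toy sin(x) target, neither of which comes from the thread): fitting the training points is the easy "universal approximation" part; the interesting number is the error on points the network never saw.

```python
# Fit sin(x) with a small MLP, then check error on unseen points.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-np.pi, np.pi, size=(500, 1))
y_train = np.sin(x_train).ravel()

net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
net.fit(x_train, y_train)

x_test = rng.uniform(-np.pi, np.pi, size=(200, 1))    # unseen inputs
train_mse = np.mean((net.predict(x_train) - y_train) ** 2)
test_mse = np.mean((net.predict(x_test) - np.sin(x_test).ravel()) ** 2)
print(f"train MSE: {train_mse:.5f}   test MSE: {test_mse:.5f}")
```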
4/ How that happens is still a mystery, but the answer probably lies not so much in neural networks themselves as in the kinds of datasets the natural world gives us and the problems we use neural networks for.
5/ The natural world is full of information: a single 1000x1000 px photo carries a million pixels' worth of bits, but when we look at it we simply see a cat or a dog.

Effectively, we "throw out" a lot of information to do whatever we want to do.
6/ To classify a photo, our brain maps millions of bits of input down to roughly one bit (cat or dog), and the task of a neural network is to find the mapping that "forgets" or "throws away" all the information irrelevant to the task while retaining only the info that's useful to us.
7/ Since this millions-of-bits-to-one-bit mapping is massively many-to-one, neural networks might be a really good model for approximating these functions.

Different layers might be throwing away irrelevant information while keeping only the relevant info.
8/ This is suggested by two papers/videos I saw today.

One was on the information bottleneck: quantamagazine.org/new-theory-cra…
9/ The other one is on how errors introduced in early layers tend to vanish in higher layers: offconvex.org/2018/02/17/gen…
10/ In effect, neural networks are lossy compression algorithms: they compress inputs as much as they can while retaining as much information as possible about the task at hand (classification, prediction).

This helps networks generalize, since data-specific noise gets discarded along the way.
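
A back-of-the-envelope sketch of just how lossy that compression is (the 8-bit-grayscale assumption is mine, and the numbers are upper bounds, not measurements):

```python
# Upper bound on input information vs. the 1 bit a binary label carries.
import math

pixels = 1000 * 1000
bits_per_pixel = 8                      # assuming 8-bit grayscale
input_bits = pixels * bits_per_pixel    # upper bound on input information
label_bits = math.log2(2)               # cat-vs-dog label: 1 bit

print(f"input  : up to {input_bits:,} bits")
print(f"output : {label_bits:.0f} bit")
print(f"ratio  : {input_bits / label_bits:,.0f} : 1")
```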
11/ Okay, so we know what deep networks *might* be doing, but the question is how training via gradient descent manages to find the right set of parameters to do this compression.

Given the millions of weights and biases, it looks like a needle-in-a-haystack problem.
12/ I honestly don't know, and the research community doesn't know either. But there are hints.

One is related to the earlier suggestion that real-world tasks map inputs onto outputs many-to-one. This means there may be more than one set of parameters that does the job equally well.
13/ So stochastic gradient descent might not be finding the "perfect" set of parameters, but that may not matter: the problem we want to solve may be solvable by many sets of parameters, and SGD only has to find one of them.
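
One way to see this "many equally good solutions" point (again my own toy, with scikit-learn and an invented target function): train the same small network from several random initialisations and compare the final losses and weights.

```python
# Same architecture, different random seeds: weights differ, losses are similar.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])      # invented smooth target

for seed in range(3):
    net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000,
                       random_state=seed)
    net.fit(X, y)
    first_layer_norm = np.linalg.norm(net.coefs_[0])
    print(f"seed={seed}  final loss={net.loss_:.4f}  ||W1||={first_layer_norm:.2f}")
```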
14/ In fact, empirically the loss landscape of neural networks on "natural" problems (image classification, etc.) seems to have "flat" minima.

This is a good PDF (covering many other points too): ds3-datascience-polytechnique.fr/wp-content/upl…
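
One crude way to probe flatness (my sketch, not from that PDF; it reuses the toy network above and pokes at scikit-learn's fitted weight arrays): perturb the trained weights with small random noise and see how much the training loss moves. In a flat region it should barely change.

```python
# Perturb trained weights and measure how much the training loss changes.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])

net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000, random_state=0)
net.fit(X, y)

def mse(model, X, y):
    return float(np.mean((model.predict(X) - y) ** 2))

base = mse(net, X, y)
for scale in (0.01, 0.05, 0.1):
    saved = [W.copy() for W in net.coefs_]
    for W in net.coefs_:
        W += scale * rng.normal(size=W.shape)       # perturb weights in place
    print(f"noise scale {scale}: loss {base:.4f} -> {mse(net, X, y):.4f}")
    for W, W0 in zip(net.coefs_, saved):
        W[...] = W0                                  # restore original weights
```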
15/ So the function we're seeking might be reachable through many different parameter settings.

On top of this, what helps is that a big deep network contains many, many subnetworks, and one or more of them might be better positioned to navigate that landscape. arxiv.org/abs/1803.03635
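
A much cruder cousin of the lottery-ticket experiments in arxiv.org/abs/1803.03635 (the paper uses iterative pruning with weight rewinding; this sketch only does one-shot magnitude pruning on the toy network above): zero out the smallest weights and see how much the loss degrades.

```python
# One-shot magnitude pruning: keep only the largest 20% of weights.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])

net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0)
net.fit(X, y)

def mse(model, X, y):
    return float(np.mean((model.predict(X) - y) ** 2))

print(f"dense loss      : {mse(net, X, y):.4f}")
all_w = np.concatenate([W.ravel() for W in net.coefs_])
threshold = np.quantile(np.abs(all_w), 0.8)          # prune 80% of weights
for W in net.coefs_:
    W[np.abs(W) < threshold] = 0.0
print(f"80%-pruned loss : {mse(net, X, y):.4f}")
```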
16/ I understand how the width of a network may help in exploring which information to throw away (by setting weights to zero) and which to use, but I'm not sure about the role of depth.

My hunch is that the utility of depth is related to how stochastic gradient descent works.
17/ Perhaps, just perhaps, having different layers (depth) helps SGD reduce the loss in stages by focusing on a few dimensions at a time, whereas with one very wide layer SGD has too many dimensions to search at once.

But I don't really know.
18/ What's fascinating to me is how readily researchers drop neural networks in as function approximators anywhere and everywhere. This just makes it more worthwhile to study the dynamics of deep networks.

If you want to dive in, here's a great tutorial:
19/ I was reading more on the role of depth here: ds3-datascience-polytechnique.fr/wp-content/upl….

It seems depth helps with the expressivity of neural networks. While shallow networks can, in theory, express arbitrary functions, doing so can require exponentially more units, which is very difficult in practice.
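
A quick parameter-count comparison for intuition (the shapes below are arbitrary examples I picked, and counting parameters is only a rough proxy; this doesn't prove the exponential claim, it just shows what "deep and narrow" vs. "shallow and wide" costs for concrete sizes):

```python
# Count weights + biases in a fully connected network of a given shape.
def mlp_params(input_dim, hidden_widths, output_dim=1):
    dims = [input_dim] + list(hidden_widths) + [output_dim]
    return sum(a * b + b for a, b in zip(dims[:-1], dims[1:]))

print("deep & narrow  (8 hidden layers of 64):", mlp_params(100, [64] * 8))
print("shallow & wide (1 hidden layer of 4096):", mlp_params(100, [4096]))
```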