Here are my notes.
(If some parts are vague or if they assume a familiarity with a concept you're not aware of, please let me know)
The parts of a minibatch's gradient that are specific to that particular batch tend to cancel out over many updates, and what remains are gradient directions that apply to the data in general.
But now I think that by itself isn't an interesting result. What's interesting is that networks give good answers on unseen data.
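To make the cancellation point concrete, here's a small self-contained sketch (my own toy example, not from any particular paper): on a simple linear-regression problem, individual minibatch gradients are noisy, but averaging them over many batches lands very close to the full-batch gradient, i.e. the batch-specific parts cancel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression problem: y = X @ w_true + noise
n, d = 10_000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)  # current parameters (far from w_true)

def gradient(Xb, yb, w):
    """Gradient of mean squared error on a batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = gradient(X, y, w)

# Gradients of individual minibatches vary a lot...
batch_grads = []
for _ in range(200):
    idx = rng.choice(n, size=32, replace=False)
    batch_grads.append(gradient(X[idx], y[idx], w))
batch_grads = np.array(batch_grads)

# ...but the batch-specific parts cancel when averaged,
# leaving the generally applicable direction.
avg_grad = batch_grads.mean(axis=0)
print("typical single-batch deviation:",
      np.linalg.norm(batch_grads - full_grad, axis=1).mean())
print("deviation of the average:      ",
      np.linalg.norm(avg_grad - full_grad))
```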
Effectively, we "throw out" a lot of information in order to do whatever task we're trying to do.
Different layers might be throwing away irrelevant information while keeping only the relevant info.
One was on the information bottleneck: quantamagazine.org/new-theory-cra…
This helps deep networks generalize, since data-specific noise gets discarded along the way.
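One concrete way a layer throws information away (a toy sketch of my own, not from the article): ReLU is many-to-one, so two different inputs can produce exactly the same layer output, and whatever distinguished them is invisible to all later layers.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

# One layer: 10 inputs -> 6 ReLU units
d, m = 10, 6
W = rng.normal(size=(m, d))
b = rng.normal(size=m)
x = rng.normal(size=d)
b[0] = -(W[0] @ x) - 1.0   # force unit 0's pre-activation to be negative for this x

# Nudge x in a direction that only affects unit 0's pre-activation
# (W @ pinv(W) == I because W has full row rank), by less than the margin.
delta = 0.5 * np.linalg.pinv(W)[:, 0]
x2 = x + delta

h = relu(W @ x + b)
h2 = relu(W @ x2 + b)
print("inputs differ:      ", not np.allclose(x, x2))  # True
print("layer outputs equal:", np.allclose(h, h2))      # True: the difference is discarded
```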
Given the millions of weights and biases, it seems the problem is one of finding a needle in a haystack.
One is related to the earlier suggestion of a many-to-one mapping from input to output in real-world tasks. This means there may be more than one set of parameters that does the job equally well.
This is a good PDF covering this and many other points: ds3-datascience-polytechnique.fr/wp-content/upl…
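A concrete instance of "more than one set of parameters doing the job equally well" (a standard observation, sketched here as a toy example of my own): permute the hidden units of a layer and permute the next layer's weights to match, and you get different parameters that compute exactly the same function.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

# A tiny 2-layer network: 4 inputs -> 8 hidden (ReLU) -> 1 output
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def net(x, W1, b1, W2, b2):
    return W2 @ relu(W1 @ x + b1) + b2

# Permute the hidden units, and permute the second layer's columns to match.
perm = rng.permutation(8)
W1p, b1p = W1[perm], b1[perm]
W2p = W2[:, perm]

x = rng.normal(size=4)
print(net(x, W1, b1, W2, b2))     # original parameters
print(net(x, W1p, b1p, W2p, b2))  # different parameters, identical function
```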
On top of this, what helps is that a big deep network contains many, many subnetworks, and one or more of them might be better positioned to search that landscape. arxiv.org/abs/1803.03635
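Roughly, the paper's iterative magnitude pruning loop is: train, prune the smallest weights, rewind the survivors to their initial values, and retrain. Here's a heavily simplified sketch of that loop on a toy linear problem (the paper prunes real networks layer-wise; this only shows the structure):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy task: learn a sparse linear "teacher" with a dense "student".
d = 50
w_teacher = np.zeros(d)
w_teacher[:5] = rng.normal(size=5)
X = rng.normal(size=(2000, d))
y = X @ w_teacher

def train(w_init, mask, steps=500, lr=0.05):
    """Plain gradient descent on MSE, with pruned weights held at zero."""
    w = w_init * mask
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad * mask   # only the surviving weights get updated
    return w

w_init = 0.1 * rng.normal(size=d)   # the original random initialization
mask = np.ones(d)

for round_ in range(4):
    w = train(w_init, mask)         # always restart ("rewind") from w_init
    loss = np.mean((X @ w - y) ** 2)
    print(f"round {round_}: {int(mask.sum())} weights, loss {loss:.5f}")
    # Prune the half of the surviving weights with the smallest magnitude.
    alive = np.flatnonzero(mask)
    smallest = alive[np.argsort(np.abs(w[alive]))[:len(alive) // 2]]
    mask[smallest] = 0.0
```

The loss stays essentially unchanged even as most weights are removed, which is the lottery-ticket flavor of result: a small subnetwork, trained from the original initialization, does the job.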
My hunch is that the utility of depth is related to how stochastic gradient descent works.
But I don't really know.
If you want to dive in, here's a great tutorial:
It seems depth helps with the expressivity of neural networks. While shallow networks can, in theory, express any arbitrary function, doing so can require exponentially more units, which is very difficult in practice.
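A standard construction in the spirit of the depth-separation results (my own sketch, not taken from the tutorial): composing a 2-unit "tent" layer k times gives a sawtooth with 2^k linear pieces using only about 2k ReLU units, while a single hidden layer of m ReLUs can produce at most m + 1 pieces, so matching it would need exponentially many units.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def hat(x):
    """A 2-ReLU-unit 'tent' layer mapping [0, 1] onto [0, 1]."""
    return 2 * relu(x) - 4 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    """Compose the tent layer `depth` times: about 2 * depth ReLU units in total."""
    for _ in range(depth):
        x = hat(x)
    return x

# Dyadic grid, so every breakpoint of the sawtooth lands exactly on a grid point.
xs = np.linspace(0.0, 1.0, 2**10 + 1)
for depth in range(1, 7):
    ys = deep_sawtooth(xs, depth)
    slopes = np.diff(ys) / np.diff(xs)
    pieces = 1 + np.count_nonzero(np.abs(np.diff(slopes)) > 1e-9)
    print(f"depth {depth}: ~{2 * depth} ReLU units, {pieces} linear pieces")

# A single-hidden-layer ReLU net with m units gives at most m + 1 linear pieces
# (one breakpoint per unit), so matching depth 6 needs on the order of 2**6 units.
```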