Hidenori Tanaka
Group Leader, NTT Research; CBS-NTT Program in "Physics of Intelligence" at Harvard University

Dec 6, 2021, 10 tweets

Q. What does Noether’s theorem tell us about the “geometry of deep learning dynamics”?
A. We derive Noether’s Learning Dynamics and show:
"SGD + momentum + BatchNorm + weight decay" = "RMSProp" due to symmetry breaking!

w/ @KuninDaniel
#NeurIPS2021 Paper: bit.ly/3pAEYdk
1/

The geometry of data & representations has been central to the design of modern deep nets.
e.g., #GeometricDeepLearning arxiv.org/abs/2104.13478 by @mmbronstein, @joanbruna, @TacoCohen, @PetarV_93

What are the geometric design principles for “learning dynamics in parameter space”?
2/

We develop a Lagrangian mechanics of learning by modeling it as the motion of a particle in high-dimensional parameter space. Just as in physical dynamics, we can model the trajectory of discrete learning dynamics with continuous-time differential equations.
3/
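As a concrete sketch (standard construction; the symbols θ, η, L are chosen here for illustration, not taken from the paper): discrete gradient descent on a loss L(θ) with learning rate η,

$$\theta_{k+1} = \theta_k - \eta\,\nabla L(\theta_k) \;\;\longrightarrow\;\; \dot\theta = -\nabla L(\theta) \quad (\eta \to 0),$$

is approximated to first order in η by the continuous-time gradient flow on the right, the usual starting point for treating learning as a trajectory in parameter space.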

However, there is still a gap between Newton’s EOM and gradient flow. Thus, we model the effects of a finite learning rate as “implicit acceleration”, a route complementary to the "implicit gradient regularization" of @dgtbarrett, Benoit Dherin, @SamuelMLSmith, @sohamde_.
4/
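A minimal sketch of this "implicit acceleration" (the standard second-order modified-equation construction, in the same illustrative notation as above; the paper's derivation may differ in detail): matching the discrete update to a continuous trajectory up to second order in η gives

$$\frac{\eta}{2}\,\ddot\theta + \dot\theta = -\nabla L(\theta),$$

so a finite learning rate acts like a mass (acceleration) term added on top of gradient flow.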

As a result, gradient descent becomes Lagrangian dynamics with a finite learning rate, where the learning rule corresponds to the kinetic energy and the loss function corresponds to the potential energy.
5/
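For intuition, one damped Lagrangian whose Euler-Lagrange equation reproduces the second-order dynamics above is the Caldirola-Kanai-type form (an illustrative special case for Euclidean gradient descent, not the paper's general Lagrangian):

$$\mathcal{L}(\theta,\dot\theta,t) = e^{2t/\eta}\left(\frac{\eta}{4}\,\|\dot\theta\|^2 - L(\theta)\right),$$

where the quadratic term plays the role of kinetic energy (the learning rule) and the loss L plays the role of potential energy.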

Symmetry properties of this Lagrangian govern the geometry of learning dynamics.
Indeed, modern deep learning architectures introduce an array of symmetries to the loss function as we previously studied in arxiv.org/abs/2012.04728.
6/
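For example, batch normalization makes the loss invariant to rescaling the weights that feed into a normalized layer. A minimal numpy sketch of that scale symmetry (illustrative code, not from the paper; batchnorm here is bare standardization without learned affine parameters):

```python
import numpy as np

def batchnorm(z, eps=1e-8):
    # Standardize each feature over the batch (no learned affine parameters).
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))   # a batch of inputs
W = rng.normal(size=(10, 5))    # weights feeding the normalized layer
alpha = 3.7                     # arbitrary positive rescaling

# Scaling W by alpha leaves the normalized output (and hence the loss) unchanged.
print(np.allclose(batchnorm(X @ W), batchnorm(X @ (alpha * W)), atol=1e-5))  # True
```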

By studying the symmetry properties of the kinetic energy, we define “kinetic symmetry breaking”: the kinetic energy (corresponding to the learning rule) explicitly breaks a symmetry of the potential energy (corresponding to the loss function).
7/
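Concretely, in the illustrative Euclidean case above: under the scale transformation θ → e^s θ introduced by normalization, the potential L(θ) is invariant, but the kinetic term is not,

$$\frac{\eta}{4}\,\Big\|\tfrac{d}{dt}\big(e^{s}\theta\big)\Big\|^2 = e^{2s}\,\frac{\eta}{4}\,\|\dot\theta\|^2,$$

so the kinetic energy explicitly breaks a symmetry that the loss respects; this is what "kinetic symmetry breaking" refers to.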

We derive Noether’s Learning Dynamics (NLD), a unified equality that holds for any combination of symmetries and learning rules. NLD accounts for damping, the unique symmetries of the loss, and the non-Euclidean metric used in learning rules.
8/
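As a schematic reminder of the classical case (illustrative notation, with ψ a hypothetical symmetry generator; not the paper's full statement): if a one-parameter transformation θ → θ + s ψ(θ) leaves the Lagrangian invariant, Noether's theorem yields a conserved charge

$$Q = \left\langle \frac{\partial \mathcal{L}}{\partial \dot\theta},\, \psi(\theta) \right\rangle, \qquad \frac{dQ}{dt} = 0.$$

NLD, as described above, generalizes this equality: dQ/dt is no longer zero but is balanced by explicit terms coming from damping, the symmetry (or its breaking) of the loss, and the possibly non-Euclidean metric of the learning rule.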

We establish an exact analogy between two seemingly unrelated components of modern deep learning: normalization and adaptive optimization.
The benefits of this broken-symmetry-induced “implicit adaptive optimization” are all confirmed empirically!
9/
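A sketch of why scale invariance yields per-neuron adaptivity (a well-known consequence of the symmetry, stated in the illustrative notation above; the paper's exact statement of the analogy is in the link below): if L(αw) = L(w) for all α > 0, then

$$\nabla L(w) \perp w, \qquad \nabla L(w) = \frac{1}{\|w\|}\,\nabla L\!\left(\frac{w}{\|w\|}\right),$$

so a gradient step rotates the direction w/||w|| with an effective step size proportional to η/||w||²; weight decay and momentum then regulate ||w||², which makes this effective learning rate adapt to gradient magnitudes, i.e., RMSProp-like "implicit adaptive optimization".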

Overall, understanding not only when symmetries exist but also how they are broken is essential to discovering geometric design principles for neural networks.

For more details, see
“Noether’s Learning Dynamics: Role of Symmetry Breaking in Neural Networks”: bit.ly/3pAEYdk
10/
