Automatic differentiation is introduced in a few different ways, but in my opinion none of them explain particularly well what the method actually does, mathematically or algorithmically. Let's fix that with a short (okay, probably long) thread!
Let's start with the chain rule of calculus. The chain rule tells us how to construct the derivatives of a _composite_ function from the derivatives of the _component_ functions in a systematic way.
For example consider the component functions F1 : X0 -> X1 and F2 : X1 -> X2, and the composite function F = F2 o F1 : X0 -> X2. The chain rule says that dF_i / dx_j = sum_k d(F2)_i / dy_k * d(F1)_k / dx_j, where the y_k are coordinates on the intermediate space X1.
That's a little awkward to read, but we can clean it up if we organize the partial derivatives of each function into a _Jacobian matrix_. For example (J_F1)_ij = d(F1)_i / dx_j. Then the chain rule takes the much nicer form

J_(F2 o F1) = J_F2 * J_F1.
i.e. the chain rule just says that the Jacobian matrix of a composite function is the matrix product of the Jacobian matrices of the component functions. If we know how to compute those component Jacobians then we could _automate_ the computation of the composite Jacobian!
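To make that concrete, here's a minimal numpy sketch. The Jacobian entries are just made-up numbers standing in for the partial derivatives of two hypothetical component functions evaluated at some point:

# Chain rule as a matrix product: the Jacobian of F2 o F1 is J_F2 @ J_F1.
import numpy as np

J_F1 = np.array([[1.0, 2.0, 0.5],
                 [0.0, 3.0, 1.0]])   # 2 x 3 Jacobian of F1 : X0 -> X1 at x0
J_F2 = np.array([[4.0, 1.0],
                 [2.0, 0.0]])        # 2 x 2 Jacobian of F2 : X1 -> X2 at F1(x0)

J_composite = J_F2 @ J_F1            # 2 x 3 Jacobian of F = F2 o F1 at x0
print(J_composite)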
BUT THIS IS NOT AUTOMATIC DIFFERENTIATION! We could implement an algorithm that does this, but the matrix multiplications are expensive. In fact so expensive that this algorithm would often be too costly to be practical.
Fortunately, there is another way. Instead of trying to compute the composite Jacobian, let's try to compute the _action_ of the Jacobian on a vector in the input space, v1 = J_F1 * v0. In linear algebra matrices _live to act on vectors_!
This action also has a nice interpretation -- it defines a _directional derivative_. v1 quantifies how F1 changes along the direction v0.
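As a quick sanity check of that interpretation, a Jacobian-vector product should match a finite-difference estimate of how the function changes along v0. A small numpy sketch, with a toy function F and its Jacobian written out by hand:

import numpy as np

def F(x):
    # A toy function from R^2 to R^2, just for illustration.
    return np.array([x[0] * x[1], np.sin(x[0])])

def jacobian_F(x):
    # Its Jacobian matrix of partial derivatives, written out by hand.
    return np.array([[x[1],         x[0]],
                     [np.cos(x[0]), 0.0]])

x0 = np.array([0.3, 1.7])
v0 = np.array([1.0, -2.0])

jvp = jacobian_F(x0) @ v0                # action of the Jacobian on v0
eps = 1e-6
fd = (F(x0 + eps * v0) - F(x0)) / eps    # finite-difference directional derivative
print(jvp, fd)                           # these agree to roughly 1e-5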
Let's see what happens when we try to compute the action of the composite Jacobian:

v2 = J_(F2 o F1) * v0
v2 = (J_F2 * J_F1) * v0 [Chain rule]
v2 = J_F2 * (J_F1 * v0) [Associativity]
v2 = J_F2 * v1
In words, we can compute the action of the composite Jacobian with a sequence of matrix-vector products, O(dim^2), which is much cheaper than explicitly constructing the composite Jacobian, O(dim^3), and then computing the action.
So if we have some big composite function

F = F_N o .... o F_1

then we can efficiently compute the directional derivative J_F * v0 by _propagating_ vectors through the component functions,

v_N = J_FN * ( J_F(N-1) * ( ... ( J_F2 * ( J_F1 * v0 ) ) ... ) ).
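Here's a numpy sketch of that forward sweep, with random matrices standing in for the component Jacobians evaluated at the appropriate intermediate points; the result matches building the composite Jacobian first, it's just much cheaper to compute:

import numpy as np

rng = np.random.default_rng(0)
dims = [4, 5, 3, 2]   # dimensions of X0, X1, X2, X3
jacobians = [rng.normal(size=(dims[n + 1], dims[n]))   # stand-ins for J_F1, J_F2, J_F3
             for n in range(len(dims) - 1)]

v0 = rng.normal(size=dims[0])   # a vector in the input space X0
v = v0
for J in jacobians:             # forward sweep: v_n = J_Fn * v_(n-1)
    v = J @ v                   # each step is only a matrix-vector product

# Same answer as explicitly constructing the composite Jacobian first (costlier).
J_composite = np.linalg.multi_dot(jacobians[::-1])   # J_F3 @ J_F2 @ J_F1
print(np.allclose(v, J_composite @ v0))              # True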
This is _almost_ forward mode autodiff! Forward mode autodiff takes a computer program, converts each expression in the program into a component function with its own Jacobian matrix, and then propagates vectors forward along with the program to compute directional derivatives.
*There are some subtle differences between a component function and an expression in a computer program. For first-order directional derivatives the differences don't matter, but they do for higher-order derivatives. See Section 4 of arxiv.org/abs/1812.11592 for more.
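In practice an autodiff framework does this bookkeeping for you. For example JAX exposes the forward sweep as jax.jvp, which returns both the value of a function and its directional derivative along a chosen input vector (the function f below is just a toy example):

import jax
import jax.numpy as jnp

def f(x):
    # A toy composite function from R^2 to R.
    return jnp.sin(x[0] * x[1]) + x[1] ** 2

x0 = jnp.array([0.3, 1.7])
v0 = jnp.array([1.0, -2.0])

value, directional_derivative = jax.jvp(f, (x0,), (v0,))
print(value, directional_derivative)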
Reverse mode autodiff mirrors this process, but for computing the _adjoint action_ that maps vectors from the output space into the input space,

alpha_0 = (J_F)^T * alpha_1.

This defines a different directional derivative that is often more useful in practice.
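The adjoint sweep looks just like the forward sweep, only run backwards with transposed Jacobians. Here's a numpy sketch with the same kind of random stand-in Jacobians as before. Note that when the final output is one-dimensional, a single reverse sweep recovers the entire gradient, which is why reverse mode is so useful for optimizing scalar objective functions:

import numpy as np

rng = np.random.default_rng(1)
dims = [4, 5, 3, 2]   # dimensions of X0, X1, X2, X3
jacobians = [rng.normal(size=(dims[n + 1], dims[n]))   # stand-ins for J_F1, J_F2, J_F3
             for n in range(len(dims) - 1)]

alpha_out = rng.normal(size=dims[-1])   # a covector in the output space X3
alpha = alpha_out
for J in reversed(jacobians):           # reverse sweep: alpha_(n-1) = J_Fn^T * alpha_n
    alpha = J.T @ alpha

# Same answer as applying the transposed composite Jacobian directly.
J_composite = np.linalg.multi_dot(jacobians[::-1])
print(np.allclose(alpha, J_composite.T @ alpha_out))   # True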
Now here's where things get really exciting. We've introduced Jacobian-vector products and their adjoint products a bit heuristically here, but there is a very deep theoretical foundation waiting for us in...differential geometry!
Everything is geometry!
So the forward propagation of vectors? That's really the _pushforward_ of vectors living in the tangent spaces of each intermediate space. And the reverse propagation is the _pullback_ of covectors living in the cotangent spaces of those spaces.
Differential geometry tells us that we're building up the _best linear approximation_ of the composite function from the best linear approximations of the component functions. Best linear approximations are also known as _first-order Taylor expansions_.
So what we developed heuristically is actually how to build up the first-order Taylor expansion of a composite function from the first-order Taylor expansion of the component functions. That realization helps us to understand how _higher-order_ autodiff works.
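In symbols: the first-order Taylor expansion of F1 is F1(x0 + dx) ≈ F1(x0) + J_F1 * dx, and feeding that into the first-order Taylor expansion of F2 gives

(F2 o F1)(x0 + dx) ≈ F2(F1(x0)) + J_F2 * J_F1 * dx,

which is exactly the chain rule from before: the best linear approximation of the composite is the composition of the best linear approximations of the components.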
It turns out that there isn't a way to write the Hessian matrix of second-order partial derivatives of a composite function in terms of only the Hessian matrices of the component functions, as we were able to do with the Jacobian matrices. First and second-order partial derivatives end up mixing. That's because a Hessian matrix by itself isn't a well-defined geometric object. What is a well-defined geometric object is the second-order Taylor expansion, which includes both the first and second-order partial derivatives. This leads us into the theory of _jets_.
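To see the mixing explicitly we can differentiate the chain rule one more time. For each output component i of F = F2 o F1 the Hessian of the composite is

H_(F, i) = J_F1^T * H_(F2, i) * J_F1 + sum_k d(F2)_i / dy_k * H_(F1, k),

where the y_k are again coordinates on the intermediate space X1. The second-order information in H_(F2, i) and H_(F1, k) gets tangled up with the first-order information in the Jacobians, so the component Hessians alone aren't enough.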
I'm definitely not going to talk about jets here -- you can read all about general jets, and the velocity jets that generalize forward mode autodiff and the covelocity jets that generalize reverse mode autodiff in another one of my favorite papers, arxiv.org/abs/1812.11592.
The lesson here is that differential geometry tells us _what_ well-defined objects we can construct from derivatives and _how_ they transform. Instead of trying to work things out heuristically, the geometry lets us build up transformation rules, and algorithms, systematically.