
Understanding higher order automatic differentiation

Having recently finished writing my own basic reverse-mode AD for machine learning purposes, I find myself wanting to learn more about the field, but I've hit a wall with higher-order methods.

Basic reverse AD is beautifully simple and easy to understand, but the more advanced material is both too abstract and too technical, and I have not been able to find any good explanations of it on the Internet (in fact, it took me quite a while to realize that basic reverse AD even exists).

Basically, I understand how to take second derivatives in the context of calculus, but I do not understand how to transform a reverse AD graph to get second-order derivatives.

In an algorithm like edge_pushing, just what do those dashed edges mean?

I've investigated the DiffSharp library and noted that it uses something like forward-on-reverse differentiation for calculating the Hessian. Running it through the debugger, I've seen that it does in fact mix forward and reverse steps in a single run. What are the principles behind that mechanism?

DiffSharp uses a Jacobian-vector product to calculate the Hessian for each variable, which is an R^m -> R^n mapping. How is it possible to get that from the original graph? Reverse AD is an R -> R^n mapping; where do the extra dimensions come from?
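For concreteness, here is a minimal plain-Python sketch of what I understand the forward-on-reverse idea to be. It is my own toy construction (the Dual class, grad_f, and the example function f(x, y) = x^2*y + y are all invented for illustration), not DiffSharp's actual implementation: run an ordinary hand-written reverse pass, but with every number replaced by a forward-mode dual number whose tangent is seeded with a direction vector v. The primal parts then reproduce the gradient, and the tangent parts deliver the Hessian-vector product H.v in the same sweep.

```python
# A toy sketch of forward-on-reverse: dual numbers pushed through a reverse pass.
# (My own illustration; not DiffSharp's actual implementation.)

class Dual:
    """Forward-mode dual number: a value plus a tangent (directional derivative)."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)
    __rmul__ = __mul__

def grad_f(x, y):
    """Hand-written reverse pass for f(x, y) = x*x*y + y, written generically
    so it works whether x and y are floats or Dual numbers."""
    # forward sweep
    a = x * x                # a = x^2
    b = a * y                # b = x^2 * y
    # f = b + y
    # reverse sweep (adjoints)
    db = 1.0
    dy = 1.0                 # from f = b + y
    da = db * y              # from b = a * y
    dy = dy + db * a
    dx = da * x + da * x     # from a = x * x
    return dx, dy

# Hessian-vector product at (x, y) = (3, 2) with direction v = (1, 0):
# seed the inputs' tangents with v, then read H.v off the gradient's tangents.
x, y = Dual(3.0, 1.0), Dual(2.0, 0.0)
dx, dy = grad_f(x, y)
print(dx.val, dy.val)   # gradient (2xy, x^2 + 1)                 -> 12.0 10.0
print(dx.tan, dy.tan)   # H.v, the first Hessian column (2y, 2x)  -> 4.0 6.0
```

As far as I can tell, each such sweep costs only a small constant factor more than the gradient itself, and the extra dimensions come from doing one sweep per direction (for example, one per basis vector to recover the full Hessian column by column), not from the graph itself.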

Lastly, how does nested AD work?

I wrote the paper on edge_pushing. First you start with the computational graph of the gradient, and by gradient here I mean the computational graph produced by the reverse gradient method. The edge_pushing algorithm is then simply the reverse gradient algorithm applied to this gradient graph, which gives you the Hessian. The catch is that it does this in an intelligent way. In particular, the dotted edges are artificially added edges that represent a nonlinear interaction between two nodes (both nodes are inputs of a nonlinear function further up the graph). These nonlinear dotted edges make it easy to visualize where the major costs of calculating the reverse gradient on the gradient graph occur, and how best to accumulate the total derivative. Does that help?
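A hand-worked toy sketch of "reverse applied to the gradient graph", in plain Python. This is not from the paper and is not the edge_pushing algorithm itself (which does the accumulation far more intelligently, as described above); the functions gradient and hessian_vector and the example f(x, y) = x^2 * y are invented purely for illustration.

```python
# Toy hand-worked example: f(x, y) = x^2 * y.
# Step 1: the "gradient graph" is just the code of the reverse gradient method.
def gradient(x, y):
    a = x * x            # forward sweep: a = x^2   (f = a * y)
    # reverse sweep
    da = y               # adjoint of a
    dy = a               # df/dy = x^2
    dx = 2 * x * da      # df/dx = 2xy
    return dx, dy

# Step 2: apply the reverse method once more, this time to the scalar
# w(x, y) = v1*dx + v2*dy built on top of the gradient graph above.
# The result is the Hessian-vector product H.v.
def hessian_vector(x, y, v1, v2):
    # forward sweep = the gradient graph (for this simple f the reverse
    # sweep below happens not to reuse these intermediate values)
    a = x * x
    gx = 2 * x * y
    gy = a
    # w = v1*gx + v2*gy, so the output adjoints of the gradient graph are v:
    dgx, dgy = v1, v2
    # reverse sweep over the gradient graph
    da = dgy                 # from gy = a
    dx = dgx * 2 * y         # from gx = 2*x*y
    dy = dgx * 2 * x
    dx = dx + da * 2 * x     # from a = x*x
    return dx, dy            # = H.v

print(gradient(3.0, 2.0))                   # (12.0, 9.0) = (2xy, x^2)
print(hessian_vector(3.0, 2.0, 1.0, 0.0))   # (4.0, 6.0)  = first column of [[2y, 2x], [2x, 0]]
```

Seeding (v1, v2) with each basis vector in turn recovers the full (symmetric) Hessian. As I read the description above, the dotted edge in this example would be the one between x and y, since both are inputs of the nonlinear product x^2 * y; it corresponds to the mixed partial ∂²f/∂x∂y = 2x.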

I wrote a tutorial for AD that briefly shows, near the end, how to do forward along with reverse. I also wrote an entire library for basic AD on the GPU, which can be found linked at the same site.

I'm still not sure about edge_pushing, but I do not think it matters much for neural nets at any rate.
