
How does automatic differentiation with respect to the input work?

I've been trying to understand how automatic differentiation (autodiff) works. There are several implementations of it, for example in TensorFlow, PyTorch and other frameworks.

There are three aspects of automatic differentiation that currently seem vague to me.

  1. The exact process used to calculate the gradients
  2. How autodiff works with respect to inputs
  3. How autodiff works with respect to a single value as input

So far, my understanding is that it roughly follows these steps:

  1. Break up the original function into elementary operations (individual arithmetic operations, compositions and function calls).
  2. The elementary operations are combined to form a computational graph in such a way that the original function can be calculated using the computational graph.
  3. The computational graph is executed for a certain input, and each operation is recorded.
  4. Walking through the recorded operations in reverse using the chain rule gives us the gradient.

First of all, is this a correct overview of the steps that are taken in automatic differentiation?
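
To make steps 3 and 4 concrete for myself, here is a rough sketch in plain Python of how I imagine the tape works for y = x^2 at x = 3.0 (this is only my mental model, not how TensorFlow or PyTorch are actually implemented):

# A variable holds a value and a slot for its gradient.
class Var:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

tape = []  # records of (output, input, local derivative), in execution order

def square(x):
    out = Var(x.value ** 2)
    tape.append((out, x, 2.0 * x.value))  # local derivative d(out)/dx = 2x
    return out

def backward(y):
    y.grad = 1.0  # seed: dy/dy = 1
    for out, inp, local_grad in reversed(tape):
        # chain rule: dy/d(inp) += dy/d(out) * d(out)/d(inp)
        inp.grad += out.grad * local_grad

x = Var(3.0)
y = square(x)   # forward pass, records the operation on the tape
backward(y)     # reverse pass over the recorded operations
print(y.value, x.grad)  # 9.0 6.0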

Secondly, how would the above process work for a derivative with respect to the inputs? For instance, a function y = x^2 would seem to need a difference in the x value. Does that mean that the derivative can only be calculated after at least two different x values have been provided as input? Or does it require multiple inputs at once (i.e. a vector input) over which it can calculate a difference? And how does this compare to calculating the gradient with respect to the model weights (i.e. as done in backpropagation)?

Thirdly, how can we take the derivative with respect to a single value? Take, for instance, the following Python code where the derivative of y = x^2 is calculated:

import tensorflow as tf

x = tf.constant(3.0)
with tf.GradientTape() as tape:
  tape.watch(x)  # constants are not watched by default, so track x explicitly
  y = x**2

# dy = 2x * dx
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())  # prints: 6.0

Since dx is the difference between several x inputs, would that not mean that dx = 0 ?


I found that this paper had a pretty good overview of the various modes of autodiff, as well as how it differs from numerical and symbolic differentiation. However, it did not give me a full understanding, and I would still like to understand the autodiff process in the context of these traditional differentiation techniques.

Rather than applying it practically, I would love to get a more theoretical understanding.

I had similar questions in my mind a few weeks ago, until I started to code my own automatic differentiation package tensortrax in Python. It uses forward-mode AD with a hyper-dual number approach. I wrote a Readme (the landing page of the repository, section Theory) with an example which could be of interest to you.
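
To give a flavour of the idea (this is only a toy sketch, not the actual tensortrax implementation), forward-mode AD can be illustrated with a plain dual-number class that carries a value and its derivative through every elementary operation:

# Toy dual number: real part is the value, dual part is the derivative.
class Dual:
    def __init__(self, real, dual=0.0):
        self.real = real
        self.dual = dual

    def __add__(self, other):
        # sum rule: (f + g)' = f' + g'
        return Dual(self.real + other.real, self.dual + other.dual)

    def __mul__(self, other):
        # product rule: (f * g)' = f' * g + f * g'
        return Dual(self.real * other.real,
                    self.dual * other.real + self.real * other.dual)

def f(x):
    return x * x  # y = x**2

x = Dual(3.0, 1.0)  # seed the input with dx/dx = 1
y = f(x)
print(y.real, y.dual)  # 9.0 6.0 -> value and derivative in a single forward pass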

I think what you need to understand first is what a derivative is; many math textbooks could help you with that. The notation dx means an infinitesimal variation, so you do not actually compute any difference. Instead, a symbolic operation transforms your function f into a new function f', also written df/dx, which you can then evaluate at any point where it is defined.
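
As a small illustration of that distinction in Python (my own sketch): the derivative of f(x) = x^2 is the new function f'(x) = 2x, which you simply evaluate at a point; only numerical differentiation actually computes a difference with a small step h:

def f(x):
    return x ** 2

def f_prime(x):
    # result of applying the symbolic rule d(x**2)/dx = 2x
    return 2 * x

def f_prime_numerical(x, h=1e-6):
    # numerical differentiation really does use a finite difference
    return (f(x + h) - f(x)) / h

print(f_prime(3.0))            # exactly 6.0
print(f_prime_numerical(3.0))  # approximately 6.000001, only an approximation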

Regarding the algorithm used for automatic differentiation, you understood it right. The part that you seem to be missing is how the derivatives of elementary operations are computed and what they mean, but it would be hard to give a crash course about that in an SO answer.
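
Very roughly (a hand-written sketch, not any library's internals): every elementary operation comes with a known derivative rule, and the chain rule combines them. For example, for y = sin(x^2):

import math

x = 3.0
u = x ** 2        # elementary op 1, local derivative du/dx = 2x
y = math.sin(u)   # elementary op 2, local derivative dy/du = cos(u)

# chain rule, walking the composition in reverse:
dy_du = math.cos(u)
du_dx = 2 * x
dy_dx = dy_du * du_dx
print(dy_dx)      # cos(9) * 6 ≈ -5.467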
