PyTorch autograd: dimensionality of custom function gradients?

Question summary: How is the dimensionality of inputs and outputs handled in the backward pass of custom functions?

According to the manual, the basic structure of custom functions is the following:

class MyFunc(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input): # f(x) = e^x
        result = input.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output): # df(x) = e^x
        result, = ctx.saved_tensors
        return grad_output * result
        

For a single input and output dimension, this is perfectly fine and works like a charm. But for higher dimensions the backward pass becomes confusing. Apparently, PyTorch only accepts a result of backward that has the same shape as the corresponding input of forward. Returning a wrong shape yields a RuntimeError: Function MyFunc returned an invalid gradient at index 0 - got [*] but expected shape compatible with [*]. So I am wondering: what does backward actually compute?
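For concreteness, here is a minimal check of that shape contract using the MyFunc above (the input size 5 is arbitrary):

import torch

x = torch.randn(5, requires_grad=True)
y = MyFunc.apply(x)      # forward result has shape (5,)
y.sum().backward()       # the grad_output handed to backward also has shape (5,)
print(x.grad.shape)      # torch.Size([5]) -- matches the input's shape

# Returning anything else from backward (a scalar, a (5, 5) matrix, ...) triggers
# the "returned an invalid gradient at index 0" RuntimeError quoted above.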

It's not a Jacobian? For example, when I have a function f(x) = ( f_1(x_1, ... , x_n), ... , f_k(x_1, ... , x_n) ) with n inputs and k outputs, I would expect a gradient calculation to yield a Jacobian matrix of dimension k*n. However, the PyTorch implementation expects backward to return just a vector of dimension n. So what does the backward result actually mean? It can't be the Jacobian.
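To make the shapes concrete, consider a hypothetical linear map y = W x with W of shape k*n (so the full Jacobian dy/dx would be W itself); PyTorch still expects backward to hand back only an n-sized vector:

import torch

class MyLinear(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input, weight): # f(x) = W x, with W of shape (k, n)
        ctx.save_for_backward(weight)
        return input @ weight.t()              # shape (k,)

    @staticmethod
    def backward(ctx, grad_output):            # grad_output has shape (k,)
        weight, = ctx.saved_tensors
        grad_input = grad_output @ weight      # shape (n,), not the (k, n) Jacobian
        return grad_input, None                # no gradient needed for the fixed W

n, k = 4, 3
x = torch.randn(n, requires_grad=True)
W = torch.randn(k, n)
MyLinear.apply(x, W).sum().backward()
print(x.grad.shape)                            # torch.Size([4]) -- size n, not k*n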

And it does not handle batches? Moreover, what if I want to push a batch of input vectors through this function, e.g. an input of dimension b*n with batch size b? Then, instead of something like b*k*n, the gradient is also expected to have the shape b*n. Is processing batches with custom functions even intended?
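The same sketch with a batch dimension: for an input of shape (b, n) the output has shape (b, k), grad_output arrives with shape (b, k), and backward must return something of shape (b, n) rather than (b, k, n):

# continuing the MyLinear sketch from above
b = 8
xb = torch.randn(b, n, requires_grad=True)
yb = MyLinear.apply(xb, W)                     # shape (b, k)
yb.sum().backward()                            # grad_output has shape (b, k)
print(xb.grad.shape)                           # torch.Size([8, 4]) -- (b, n), not (b, k, n)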

None of these questions seems to be addressed in the manual, and the provided examples are very simple, which does not help at all. Maybe there are formulas hidden somewhere that explain the background of the provided Function interface in more detail, but I haven't found them yet.

It does not store/return the Jacobian (I imagine this is related to memory considerations).

From a training perspective, we do not need the Jacobian for updating parameters or for back-propagating further.

For updating parameters, all we need is dL/dy_j for each component y_j:

y_j -= alpha * dL/dy_j

And for backpropagating further, to an earlier tensor z, say y = f(z) = f(g(x)):

dL/dz_k = Σ_j dL/dy_j * dy_j/dz_k

One may say, "but we need dy_j/dz_k here!" That is true, but we do not need to store it (just like we do not use the Jacobian dx_i/dy_j at all in this step).

In other words, the Jacobian is only used implicitly; it is not required for the most part, and it is therefore done away with for the sake of memory.
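To see this concretely (an illustrative sketch with an arbitrary test function): the vector that autograd propagates is exactly the vector-Jacobian product, which can be checked against an explicitly materialized Jacobian:

import torch
from torch.autograd.functional import jacobian

n, k = 4, 3
A = torch.randn(k, n)

def f(x): # some R^n -> R^k function
    return torch.tanh(x @ A.t())

x = torch.randn(n, requires_grad=True)
y = f(x)
v = torch.randn(k)                                   # plays the role of dL/dy

vjp = torch.autograd.grad(y, x, grad_outputs=v)[0]   # what backward propagates: an n-vector
J = jacobian(f, x)                                   # the full (k, n) Jacobian, never stored by autograd
print(torch.allclose(vjp, v @ J))                    # True (up to numerical noise)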

And for the batch part, note that mini-batch learning mostly just averages the gradient. PyTorch expects you to handle the batch dimension in the backward function (again, so that the function returns as little as possible and saves as much memory as possible).
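For instance (a hypothetical sketch of a linear layer with a learnable weight): the input gradient keeps one row per sample, while the weight gradient is reduced over the batch dimension inside backward. When the loss is a mean over the batch, the 1/b factor already arrives through grad_output, so this reduction yields the batch-averaged gradient.

import torch

class MyLinearWithWeight(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input, weight): # input: (b, n), weight: (k, n)
        ctx.save_for_backward(input, weight)
        return input @ weight.t()                  # (b, k)

    @staticmethod
    def backward(ctx, grad_output):                # grad_output: (b, k)
        input, weight = ctx.saved_tensors
        grad_input = grad_output @ weight          # (b, n): one gradient row per sample
        grad_weight = grad_output.t() @ input      # (k, n): summed over the batch dimension
        return grad_input, grad_weight

b, n, k = 8, 4, 3
xb = torch.randn(b, n, requires_grad=True)
weight = torch.randn(k, n, requires_grad=True)
MyLinearWithWeight.apply(xb, weight).mean().backward()
print(weight.grad.shape)                           # torch.Size([3, 4]) -- no batch dimension left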

Note: one can "gather" the Jacobian and obtain the n-sized vector that you have mentioned: specifically, sum over the k dimension and average over the batch dimension.
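To verify the sum-over-k part (same kind of test function as in the sketch above): calling backward with a grad_output of all ones hands back exactly the Jacobian summed over the output dimension:

import torch
from torch.autograd.functional import jacobian

n, k = 4, 3
A = torch.randn(k, n)
f = lambda x: torch.tanh(x @ A.t())              # R^n -> R^k

x = torch.randn(n, requires_grad=True)
f(x).backward(torch.ones(k))                     # grad_output of all ones
print(torch.allclose(x.grad, jacobian(f, x).sum(dim=0)))   # True: Jacobian summed over k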

EDIT: Not 100% sure, but I think the backward call (of f(x) = y) is expected to return this vector:

∇x = [ Σ_j dL/dy_j * dy_j/dx_1 , ... , Σ_j dL/dy_j * dy_j/dx_n ]

where the dL/dy_j terms together form the grad_output argument passed to backward, and ∇x is the gradient that backward returns for the input.
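A practical way to check that a custom backward really returns this vector is torch.autograd.gradcheck, which compares it against numerically estimated gradients (here applied to the MyFunc from the question; gradcheck wants double precision):

import torch

x = torch.randn(6, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(MyFunc.apply, (x,)))   # True if backward is consistent with forward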
