Neural network: batch-version affine layer back-propagation weight matrix update
I'm trying to follow a book about deep learning, but I found part of it about the affine layer of the network confusing. Say I have a network which accepts handwritten digit (0 ~ 9) images (MNIST), each flattened into a one-dimensional array, e.g. np.array([123, 255, 0, ...]), and it outputs the scores for each possible class, e.g. np.array([0., 0., 0.3, 0., 0., 0.6, 0., 0., 0., 0.1]) (so this image may be the number 5).
Here is my implementation of the affine layer:
class AffineLayer(object):
    ...
    def backward(self, dy):
        dx = np.dot(dy, self._W.T)
        self._dW = np.dot(self._x.T, dy)  # Question related part
        self._db = np.sum(dy, axis=0)
        dx = dx.reshape(self._original_x_shape)
        return dx
    ...
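For context, backward() relies on state cached during the forward pass. Here is a minimal sketch of what that forward pass presumably looks like; the attribute names self._x, self._W, self._b, and self._original_x_shape are inferred from backward() above, not taken from the book:

import numpy as np

class AffineLayer(object):
    def __init__(self, W, b):
        self._W = W
        self._b = b

    def forward(self, x):
        # Remember the raw input shape so backward() can undo the flattening.
        self._original_x_shape = x.shape
        # Flatten everything but the batch axis: (N, ...) -> (N, features).
        self._x = x.reshape(x.shape[0], -1)
        return np.dot(self._x, self._W) + self._b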
Here are some explanations. self._W is the weight matrix, and

self._dW = np.dot(self._x.T, dy)  # Question related part

is the line in question. It is derived from the equality:
  X    *    W    +    B    =    Y
(N,2)     (2,3)     (1,3)     (N,3)

The shape notation such as (N,2) follows numpy's X.shape convention. To simplify the problem, I chose these concrete dimension numbers.
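To make those shapes concrete, here is a quick numpy check using the toy dimensions above (the values are arbitrary):

import numpy as np

N = 4                      # batch size
X = np.random.rand(N, 2)   # N flattened inputs, 2 features each
W = np.random.rand(2, 3)   # weight matrix
B = np.random.rand(1, 3)   # bias row, broadcast across the batch
Y = np.dot(X, W) + B
print(Y.shape)             # (4, 3), i.e. (N, 3)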
End of terminology; now here comes the question:
By some math (omitted here), we arrive at the equality used in back-propagation, which is why self._dW = np.dot(self._x.T, dy) appears in the code:

 dL        T    dL
 --   =   X  *  --
 dW             dY

(2,3)    (2,N) * (N,3)
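A quick sanity check of that claim: whatever N is, np.dot(X.T, dY) sums over the batch axis, so dW always comes out (2, 3). (dY below stands for dL/dY; the numbers are arbitrary.)

import numpy as np

for N in (1, 5, 100):
    X  = np.random.rand(N, 2)
    dY = np.random.rand(N, 3)   # dL/dY, one row per sample
    dW = np.dot(X.T, dY)        # dL/dW
    print(N, dW.shape)          # always (2, 3)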
Notice that no matter how I adjust N, the batch size, the dimensions of dL/dW (the partial derivative of the loss L with respect to the weight matrix) won't change; they are always (2,3). Does this mean that the total effect of the N samples in the batch is combined/condensed into dL/dW? This is related to how I would implement the output layer, e.g. a softmax-cross-entropy layer as the final layer. My current conclusion was that a batch of size N means doing the back-propagation N times, and that dividing the gradient dL/dW by N is needed to average/amortize the total effect of that batch. But now it seems I only have to do it once, and the division should happen "in the first step".
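Before going further, here is a quick numeric check that one batched matrix product equals N per-sample back-propagations summed (same toy shapes as above):

import numpy as np

N  = 5
X  = np.random.rand(N, 2)
dY = np.random.rand(N, 3)

batched  = np.dot(X.T, dY)                               # one shot
per_item = sum(np.outer(X[n], dY[n]) for n in range(N))  # N passes, summed
print(np.allclose(batched, per_item))                    # True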
For reference, I also found a version which seems to do the division at the last step instead: mnielsen/neural-networks-and-deep-learning on GitHub.
Since the softmax-cross-entropy layer class is the last layer of the net, in back-propagation it becomes the "first step" I mentioned above:
class SoftmaxCrossEntropy(object):
    ...
    def backward(self, dout=1):
        batch_size = self._t.shape[0]
        # one-hot
        if self._t.size == self._y.size:
            dx = (self._y - self._t) / batch_size  # <-- why divided by N here?
        else:  # not one-hot
            dx = self._y * 1
            dx[np.arange(batch_size), self._t] -= 1
            dx = dx / batch_size  # <-- why divided by N here?
        return dx
    ...
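For completeness, backward() above assumes a forward pass that cached the softmax output self._y and the labels self._t. The sketch below is my own minimal version under those assumptions (one-hot labels, row-wise softmax); only self._y and self._t are implied by backward(), nothing here is confirmed by the book's code:

import numpy as np

class SoftmaxCrossEntropy(object):
    def forward(self, scores, t):
        self._t = t
        # Numerically stable softmax, row-wise over the batch.
        e = np.exp(scores - scores.max(axis=1, keepdims=True))
        self._y = e / e.sum(axis=1, keepdims=True)
        # Mean cross-entropy loss over the batch (one-hot t assumed).
        eps = 1e-7
        return -np.sum(t * np.log(self._y + eps)) / t.shape[0]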
Does this mean that the effect of the N input samples in the batch is combined/condensed into dW?
The i-th row of X is a 1-d array corresponding to the i-th flattened image. If I transpose X into X^T, then its columns represent those flattened images. If N increases, the dimensions of the result dW won't change, but the intermediate steps of computing each element of

       T   dL
dW  = X  * --
           dY

increase; concretely,

         N
dW    = Sum ( pixel_info    * class_scores_deriv.     )
  i,j   n=1             i,n                       n,j

where N is the batch size. Clearly each class_scores_deriv._(n,j) "joins" the determination of dW_(i,j), and this implies that the division by batch_size = N is needed: the summation above is the total, not the average, effect of that batch, but what I need is for dW to represent the average impact of that batch.
If I divide each element of class_scores_deriv. by N, which is what the line dx = (self._y - self._t) / batch_size does, then

 N                    1
Sum ( pixel_info    * -  class_scores_deriv.     )
n=1             i,n   N                      n,j

   1    N
=  -   Sum ( pixel_info    * class_scores_deriv.     )
   N   n=1             i,n                       n,j

   1
=  -  dW
   N    i,j

which is the real dW I want.
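Numerically, dividing dx inside the softmax layer and dividing the finished dW are interchangeable, exactly as the algebra above says; a small check:

import numpy as np

N  = 5
X  = np.random.rand(N, 2)
dY = np.random.rand(N, 3)                # raw dL/dY, before any division

dW_late  = np.dot(X.T, dY) / N           # divide the finished gradient
dW_early = np.dot(X.T, dY / N)           # divide dY first, as the layer does
print(np.allclose(dW_late, dW_early))    # True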
So the answer (I hope) should be: the entire batch determines dW, but to condense it into an average, the divisions in SoftmaxCrossEntropy::backward(self, dout=1) are needed.