I'm trying to follow a book about deep learning, but I found the part about the affine layer of the network confusing. Say I have a network which accepts handwritten-digit (0~9) images (MNIST), each flattened into a one-dimensional array, e.g. np.array([123, 255, 0, ...]), and it outputs the scores for each possible class, e.g. np.array([0., 0., 0.3, 0., 0., 0.6, 0., 0., 0., 0.1]) (so the image is probably the digit 5).
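For concreteness, a tiny sketch of the shapes involved (the pixel values here are made up, and the image is truncated to three pixels):

    import numpy as np

    # A real flattened 28x28 MNIST image would have 784 entries.
    x = np.array([123, 255, 0])
    # One score per digit class 0-9; the highest score is the prediction.
    scores = np.array([0., 0., 0.3, 0., 0., 0.6, 0., 0., 0., 0.1])
    print(scores.argmax())   # -> 5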
Here is my implementation of the affine layer:
class AffineLayer(object):
    ...
    def backward(self, dy):
        # dy is dL/dY, the gradient arriving from the next layer.
        dx = np.dot(dy, self._W.T)               # dL/dX = dL/dY * W^T
        self._dW = np.dot(self._x.T, dy)         # Question related part
        self._db = np.sum(dy, axis=0)            # bias gradient: sum over the batch
        dx = dx.reshape(self._original_x_shape)  # undo the flattening from forward
        return dx
    ...
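For context, here is a minimal sketch of the matching forward pass (not copied from the book; the attribute names just mirror what backward above relies on):

    import numpy as np

    class AffineLayer(object):
        def __init__(self, W, b):
            self._W = W    # weight matrix, e.g. shape (2, 3) below
            self._b = b    # bias vector, e.g. shape (3,)
            self._x = None
            self._original_x_shape = None

        def forward(self, x):
            # Remember the original shape so backward can restore it.
            self._original_x_shape = x.shape
            self._x = x.reshape(x.shape[0], -1)   # flatten each sample to 1-d
            return np.dot(self._x, self._W) + self._b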
Here are some explanations: self._W is the weight matrix, and self._dW = np.dot(self._x.T, dy) is the question-related part. That line is derived from the forward equality

    X * W + B = Y,    with shapes (N,2) * (2,3) + (1,3) = (N,3).

The shape notation like (N,2) comes from the X.shape attribute of a numpy.array. To simplify my problem I chose these concrete dimension numbers.
That's the end of the terminology; now here comes the question:
By some math (omitted here), we can arrive at the equality used in back-propagation (which is why self._dW = np.dot(self._x.T, dy) appears in the code):

    dL/dW = X^T * dL/dY,    with shapes (2,3) = (2,N) * (N,3).
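A quick numpy check of these shapes, with a made-up batch size N = 4 (dY stands for dL/dY):

    import numpy as np

    N = 4
    X = np.random.randn(N, 2)    # N flattened inputs, 2 "pixels" each
    W = np.random.randn(2, 3)    # weight matrix
    B = np.random.randn(1, 3)    # bias row, broadcast over the batch
    Y = np.dot(X, W) + B         # (N,2) * (2,3) + (1,3) -> (N,3)

    dY = np.random.randn(N, 3)   # upstream gradient dL/dY, same shape as Y
    dW = np.dot(X.T, dY)         # (2,N) * (N,3) -> (2,3), independent of N
    print(dW.shape)              # (2, 3)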
Notice that no matter how I adjust N, the batch size, the dimension of dL/dW, the partial derivative of L with respect to the weight matrix, won't change: it is always (2,3). Does this mean that the total effect of these N samples is combined/condensed into dL/dW? This is related to how I would implement the output layer, e.g. a softmax-cross-entropy layer as the final layer. My earlier conclusion was that a batch of size N means doing the back-propagation N times, and that dividing the gradient dL/dW by N is needed to average/amortize the total effect of that batch. But now it seems I only have to do it once, and the division should happen "in the first step".
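Since matrix multiplication is linear in each argument, dividing by N in the first backward step should give the same dW as summing first and dividing at the end. A quick sanity check with random made-up data:

    import numpy as np

    N = 8
    X = np.random.randn(N, 2)
    dY = np.random.randn(N, 3)                # un-averaged per-sample dL/dY

    dW_divide_last = np.dot(X.T, dY) / N      # sum over the batch, then divide
    dW_divide_first = np.dot(X.T, dY / N)     # divide the upstream gradient first
    print(np.allclose(dW_divide_last, dW_divide_first))   # True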
For reference, I also found a version which instead seems to divide at the last step: mnielsen/neural-networks-and-deep-learning on GitHub.
Since the softmax-cross-entropy layer class is the last layer of the net, in back-propagation it becomes the "first step" mentioned above:
class SoftmaxCrossEntropy(object):
    ...
    def backward(self, dout=1):
        batch_size = self._t.shape[0]
        if self._t.size == self._y.size:   # labels are one-hot
            dx = (self._y - self._t) / batch_size   # <-- why divided by N here?
        else:                              # labels are integer class indices
            dx = self._y.copy()
            dx[np.arange(batch_size), self._t] -= 1
            dx = dx / batch_size                    # <-- why divided by N here?
        return dx
    ...
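To check that (self._y - self._t) / batch_size really is the gradient of the batch-averaged cross-entropy loss with respect to the scores fed into the softmax, I compared it against a finite-difference estimate (the softmax and loss helpers below are my own, not from the book):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def mean_cross_entropy(x, t):
        # Cross-entropy averaged over the batch; t is one-hot.
        y = softmax(x)
        return -np.sum(t * np.log(y + 1e-12)) / x.shape[0]

    N = 4
    x = np.random.randn(N, 10)                    # scores before the softmax
    t = np.eye(10)[np.random.randint(0, 10, N)]   # one-hot targets

    analytic = (softmax(x) - t) / N               # the layer's backward formula

    eps = 1e-6
    numeric = np.zeros_like(x)
    for i in range(N):
        for j in range(10):
            x[i, j] += eps
            up = mean_cross_entropy(x, t)
            x[i, j] -= 2 * eps
            down = mean_cross_entropy(x, t)
            x[i, j] += eps
            numeric[i, j] = (up - down) / (2 * eps)

    print(np.allclose(analytic, numeric, atol=1e-5))   # True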
Does it mean that the effect of these N samples of input is combined/condensed into dW?
The i-th row of X is a 1-d array corresponding to the i-th flattened image. If I transpose X into X^T, then its columns represent those flattened images. If N increases, the dimension of the result (dW) won't change, but the number of intermediate steps in computing each element of

    dW = X^T * dL/dY

increases, which means

    dW_(i,j) = Sum_{n=1..N} ( pixel_info_(i,n) * class_scores_deriv_(n,j) ),

where N is the batch size.
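Spelling that summation out as plain loops makes it visible that every sample in the batch contributes one term to each dW_(i,j):

    import numpy as np

    N = 5
    X = np.random.randn(N, 2)    # rows: flattened images (pixel_info)
    dY = np.random.randn(N, 3)   # rows: per-sample class-score derivatives

    dW_loop = np.zeros((2, 3))
    for i in range(2):
        for j in range(3):
            for n in range(N):   # each of the N samples adds one term
                dW_loop[i, j] += X.T[i, n] * dY[n, j]

    print(np.allclose(dW_loop, np.dot(X.T, dY)))   # True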
Clearly each class_scores_deriv_(n,j) "joins" the determination of dW_(i,j). This implies that the division by batch_size = N is needed, because the summation above is the total effect of that batch, not the average, and what I need is for dW to represent the average impact of that batch. If I divide each element of class_scores_deriv., which is what the line dx = (self._y - self._t) / batch_size does, then
    Sum_{n=1..N} ( pixel_info_(i,n) * (1/N) * class_scores_deriv_(n,j) )
    = (1/N) * Sum_{n=1..N} ( pixel_info_(i,n) * class_scores_deriv_(n,j) )
    = (1/N) * dW_(i,j),

which is the real dW I want.
So the answer (I hope) should be: the entire batch determines dW, but to condense it into an average, the division in SoftmaxCrossEntropy::backward(self, dout=1) is needed.