I'm trying to follow a book about deep learning, but I found the part about the affine layer of the network confusing. Say I have a network which accepts handwritten-digit (0~9) images (MNIST), each flattened into a one-dimensional array, e.g. np.array([123, 255, 0, ...]), and it outputs the scores for each possible class, e.g. np.array([0., 0., 0.3, 0., 0., 0.6, 0., 0., 0., 0.1]) (so the image is probably the digit 5).
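For concreteness, a tiny sketch of the shapes involved (the pixel values here are made up, and the image is truncated to three pixels):

    import numpy as np

    # A real flattened 28x28 MNIST image would have 784 entries.
    x = np.array([123, 255, 0])
    # One score per digit class 0-9; the highest score is the prediction.
    scores = np.array([0., 0., 0.3, 0., 0., 0.6, 0., 0., 0., 0.1])
    print(scores.argmax())   # -> 5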
Here is my implementation of the affine layer:
class AffineLayer(object):
    ...
    def backward(self, dy):
        # dy is dL/dY, the gradient arriving from the next layer.
        dx = np.dot(dy, self._W.T)               # dL/dX = dL/dY * W^T
        self._dW = np.dot(self._x.T, dy)         # Question related part
        self._db = np.sum(dy, axis=0)            # bias gradient: sum over the batch
        dx = dx.reshape(self._original_x_shape)  # undo the flattening from forward
        return dx
    ...
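For context, here is a minimal sketch of the matching forward pass (not copied from the book; the attribute names just mirror what backward above relies on):

    import numpy as np

    class AffineLayer(object):
        def __init__(self, W, b):
            self._W = W    # weight matrix, e.g. shape (2, 3) below
            self._b = b    # bias vector, e.g. shape (3,)
            self._x = None
            self._original_x_shape = None

        def forward(self, x):
            # Remember the original shape so backward can restore it.
            self._original_x_shape = x.shape
            self._x = x.reshape(x.shape[0], -1)   # flatten each sample to 1-d
            return np.dot(self._x, self._W) + self._b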
Here are some explanations: self._W is the weight matrix, and self._dW = np.dot(self._x.T, dy) is the question-related part. That line is derived from the forward equality

    X * W + B = Y,    with shapes (N,2) * (2,3) + (1,3) = (N,3).

The shape notation like (N,2) comes from the X.shape attribute of a numpy.array. To simplify my problem I chose these concrete dimension numbers.
That's the end of the terminology; now here comes the question:
By some math (omitted here), we can arrive at the equality used in back-propagation (which is why self._dW = np.dot(self._x.T, dy) appears in the code):

    dL/dW = X^T * dL/dY,    with shapes (2,3) = (2,N) * (N,3).
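A quick numpy check of these shapes, with a made-up batch size N = 4 (dY stands for dL/dY):

    import numpy as np

    N = 4
    X = np.random.randn(N, 2)    # N flattened inputs, 2 "pixels" each
    W = np.random.randn(2, 3)    # weight matrix
    B = np.random.randn(1, 3)    # bias row, broadcast over the batch
    Y = np.dot(X, W) + B         # (N,2) * (2,3) + (1,3) -> (N,3)

    dY = np.random.randn(N, 3)   # upstream gradient dL/dY, same shape as Y
    dW = np.dot(X.T, dY)         # (2,N) * (N,3) -> (2,3), independent of N
    print(dW.shape)              # (2, 3)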
Notice that no matter how I adjust N, the batch size, the dimension of dL/dW, the partial derivative of L with respect to the weight matrix, won't change: it is always (2,3). Does this mean that the total effect of these N samples is combined/condensed into dL/dW? This is related to how I would implement the output layer, e.g. a softmax-cross-entropy layer as the final layer. My earlier conclusion was that a batch of size N means doing the back-propagation N times, and that dividing the gradient dL/dW by N is needed to average/amortize the total effect of that batch. But now it seems I only have to do it once, and the division should happen "in the first step".
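Since matrix multiplication is linear in each argument, dividing by N in the first backward step should give the same dW as summing first and dividing at the end. A quick sanity check with random made-up data:

    import numpy as np

    N = 8
    X = np.random.randn(N, 2)
    dY = np.random.randn(N, 3)                # un-averaged per-sample dL/dY

    dW_divide_last = np.dot(X.T, dY) / N      # sum over the batch, then divide
    dW_divide_first = np.dot(X.T, dY / N)     # divide the upstream gradient first
    print(np.allclose(dW_divide_last, dW_divide_first))   # True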
For reference, I also found a version which instead seems to divide at the last step: mnielsen/neural-networks-and-deep-learning on GitHub.
Since the softmax-cross-entropy layer class is the last layer of the net, in back-propagation it becomes the "first step" mentioned above:
class SoftmaxCrossEntropy(object):
    ...
    def backward(self, dout=1):
        batch_size = self._t.shape[0]
        if self._t.size == self._y.size:   # labels are one-hot
            dx = (self._y - self._t) / batch_size   # <-- why divided by N here?
        else:                              # labels are integer class indices
            dx = self._y.copy()
            dx[np.arange(batch_size), self._t] -= 1
            dx = dx / batch_size                    # <-- why divided by N here?
        return dx
    ...
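To check that (self._y - self._t) / batch_size really is the gradient of the batch-averaged cross-entropy loss with respect to the scores fed into the softmax, I compared it against a finite-difference estimate (the softmax and loss helpers below are my own, not from the book):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def mean_cross_entropy(x, t):
        # Cross-entropy averaged over the batch; t is one-hot.
        y = softmax(x)
        return -np.sum(t * np.log(y + 1e-12)) / x.shape[0]

    N = 4
    x = np.random.randn(N, 10)                    # scores before the softmax
    t = np.eye(10)[np.random.randint(0, 10, N)]   # one-hot targets

    analytic = (softmax(x) - t) / N               # the layer's backward formula

    eps = 1e-6
    numeric = np.zeros_like(x)
    for i in range(N):
        for j in range(10):
            x[i, j] += eps
            up = mean_cross_entropy(x, t)
            x[i, j] -= 2 * eps
            down = mean_cross_entropy(x, t)
            x[i, j] += eps
            numeric[i, j] = (up - down) / (2 * eps)

    print(np.allclose(analytic, numeric, atol=1e-5))   # True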
Does it mean that the effect of these N samples of input is combined/condensed into dW?
The i-th row of X is a 1-d array corresponding to the i-th flattened image. If I transpose X into X^T, then its columns represent those flattened images. If N increases, the dimension of the result (dW) won't change, but the number of intermediate steps in computing each element of

    dW = X^T * dL/dY

increases, which means

    dW_(i,j) = Sum_{n=1..N} ( pixel_info_(i,n) * class_scores_deriv_(n,j) ),

where N is the batch size.
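Spelling that summation out as plain loops makes it visible that every sample in the batch contributes one term to each dW_(i,j):

    import numpy as np

    N = 5
    X = np.random.randn(N, 2)    # rows: flattened images (pixel_info)
    dY = np.random.randn(N, 3)   # rows: per-sample class-score derivatives

    dW_loop = np.zeros((2, 3))
    for i in range(2):
        for j in range(3):
            for n in range(N):   # each of the N samples adds one term
                dW_loop[i, j] += X.T[i, n] * dY[n, j]

    print(np.allclose(dW_loop, np.dot(X.T, dY)))   # True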
Clearly each class_scores_deriv_(n,j) "joins" the determination of dW_(i,j). This implies that the division by batch_size = N is needed, because the summation above is the total effect of that batch, not the average, and what I need is for dW to represent the average impact of that batch. If I divide each element of class_scores_deriv., which is what the line dx = (self._y - self._t) / batch_size does, then
    Sum_{n=1..N} ( pixel_info_(i,n) * (1/N) * class_scores_deriv_(n,j) )
    = (1/N) * Sum_{n=1..N} ( pixel_info_(i,n) * class_scores_deriv_(n,j) )
    = (1/N) * dW_(i,j),

which is the real dW I want.
So the answer (I hope) should be: the entire batch determines dW, but to condense it into an average, the division in SoftmaxCrossEntropy::backward(self, dout=1) is needed.