
Minibatch SGD gradient computation: average or sum?

I am trying to understand how the gradients are computed when using minibatch SGD. I implemented it in the CS231n online course, but only later realized that in the intermediate layers the gradient is basically the sum of the gradients computed for each sample (the same holds for the implementations in Caffe and TensorFlow). It is only in the last layer (the loss) that they are averaged over the number of samples. Is this correct? If so, does it mean that since they are averaged in the last layer, all the gradients are also averaged automatically when doing backprop? Thanks!

It is best to understand why SGD works first.

A neural network is essentially a very complex composite function of an input vector x, a label y (the target variable, whose form depends on whether the problem is classification or regression) and a parameter vector w. Assume that we are working on classification. We are then doing maximum likelihood estimation of the parameter vector w (strictly speaking MAP estimation, since we will almost certainly use L2 or L1 regularization, but that is too much of a technicality for now). Assuming that the samples are independent, we have the following likelihood to maximize:

p(y1|w,x1)p(y2|w,x2) ... p(yN|w,xN)

Optimizing this with respect to w is a mess because all of these probabilities are multiplied together (this produces an insanely complicated derivative with respect to w). We use log probabilities instead: taking the log does not change the locations of the extrema, and negating it turns the maximization into the minimization of a cost. We also divide by N, so that we can treat our training set as an empirical probability distribution p(x):

J(X,Y,w)=-(1/N)(log p(y1|w,x1) + log p(y2|w,x2) + ... + log p(yN|w,xN))

This is the actual cost function we have. What the neural network does is model the probability function p(yi|w,xi). It can be a very complex 1000+ layer ResNet or just a simple perceptron.
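As a rough illustration of the cost function above, here is a minimal NumPy sketch. It assumes a linear softmax classifier standing in for p(yi|w,xi); the sizes, the random data and the helper names (softmax, nll_cost) are made up for the example and are not part of the original answer. It checks that the averaged negative log-likelihood is the same quantity as -(1/N) times the log of the product of probabilities.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def nll_cost(W, X, y):
    """J(X,Y,w): mean negative log-likelihood of the correct labels."""
    probs = softmax(X @ W)                    # probabilities for every class
    p_correct = probs[np.arange(len(y)), y]   # p(yi|w,xi) for each sample
    return -np.mean(np.log(p_correct))

rng = np.random.default_rng(0)
N, D, C = 20, 5, 3                 # hypothetical: samples, features, classes
X = rng.normal(size=(N, D))
y = rng.integers(0, C, size=N)
W = rng.normal(scale=0.1, size=(D, C))

# Likelihood as a product of per-sample probabilities.
p_correct = softmax(X @ W)[np.arange(N), y]
likelihood = np.prod(p_correct)

J = nll_cost(W, X, y)
print(J, -np.log(likelihood) / N)  # the two numbers agree
```

The model here is deliberately tiny; any network that outputs class probabilities could take the place of softmax(X @ W).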

Now the derivative with respect to w is simple to state, since we have a sum:

dJ(X,Y,w)/dw = -(1/N)(dlog p(y1|w,x1)/dw + dlog p(y2|w,x2)/dw + ... + dlog p(yN|w,xN)/dw)

The above is the exact gradient. But this full-batch calculation is expensive to compute. What if we are working on a dataset with 1M training samples? Worse, the training set may be a stream of samples x, which has an infinite size.

The stochastic part of SGD comes into play here. Pick m samples, with m << N, randomly and uniformly from the training set and calculate the derivative using only them:

dJ(X,Y,w)/dw ≈ dJ'/dw = -(1/m)(dlog p(y1|w,x1)/dw + dlog p(y2|w,x2)/dw + ... + dlog p(ym|w,xm)/dw)

Remember that we had an empirical (or actual, in the case of an infinite training set) data distribution p(x). Drawing m samples from p(x) and averaging their gradients produces an unbiased estimator, dJ'/dw, of the actual derivative dJ(X,Y,w)/dw. What does that mean? Draw many such sets of m samples, calculate a dJ'/dw estimate from each, average those estimates, and you get very close to dJ(X,Y,w)/dw (exactly equal in the limit of infinite sampling). It can be shown that these noisy but unbiased gradient estimates behave like the original gradient in the long run. On average, SGD follows the actual gradient's path (although it can end up in a different local minimum, depending on the choice of learning rate).

The minibatch size m is directly related to the inherent error in the noisy estimate dJ'/dw. If m is large, you get gradient estimates with low variance and can use larger learning rates. If m is small or m=1 (online learning), the variance of the estimator dJ'/dw is very high and you should use smaller learning rates, or the algorithm may easily diverge.
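The unbiasedness and variance claims can be checked numerically. The following is only a sketch under assumptions of my own choosing: a small synthetic logistic-regression problem, gradients evaluated at w = 0, and 5000 minibatches drawn without replacement for each batch size. None of these specifics come from the answer itself.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 2000, 4                      # hypothetical dataset size and dimension
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = (rng.random(N) < 1 / (1 + np.exp(-X @ true_w))).astype(float)
w = np.zeros(D)                     # point at which gradients are evaluated

def grad(w, Xb, yb):
    """Gradient of the mean negative log-likelihood over the rows of Xb."""
    p = 1 / (1 + np.exp(-Xb @ w))
    return Xb.T @ (p - yb) / len(yb)

full = grad(w, X, y)                # dJ/dw on the whole training set

stats = {}
for m in (1, 10, 100):
    estimates = np.empty((5000, D))
    for t in range(len(estimates)):
        idx = rng.choice(N, size=m, replace=False)   # uniform minibatch
        estimates[t] = grad(w, X[idx], y[idx])       # one dJ'/dw estimate
    gap = np.abs(estimates.mean(axis=0) - full).max()
    variance = estimates.var(axis=0).sum()
    stats[m] = (gap, variance)
    print(f"m={m:3d}  max|mean estimate - full| = {gap:.4f}  "
          f"total variance = {variance:.4f}")
```

For every batch size the averaged estimates should land close to the full gradient, while the total variance should shrink roughly in proportion to 1/m.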

Enough theory. Your actual question was:

It is only in the last layer (the loss) that they are averaged by the number of samples. Is this correct? if so, does it mean that since in the last layer they are averaged, when doing backprop, all the gradients are also averaged automatically? Thanks!

Yes, it is enough to divide by m in the last layer (the loss): once the gradient at the loss is multiplied by (1/m), the chain rule propagates that factor to every parameter. You should not divide again separately for each parameter; that would apply the factor twice and give an invalid gradient.
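To see the (1/m) factor propagate, here is a small hand-written backprop sketch. The two-layer ReLU network, the layer sizes and the function names are assumptions chosen for illustration, not the code of CS231n, Caffe or TensorFlow. The division by m happens once, at the loss; the weight gradients are then compared against the average of gradients computed one sample at a time.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def backprop(W1, W2, X, y):
    """Gradients of the mean cross-entropy loss of a 2-layer ReLU network."""
    m = X.shape[0]
    h = np.maximum(0, X @ W1)          # hidden activations, shape (m, H)
    probs = softmax(h @ W2)            # class probabilities, shape (m, C)

    # Loss layer: the only place where 1/m appears.
    dscores = probs.copy()
    dscores[np.arange(m), y] -= 1
    dscores /= m

    # Earlier layers: plain matrix products, which sum over the batch axis.
    dW2 = h.T @ dscores
    dh = dscores @ W2.T
    dh[h <= 0] = 0                     # ReLU passes gradient only where h > 0
    dW1 = X.T @ dh
    return dW1, dW2

rng = np.random.default_rng(0)
m, D, H, C = 8, 5, 6, 3                # hypothetical sizes
X = rng.normal(size=(m, D))
y = rng.integers(0, C, size=m)
W1 = rng.normal(size=(D, H))
W2 = rng.normal(size=(H, C))

dW1, dW2 = backprop(W1, W2, X, y)

# Reference: run each sample alone (batch size 1) and average the results.
per_sample = [backprop(W1, W2, X[i:i + 1], y[i:i + 1]) for i in range(m)]
avg_dW1 = np.mean([g[0] for g in per_sample], axis=0)
avg_dW2 = np.mean([g[1] for g in per_sample], axis=0)

print(np.allclose(dW1, avg_dW1), np.allclose(dW2, avg_dW2))
```

Dividing dW1 or dW2 by m a second time would break this agreement, which is the "invalid" case the answer warns about.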

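In the last layer the gradients are averaged, and in the previous layers they are summed. The sums that appear when backpropagating through the previous layers run over the different nodes of the next layer, not over the examples (the sum over examples appears only in the weight gradients, where each example's contribution already carries the 1/m factor from the loss). The averaging is done only to make the learning process behave similarly when you change the batch size. Everything should work the same if you sum in every layer, including the loss, and decrease the learning rate accordingly.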
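The equivalence between averaging and summing with a smaller learning rate can be sketched as follows. The logistic-regression gradient, the batch sizes and the learning rate are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
w = rng.normal(size=D)
lr = 0.1                                   # hypothetical learning rate

def summed_grad(w, Xb, yb):
    """Sum (not mean) of per-sample logistic-regression gradients."""
    p = 1 / (1 + np.exp(-Xb @ w))
    return Xb.T @ (p - yb)

for m in (8, 64, 512):
    Xb = rng.normal(size=(m, D))
    yb = rng.integers(0, 2, size=m).astype(float)
    g_sum = summed_grad(w, Xb, yb)
    g_mean = g_sum / m

    step_mean = lr * g_mean                # averaged gradient, rate lr
    step_sum_scaled = (lr / m) * g_sum     # summed gradient, rate lr / m
    step_sum_raw = lr * g_sum              # summed gradient, rate unchanged

    assert np.allclose(step_mean, step_sum_scaled)
    print(m, np.linalg.norm(step_mean), np.linalg.norm(step_sum_raw))
```

With the averaged gradient the step length stays on a similar scale across batch sizes, whereas the unscaled summed gradient produces a step that is m times longer, which is why the learning rate has to shrink if you sum.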


 