
Issue with gradient calculation in a Neural Network (stuck at 7% error in MNIST)

Hi, I am having an issue with gradient checking when implementing a neural network in Python using numpy. I am using the MNIST dataset and trying to use mini-batch gradient descent.

I have checked the math and it looks good on paper, so maybe you can give me a hint about what is happening here:

EDIT: One answer made me realize that the cost function was indeed being calculated wrong. However, that does not explain the problem with the gradient, as it is calculated using back_prop. I get a 7% error rate using 300 units in the hidden layer, with mini-batch gradient descent using rmsprop, 30 epochs and 100 batches (learning_rate = 0.001, small due to rmsprop).

Each input has 768 features, so for 100 samples I have a matrix. MNIST has 10 classes.

X = NoSamplesxFeatures = 100x768

Y = NoSamplesxClasses = 100x10

I am using a neural network with one hidden layer of size 300 when fully training. Another question I have is whether I should use a softmax output function for calculating the error... which I think not. But I am kind of a newbie to all of this, and the obvious might seem strange to me.

(NOTE: I know the code is ugly, but this is my first Python/Numpy code written under pressure, so bear with me.)

Here are _back_prop and the activations:

  def sigmoid(z):
     return np.true_divide(1,1 + np.exp(-z) )

  # not really the derivative w.r.t. z - this is the shortcut that expects the
  # sigmoid activation a as input, since sigmoid'(z) = a*(1 - a)
  def sigmoid_prime(a):
     return  (a)*(1 - a)

  def _back_prop(self,W,X,labels,f=sigmoid,fprime=sigmoid_prime,lam=0.001):

    """
    Calculate the partial derivatives of the cost function using backpropagation.
    """     
    #Weight for first layer and hidden layer
    Wl1,bl1,Wl2,bl2  = self._extract_weights(W)


    # get the forward prop value
    layers_outputs = self._forward_prop(W,X,f)

    #from a number make a binary vector, for mnist 1x10 with all 0 but the number.
    y = self.make_1_of_c_encoding(labels)
    num_samples = X.shape[0] # layers_outputs[-1].shape[0]

    # Dot product return  Numsamples (N) x Outputs (No CLasses)
    # Y is NxNo Clases
    # Layers output to


    big_delta = np.zeros(Wl2.size + bl2.size + Wl1.size + bl1.size)
    big_delta_wl1, big_delta_bl1, big_delta_wl2, big_delta_bl2 = self._extract_weights(big_delta)


    # calculate the gradient for each training sample in the batch and accumulate it

    for i,x in enumerate(X):

        # Error with respect  the output
        dE_dy =  layers_outputs[-1][i,:] -  y[i,:] 

        # bias hidden layer
        big_delta_bl2 +=   dE_dy


        # get the error for the hiddlen layer
        dE_dz_out  = dE_dy * fprime(layers_outputs[-1][i,:])

        #and for the input layer
        dE_dhl = dE_dy.dot(Wl2.T)

        #bias input layer
        big_delta_bl1 += dE_dhl

        small_delta_hl = dE_dhl*fprime(layers_outputs[-2][i,:])

        #here calculate the gradient for the weights in the hidden and first layer
        big_delta_wl2 += np.outer(layers_outputs[-2][i,:],dE_dz_out)
        big_delta_wl1 +=   np.outer(x,small_delta_hl)





    # divide by number of samples in the batch (should be done here)?
    big_delta_wl2 = np.true_divide(big_delta_wl2,num_samples) + lam*Wl2*2
    big_delta_bl2 = np.true_divide(big_delta_bl2,num_samples)
    big_delta_wl1 = np.true_divide(big_delta_wl1,num_samples) + lam*Wl1*2
    big_delta_bl1 = np.true_divide(big_delta_bl1,num_samples)

    # return 
    return np.concatenate([big_delta_wl1.ravel(),
                           big_delta_bl1,
                           big_delta_wl2.ravel(),
                           big_delta_bl2.reshape(big_delta_bl2.size)])

Now the feed-forward pass:

def _forward_prop(self,W,X,transfer_func=sigmoid):
    """
    Return the output of the net a Numsamples (N) x Outputs (No CLasses)
    # an array containing the size of the output of all of the laye of the neural net
    """

    # Hidden layer DxHLS
    weights_L1,bias_L1,weights_L2,bias_L2 = self._extract_weights(W)    

    # Output layer HLSxOUT

    # A_2 = N x HLS
    A_2 = transfer_func(np.dot(X,weights_L1) + bias_L1 )

    # A_3 = N x  Outputs
    A_3 = transfer_func(np.dot(A_2,weights_L2) + bias_L2)

    # output layer
    return [A_2,A_3]

And the cost function used for gradient checking:

 def cost_function(self,W,X,labels,reg=0.001):
    """
    reg: regularization term
    No weight decay term - lets leave it for later
    """

    outputs = self._forward_prop(W,X,sigmoid)[-1] #take the last layer out
    sample_size = X.shape[0]

    y = self.make_1_of_c_encoding(labels)

    e1 = np.sum((outputs - y)**2, axis=1)*0.5

    #error = e1.sum(axis=1)
    error = e1.sum()/sample_size + 0.5*reg*(np.square(W)).sum()

    return error

What kind of results are you getting when you run gradient checking? Often you can tease out the nature of the implementation error by comparing the output of your gradient with the output produced by gradient checking.
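
For reference, a minimal central-difference check is sketched below. It reuses the cost_function and _back_prop names from the question (called with their default arguments), while check_gradient itself, its parameters and the error thresholds are only illustrative:

    import numpy as np

    def check_gradient(net, W, X, labels, eps=1e-5, n_checks=20):
        """Compare the analytic gradient from _back_prop against a
        central-difference estimate at a few random weight coordinates."""
        analytic = net._back_prop(W, X, labels)
        rng = np.random.RandomState(0)
        for idx in rng.randint(0, W.size, n_checks):
            W_plus, W_minus = W.copy(), W.copy()
            W_plus[idx] += eps
            W_minus[idx] -= eps
            numeric = (net.cost_function(W_plus, X, labels) -
                       net.cost_function(W_minus, X, labels)) / (2.0 * eps)
            rel_err = abs(numeric - analytic[idx]) / max(1e-8, abs(numeric) + abs(analytic[idx]))
            print("idx %6d  numeric % .6e  analytic % .6e  rel err %.2e"
                  % (idx, numeric, analytic[idx], rel_err))

Relative errors around 1e-7 or smaller usually indicate agreement; values around 1e-2 or larger point at the coordinates (and hence the layer) where the analytic gradient is off.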

Furthermore, squared error is usually a poor choice for a classification task such as MNIST, and I would suggest using either a simple sigmoid top layer or a softmax. With a sigmoid, the cross-entropy function you want to use is:

L(h,Y) = -Y*log(h) - (1-Y)*log(1-h)

For a softmax:

L(h,Y) = -sum(Y*log(h))

where Y is the target given as a 1x10 vector and h is your predicted value; this easily extends to arbitrary batch sizes.
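
A rough numpy version of both losses, averaged over a batch, is sketched below (H is the N x 10 matrix of predictions, Y the N x 10 one-hot targets; the clipping is my own addition, only there to avoid log(0)):

    import numpy as np

    def sigmoid_cross_entropy(H, Y, eps=1e-12):
        # L(h,Y) = -Y*log(h) - (1-Y)*log(1-h), summed over classes, averaged over the batch
        H = np.clip(H, eps, 1.0 - eps)
        return np.mean(np.sum(-Y * np.log(H) - (1.0 - Y) * np.log(1.0 - H), axis=1))

    def softmax_cross_entropy(H, Y, eps=1e-12):
        # L(h,Y) = -sum(Y*log(h)); H is assumed to already hold softmax probabilities
        H = np.clip(H, eps, 1.0)
        return np.mean(np.sum(-Y * np.log(H), axis=1))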

In both cases the top-layer delta simply becomes:

delta = h - Y

And the top-layer gradient becomes:

grad = dot(delta, A_in)

where A_in is the input into the top layer from the previous layer.
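
In batch form that product is a single matrix multiplication. Here is a sketch using the question's shapes (A_2 is N x HLS, weights_L2 is HLS x Outputs), so the orientation is transposed relative to the one-sample formula above; the helper name is mine:

    import numpy as np

    def top_layer_gradient(A_2, H, Y):
        """Cross-entropy top-layer gradients for a whole batch.
        A_2: N x HLS hidden activations, H: N x Outputs predictions,
        Y: N x Outputs one-hot targets."""
        N = A_2.shape[0]
        delta = H - Y                      # N x Outputs
        grad_wl2 = A_2.T.dot(delta) / N    # HLS x Outputs, same shape as weights_L2
        grad_bl2 = delta.sum(axis=0) / N   # Outputs
        return grad_wl2, grad_bl2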

While I am having some trouble getting my head around your backprop routine, I suspect from your code that the error in the gradient is due to the fact that you are not calculating the top-level dE/dw_l2 correctly when using squared error, along with computing fprime on the incorrect input.

When using squared error, the top-layer delta should be:

delta = (h - Y) * fprime(Z_l2)

Here Z_l2 is the input into your transfer function for layer 2. Similarly, when computing fprime for the lower layers, you want to use the input to your transfer function (i.e. dot(X, weights_L1) + bias_L1).
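
Spelled out for the question's sigmoid output layer (A_2, weights_L2 and bias_L2 are the names from _forward_prop; the helper itself is just a sketch), that delta would be computed roughly as:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def square_error_top_delta(A_2, weights_L2, bias_L2, Y):
        """Top-layer delta for squared error: (h - Y) * sigmoid'(Z_l2)."""
        Z_l2 = A_2.dot(weights_L2) + bias_L2   # pre-activation input to the top transfer function
        H = sigmoid(Z_l2)                      # network output h
        # sigmoid'(z) evaluated at Z_l2 equals H * (1 - H)
        return (H - Y) * H * (1.0 - H)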

Hope that helps.

EDIT: As some added justification for using cross-entropy error over squared error, I would suggest looking up Geoffrey Hinton's lectures on linear classification methods: www.cs.toronto.edu/~hinton/csc2515/notes/lec3.ppt

EDIT2: I ran some tests locally with my implementation of neural nets on the MNIST dataset, with different parameters and 1 hidden layer, using RMSPROP. Here are the results:

Test1
Epochs: 30
Hidden Size: 300 
Learn Rate: 0.001
Lambda: 0.001
Train Method: RMSPROP with decrements=0.5 and increments=1.3 
Train Error: 6.1%
Test Error: 6.9%

Test2
Epochs: 30
Hidden Size: 300 
Learn Rate: 0.001
Lambda: 0.000002
Train Method: RMSPROP with decrements=0.5 and increments=1.3 
Train Error: 4.5%
Test Error: 5.7%

It already appears that if you decrease your lambda parameter by a couple of orders of magnitude, you should end up with better performance.


 