
Issue with gradient calculation in a Neural Network (stuck at 7% error in MNIST)

Hi, I am having an issue with my gradient check when implementing a neural network in Python using NumPy. I am using the MNIST dataset and trying to train with mini-batch gradient descent.

I have checked the math and it looks good on paper, so maybe you can give me a hint about what is happening here:

EDIT: One answer made me realize that the cost function was indeed being calculated wrong. However, that does not explain the problem with the gradient, since it is calculated using back_prop. I get a 7% error rate using 300 units in the hidden layer, mini-batch gradient descent with RMSProp, 30 epochs and 100 batches (learning_rate = 0.001, kept small because of RMSProp).
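For reference, the RMSProp update I mean is roughly the standard element-wise one (the names below are just illustrative, not my actual code):

    import numpy as np

    def rmsprop_step(W, grad, cache, learning_rate=0.001, decay=0.9, eps=1e-8):
        # Keep a running average of the squared gradient and scale the step by it.
        cache = decay * cache + (1 - decay) * grad ** 2
        W = W - learning_rate * grad / (np.sqrt(cache) + eps)
        return W, cache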

Each input has 768 features, so for 100 samples I have the matrices below. MNIST has 10 classes.

X = NoSamplesxFeatures = 100x768

Y = NoSamplesxClasses = 100x10

I am using a neural network with one hidden layer, with a hidden layer size of 300 when fully training. Another question I have is whether I should use a softmax output function for calculating the error... I think not, but I am kind of a newbie at all of this and the obvious might seem strange to me.

(NOTE: I know the code is ugly, but this is my first Python/NumPy code, written under pressure; bear with me.)

Here are back_prop and the activation functions:

  def sigmoid(z):
      return np.true_divide(1, 1 + np.exp(-z))

  # Not the real derivative computed from z - this is the shortcut version that
  # takes the activation a = sigmoid(z), which is faster.
  def sigmoid_prime(a):
      return a * (1 - a)

  def _back_prop(self,W,X,labels,f=sigmoid,fprime=sigmoid_prime,lam=0.001):

    """
    Calculate the partial derivatives of the cost function using backpropagation.
    """     
    # Weights and biases for the hidden layer and the output layer
    Wl1,bl1,Wl2,bl2  = self._extract_weights(W)


    # get the forward prop value
    layers_outputs = self._forward_prop(W,X,f)

    # Turn each label into a one-hot vector; for MNIST, a 1x10 vector of zeros with a 1 at the label's index.
    y = self.make_1_of_c_encoding(labels)
    num_samples = X.shape[0] # layers_outputs[-1].shape[0]

    # The dot product returns NumSamples (N) x Outputs (NoClasses)
    # Y is N x NoClasses


    big_delta = np.zeros(Wl2.size + bl2.size + Wl1.size + bl1.size)
    big_delta_wl1, big_delta_bl1, big_delta_wl2, big_delta_bl2 = self._extract_weights(big_delta)


    # calculate the gradient for each training sample in the batch and accumulate it

    for i,x in enumerate(X):

        # Error with respect to the output
        dE_dy =  layers_outputs[-1][i,:] -  y[i,:] 

        # bias of the output layer
        big_delta_bl2 +=   dE_dy


        # delta at the output layer
        dE_dz_out  = dE_dy * fprime(layers_outputs[-1][i,:])

        # and backpropagate the error to the hidden layer
        dE_dhl = dE_dy.dot(Wl2.T)

        # bias of the hidden layer
        big_delta_bl1 += dE_dhl

        small_delta_hl = dE_dhl*fprime(layers_outputs[-2][i,:])

        # gradients for the output-layer and hidden-layer weights
        big_delta_wl2 += np.outer(layers_outputs[-2][i,:],dE_dz_out)
        big_delta_wl1 +=   np.outer(x,small_delta_hl)

    # divide by the number of samples in the batch (should this be done here?)
    big_delta_wl2 = np.true_divide(big_delta_wl2,num_samples) + lam*Wl2*2
    big_delta_bl2 = np.true_divide(big_delta_bl2,num_samples)
    big_delta_wl1 = np.true_divide(big_delta_wl1,num_samples) + lam*Wl1*2
    big_delta_bl1 = np.true_divide(big_delta_bl1,num_samples)

    # return 
    return np.concatenate([big_delta_wl1.ravel(),
                           big_delta_bl1,
                           big_delta_wl2.ravel(),
                           big_delta_bl2.reshape(big_delta_bl2.size)])

Now the forward pass:

def _forward_prop(self,W,X,transfer_func=sigmoid):
    """
    Return the outputs of all layers of the net as a list of arrays;
    the last one is NumSamples (N) x Outputs (NoClasses).
    """

    # Hidden layer DxHLS
    weights_L1,bias_L1,weights_L2,bias_L2 = self._extract_weights(W)    

    # Output layer HLSxOUT

    # A_2 = N x HLS
    A_2 = transfer_func(np.dot(X,weights_L1) + bias_L1 )

    # A_3 = N x  Outputs
    A_3 = transfer_func(np.dot(A_2,weights_L2) + bias_L2)

    # output layer
    return [A_2,A_3]

And the cost function used for gradient checking:

 def cost_function(self,W,X,labels,reg=0.001):
    """
    reg: regularization term
    No weight decay term - let's leave it for later
    """

    outputs = self._forward_prop(W,X,sigmoid)[-1] #take the last layer out
    sample_size = X.shape[0]

    y = self.make_1_of_c_encoding(labels)

    e1 = np.sum((outputs - y)**2, axis=1)*0.5

    #error = e1.sum(axis=1)
    error = e1.sum()/sample_size + 0.5*reg*(np.square(W)).sum()

    return error

What kind of results are you getting when you run gradient checking? Often you can tease out the nature of the implementation error by comparing your analytic gradient against the output produced by gradient checking.
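For reference, a minimal central-difference check against your cost_function could look like this (just a sketch; net stands for your class instance, and it reuses the same flat W layout your _extract_weights expects):

    import numpy as np

    def numerical_gradient(cost_fn, W, X, labels, eps=1e-4):
        # Perturb one weight at a time and take the central difference of the cost.
        grad = np.zeros_like(W)
        for i in range(W.size):
            W_plus, W_minus = W.copy(), W.copy()
            W_plus[i] += eps
            W_minus[i] -= eps
            grad[i] = (cost_fn(W_plus, X, labels) - cost_fn(W_minus, X, labels)) / (2 * eps)
        return grad

    # analytic = net._back_prop(W, X, labels)
    # numeric  = numerical_gradient(net.cost_function, W, X, labels)
    # print(np.max(np.abs(analytic - numeric)))  # should be tiny if the analytic gradient is right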

Furthermore, square error is usually a poor choice for a classification task such as MNIST, and I would suggest using either a simple sigmoid top layer or a softmax. With a sigmoid, the cross-entropy function you want to use is:

L(h,Y) = -Y*log(h) - (1-Y)*log(1-h)

For a softmax:

L(h,Y) = -sum(Y*log(h))

where Y is the target given as a 1x10 vector and h is your predicted value; this easily extends to arbitrary batch sizes.
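As a concrete sketch (batch form, with the same N x classes shapes you already use; the helper names are illustrative, not from your code):

    import numpy as np

    def softmax(z):
        # Row-wise softmax; subtracting the row max keeps exp() from overflowing.
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def cross_entropy_sigmoid(h, Y):
        # L(h,Y) = -Y*log(h) - (1-Y)*log(1-h), averaged over the batch.
        return -np.mean(np.sum(Y * np.log(h + 1e-12) + (1 - Y) * np.log(1 - h + 1e-12), axis=1))

    def cross_entropy_softmax(h, Y):
        # L(h,Y) = -sum(Y*log(h)), averaged over the batch.
        return -np.mean(np.sum(Y * np.log(h + 1e-12), axis=1))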

In both cases the top-layer delta simply becomes:

delta = h - Y

And the top-layer gradient becomes:

grad = dot(delta, A_in)

Where A_in is the input into the top layer from the previous layer.
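In code, keeping the same orientation as your forward pass (Wl2 is hidden x classes, so A_in is your A_2), the batch-averaged version is roughly:

    import numpy as np

    def top_layer_gradients(h, Y, A_in):
        # h, Y: N x classes; A_in: N x hidden (your A_2).
        delta = h - Y                                # top-layer delta for cross entropy
        grad_W = np.dot(A_in.T, delta) / h.shape[0]  # hidden x classes, same shape as Wl2
        grad_b = delta.mean(axis=0)                  # one entry per class
        return grad_W, grad_b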

While I am having some trouble getting my head around your backprop routine, I suspect from your code that the error in the gradient comes from not calculating the top-level dE/dw_l2 correctly when using square error, along with computing fprime on the incorrect input.

When using square error the top layer delta should be:

delta = (h - Y) * fprime(Z_l2)

Here Z_l2 is the input into your transfer function for layer 2. Similarly, when computing fprime for the lower layers, you want to use the input to your transfer function (i.e. dot(X, weights_L1) + bias_L1).
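Concretely, if the forward pass also kept the pre-activations Z_l1 = dot(X, weights_L1) + bias_L1 and Z_l2 = dot(A_2, weights_L2) + bias_L2, the square-error backprop would look roughly like this (a sketch with illustrative names, regularization left out):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    def square_error_gradients(X, Y, Wl1, bl1, Wl2, bl2):
        # Forward pass, keeping the pre-activations Z for use in fprime.
        Z_l1 = np.dot(X, Wl1) + bl1      # N x hidden
        A_2 = sigmoid(Z_l1)
        Z_l2 = np.dot(A_2, Wl2) + bl2    # N x classes
        h = sigmoid(Z_l2)

        # Deltas: fprime is applied to the inputs of the transfer functions.
        delta_out = (h - Y) * sigmoid_prime(Z_l2)
        delta_hid = np.dot(delta_out, Wl2.T) * sigmoid_prime(Z_l1)

        # Batch-averaged gradients.
        n = X.shape[0]
        grad_Wl2 = np.dot(A_2.T, delta_out) / n
        grad_bl2 = delta_out.sum(axis=0) / n
        grad_Wl1 = np.dot(X.T, delta_hid) / n
        grad_bl1 = delta_hid.sum(axis=0) / n
        return grad_Wl1, grad_bl1, grad_Wl2, grad_bl2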

Hope that helps.

EDIT: As some added justification for using cross entropy error over square error I would suggest looking up Geoffrey Hinton's lectures on linear classification methods: www.cs.toronto.edu/~hinton/csc2515/notes/lec3.ppt

EDIT2: I ran some tests locally with my implementation of neural nets on the MNIST dataset with different parameters and 1 hidden layer using RMSPROP. Here are the results:

Test1
Epochs: 30
Hidden Size: 300 
Learn Rate: 0.001
Lambda: 0.001
Train Method: RMSPROP with decrements=0.5 and increments=1.3 
Train Error: 6.1%
Test Error: 6.9%

Test2
Epochs: 30
Hidden Size: 300 
Learn Rate: 0.001
Lambda: 0.000002
Train Method: RMSPROP with decrements=0.5 and increments=1.3 
Train Error: 4.5%
Test Error: 5.7%

It already appears that if you decrease your lambda parameter by a couple of orders of magnitude, you should end up with better performance.
