
Loss function increasing instead of decreasing

I have been trying to make my own neural network from scratch. After some time I made it, but I ran into a problem I cannot solve. I have been following a tutorial which shows how to do this. The problem I ran into is how my network updates weights and biases. I know that gradient descent won't always decrease the loss, and for a few epochs it might even increase a bit, but it still should decrease overall and work much better than mine does. Sometimes the whole process gets stuck at a loss of around 9 or 13 and cannot get out of it. I have checked many tutorials, videos and websites, but I couldn't find anything wrong in my code. Here are self.activate, self.dactivate, self.loss and self.dloss:

# sigmoid
self.activate = lambda x: np.divide(1, 1 + np.exp(-x))
self.dactivate = lambda x: np.multiply(self.activate(x), (1 - self.activate(x)))

# relu
self.activate = lambda x: np.where(x > 0, x, 0)
self.dactivate = lambda x: np.where(x > 0, 1, 0)

# loss I use (cross-entropy)
clip = lambda x: np.clip(x, 1e-10, 1 - 1e-10) # keeps x strictly inside (0, 1) so np.log never sees 0 (which I think is required)
self.loss = lambda x, y: -(np.sum(np.multiply(y, np.log(clip(x))) + np.multiply(1 - y, np.log(1 - clip(x))))/y.shape[0])
self.dloss = lambda x, y: -(np.divide(y, clip(x)) - np.divide(1 - y, 1 - clip(x)))
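
For reference, a standalone finite-difference check like the one below can confirm that dloss matches the numerical derivative of loss (this is just a sanity check, not part of my network; note that loss averages over the batch while dloss does not, so the analytic value is divided by y.shape[0] for the comparison):

import numpy as np

clip = lambda x: np.clip(x, 1e-10, 1 - 1e-10)
loss = lambda x, y: -(np.sum(np.multiply(y, np.log(clip(x))) + np.multiply(1 - y, np.log(1 - clip(x)))) / y.shape[0])
dloss = lambda x, y: -(np.divide(y, clip(x)) - np.divide(1 - y, 1 - clip(x)))

x = np.random.uniform(0.05, 0.95, size=(4, 1))      # predictions away from the clip boundaries
y = np.random.randint(0, 2, size=(4, 1)).astype(float)
eps = 1e-6

numerical = np.zeros_like(x)
for i in range(x.shape[0]):
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[i] += eps
    x_minus[i] -= eps
    numerical[i] = (loss(x_plus, y) - loss(x_minus, y)) / (2 * eps)

analytic = dloss(x, y) / y.shape[0]                  # divide because loss averages over the batch
print(np.max(np.abs(numerical - analytic)))          # should be close to 0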

The code I use for forwardpropagation:

self.activate(np.dot(X, self.weights) + self.biases) # it's an example for first hidden layer
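
For shape reference, here is the same step as a standalone example with made-up layer sizes (these are not my real sizes):

import numpy as np

batch, n_in, n_out = 32, 4, 8                        # arbitrary sizes for illustration
X = np.random.randn(batch, n_in)                     # one row per sample
weights = np.random.randn(n_in, n_out)
biases = np.random.randn(1, n_out)                   # broadcast over the batch

activate = lambda x: np.divide(1, 1 + np.exp(-x))    # sigmoid, as above
out = activate(np.dot(X, weights) + biases)
print(out.shape)                                     # (32, 8)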

And that's the code for backpropagation:

First part, in DenseNeuralNetwork class:

last_derivative = self.dloss(output, y)

for layer in reversed(self.layers):
    last_derivative = layer.backward(last_derivative, self.lr)

And the second part, in Dense class:

def backward(self, last_derivative, lr):
    w = self.weights

    dfunction = self.dactivate(last_derivative)
    d_w = np.dot(self.layer_input.T, dfunction) * (1./self.layer_input.shape[1])
    d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), last_derivative)

    self.weights -= np.multiply(lr, d_w)
    self.biases -= np.multiply(lr, d_b)

    return np.dot(dfunction, w.T)

I have also made a repl so you can check the whole code and run it without any problems.

1.

line 12

self.dloss = lambda x, y: -(np.divide(y, clip(x)) - np.divide(1 - y, 1 - clip(x)))

If you're going to clip x, you should clip y too. There are other ways to implement this, but if you are going to do it this way, change it to:

self.dloss = lambda x, y: -(np.divide(clip(y), clip(x)) - np.divide(1 - clip(y), 1 - clip(x)))

2.

line 75

dfunction = self.dactivate(last_derivative)

This backpropagation step is just wrong: by the chain rule, the activation derivative has to be evaluated at the layer's pre-activation input (np.dot(self.layer_input, self.weights) + self.biases) and then multiplied element-wise by the incoming gradient, not applied to the incoming gradient itself. Change it to (see also the combined sketch at the end of this answer):

dfunction = last_derivative*self.dactivate(np.dot(self.layer_input, self.weights) + self.biases)

3.

line 77

d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), last_derivative)

last_derivative should be dfunction here; I think this is just a mistake. Change it to:

d_b = (1./self.layer_input.shape[1]) * np.dot(np.ones((self.biases.shape[0], last_derivative.shape[0])), dfunction)

4.

line 85

self.weights = np.random.randn(neurons, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons))
self.biases = np.random.randn(1, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons))

Not sure where you are going with this, but I think the initialized values are too big. We're not doing precise hyperparameter tuning here, so I just made them small:

self.weights = np.random.randn(neurons, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons)) / 100
self.biases = np.random.randn(1, self.neurons) * np.divide(6, np.sqrt(self.neurons * neurons)) / 100
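
For what it's worth, the 6 / sqrt(fan_in * fan_out) factor looks like an attempt at Xavier/Glorot initialization; a more standard form of it (just my reading, not something required for the fix above) would be:

# Xavier/Glorot-style uniform initialization: the scale depends on fan-in and fan-out
limit = np.sqrt(6.0 / (neurons + self.neurons))
self.weights = np.random.uniform(-limit, limit, size=(neurons, self.neurons))
self.biases = np.zeros((1, self.neurons))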

All good now.

After this I changed the learning rate to 0.01 because training was too slow, and it worked fine.
I think you are misunderstanding backpropagation; you should probably double-check how it works. The other parts are OK, I think.
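
To make the whole picture concrete, here is a rough sketch of the Dense layer's forward and backward with fixes 2 and 3 applied. Caching the pre-activation as self.layer_z in forward is my own addition (it is not in the original code) so that backward does not have to recompute it:

def forward(self, layer_input):
    self.layer_input = layer_input
    # cache the pre-activation so backward can reuse it
    self.layer_z = np.dot(layer_input, self.weights) + self.biases
    return self.activate(self.layer_z)

def backward(self, last_derivative, lr):
    w = self.weights

    # chain rule: incoming gradient times the activation derivative
    # evaluated at the pre-activation (fix 2)
    dfunction = last_derivative * self.dactivate(self.layer_z)

    # same 1/shape[1] scaling as the original code; an average over the
    # batch would divide by shape[0] instead
    d_w = np.dot(self.layer_input.T, dfunction) * (1. / self.layer_input.shape[1])
    # sum the gradient over the batch for the biases, using dfunction (fix 3)
    d_b = (1. / self.layer_input.shape[1]) * np.dot(
        np.ones((self.biases.shape[0], dfunction.shape[0])), dfunction)

    self.weights -= np.multiply(lr, d_w)
    self.biases -= np.multiply(lr, d_b)

    # gradient with respect to this layer's input, passed to the previous layer
    return np.dot(dfunction, w.T)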

This can be caused by your training data: either it is too small, or it has too many diverse labels (that's what I gather from the code at the link you shared).

I re-ran your code several times and it produced different training performance each time. Sometimes the loss kept decreasing until the last epoch, sometimes it kept increasing, and once it decreased up to a point and then started increasing again (with a minimum loss of 0.5).

I think it is your training data that matters this time. The learning rate is good enough, though (assuming you got the linear combination, backpropagation, etc. right).
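
As a side note, if you want separate runs to be comparable, fixing NumPy's random seed before building the network makes the weight initialization, and therefore the loss curve, reproducible (this is just a suggestion, not a fix):

import numpy as np

np.random.seed(0)   # same initial weights every run, so loss curves can be compared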
