
What's the proper way to do back propagation in a deep fully connected neural network for binary classification

I tried to implement a deep fully connected neural network for binary classification using Python and NumPy, and used gradient descent as the optimization algorithm.

It turns out my model is heavily underfitting, even after 1000 epochs. The loss never improves beyond 0.69321. I checked my weight derivatives and instantly realized they are very small (as small as 1e-7); such small gradients mean my gradient descent updates are tiny and the model never approaches the global minimum. I will detail the math/pseudocode for forward and backward propagation below; please let me know if I'm on the right track. I follow the naming convention used in DeepLearning.ai by Andrew Ng.
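For reference, the loss being minimized here is binary cross-entropy (the DA4 formula below is its derivative). A minimal sketch of that loss, assuming Y and A4 have shape (1, m) as in the DeepLearning.ai convention, also shows why a loss stuck near 0.6931 is suspicious: 0.6931 ≈ -ln(0.5), which is exactly the loss of a model that predicts 0.5 for every sample.

import numpy as np

def binary_cross_entropy(Y, A4, eps=1e-12):
    # Mean binary cross-entropy over m samples; Y and A4 are assumed to have shape (1, m)
    m = Y.shape[1]
    A4 = np.clip(A4, eps, 1 - eps)  # keep log() away from 0
    return -np.sum(Y * np.log(A4) + (1 - Y) * np.log(1 - A4)) / m

# Sanity check: always predicting 0.5 gives -ln(0.5) ≈ 0.6931 regardless of the labels
Y = np.array([[0, 1, 1, 0]])
print(binary_cross_entropy(Y, np.full(Y.shape, 0.5)))  # ~0.6931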

Say we have a 4-layer neural network with only one node at the output layer to classify between 0 and 1.

X -> Z1 -> A1 -> Z2 -> A2 -> Z3 -> A3 -> Z4 -> A4

Forward propagation

Z1 = W1 dot_product X + B1
A1 = tanh_activation(Z1)

Z2 = W2 dot_product A1 + B2
A2 = tanh_activation(Z2)

Z3 = W3 dot_product A2 + B3
A3 = tanh_activation(Z3)

Z4 = W4 dot_product A3 + B4
A4 = sigmoid_activation(Z4)
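As a concrete reference, here is a minimal NumPy sketch of this forward pass, assuming the DeepLearning.ai shape convention (X is (n_x, m), each Wl is (n_l, n_l-1), each Bl is (n_l, 1)); the dictionary-based parameter layout is only illustrative, not my exact code.

def forward_propagation(X, params):
    # X: (n_x, m); params holds W1..W4 and B1..B4 in the shapes described above
    cache = {"A0": X}
    for l in range(1, 5):
        Z = params[f"W{l}"] @ cache[f"A{l-1}"] + params[f"B{l}"]
        # tanh on the three hidden layers, sigmoid on the single output node
        A = 1 / (1 + np.exp(-Z)) if l == 4 else np.tanh(Z)
        cache[f"Z{l}"] = Z
        cache[f"A{l}"] = A
    return cache["A4"], cache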

Backward Propagation

DA4 = -( Y / A4 - (1 - Y) / (1 - A4) )   ( derivative of the binary cross-entropy loss with respect to the output activation A4 )

DZ4 = DA4 * derivative_tanh(Z4)   ( derivative of the tanh activation, which I assume is 1 - (Z4)^2 )
DW4 = ( DZ4 dot_product A3.T ) / total_number_of_samples
DB4 = np.sum(DZ4, axis=1, keepdims=True) / total_number_of_samples
DA3 = W4.T dot_product DZ4


DZ3 = DA3 * derivative_tanh(Z3)
DW3 = ( DZ3 dot_product A2.T ) / total_number_of_samples
DB3 = np.sum(DZ3, axis=1, keepdims=True) / total_number_of_samples
DA2 = W3.T dot_product DZ3


DZ2 = DA2 * derivative_tanh(Z2)
DW2 = ( DZ2 dot_product A1.T ) / total_number_of_samples
DB2 = np.sum(DZ2, axis=1, keepdims=True) / total_number_of_samples
DA1 = W2.T dot_product DZ2



DZ1 = DA1 * derivative_tanh(Z1)
DW1 = ( DZ1 dot_product X.T ) / total_number_of_samples
DB1 = np.sum(DZ1, axis=1, keepdims=True) / total_number_of_samples
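For comparison, here is a minimal sketch of how this backward pass is usually written when the output node is a sigmoid trained with binary cross-entropy: the DA4 and DZ4 steps collapse into DZ4 = A4 - Y, and the tanh derivative is taken as 1 - A^2 using the stored activations. It reuses the cache from the forward_propagation sketch above and is only a reference, not my actual code.

def backward_propagation(Y, params, cache):
    # Y: (1, m); cache comes from the forward_propagation sketch above
    m = Y.shape[1]
    grads = {}
    DZ = cache["A4"] - Y  # sigmoid output + binary cross-entropy simplify to A4 - Y
    for l in range(4, 0, -1):
        grads[f"DW{l}"] = (DZ @ cache[f"A{l-1}"].T) / m
        grads[f"DB{l}"] = np.sum(DZ, axis=1, keepdims=True) / m
        if l > 1:
            DA_prev = params[f"W{l}"].T @ DZ
            DZ = DA_prev * (1 - cache[f"A{l-1}"] ** 2)  # tanh'(Z) = 1 - tanh(Z)^2 = 1 - A^2
    return grads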

This is my tanh implementation:

def tanh_activation(x):
    return np.tanh(x)


My tanh derivative implementation:

def derivative_tanh(x):
    return 1 - np.power(x, 2)
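For reference, the mathematical derivative of tanh at a pre-activation z is 1 - tanh(z)^2, which equals 1 - x^2 only when x is already the tanh output A rather than Z. A version that takes Z directly would look like the sketch below (shown purely for comparison with the function above; the name is mine).

def derivative_tanh_from_z(z):
    # d/dz tanh(z) = 1 - tanh(z)**2; equals 1 - a**2 only when a = tanh(z)
    return 1 - np.tanh(z) ** 2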

After the above backpropagation steps I updated the weights and biases using gradient descent with their respective derivatives. But no matter how many times I run the algorithm, the model never improves its loss beyond 0.69, and the derivatives of the output weights (in my case DW4) stay around 1e-7. I'm assuming that either my derivative_tanh function or my calculation of the DZ terms is off, which causes bad gradients to propagate back through the network. Please share your thoughts on whether my implementation of backprop is valid or not. TIA. I went through back propagation gradient descent calculus, how to optimize weights of neural network, and many other blogs, but couldn't find what I was looking for.
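The update step itself is plain gradient descent. A minimal sketch of that update, assuming the params/grads dictionaries from the sketches above and an illustrative learning rate:

def gradient_descent_update(params, grads, learning_rate=0.01):
    # Vanilla gradient descent: parameter := parameter - learning_rate * gradient
    for l in range(1, 5):
        params[f"W{l}"] -= learning_rate * grads[f"DW{l}"]
        params[f"B{l}"] -= learning_rate * grads[f"DB{l}"]
    return params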

I found a fix to my problem and answered it here: What's the proper way to do back propagation in a deep fully connected neural network. I suggest closing the thread.
