Back-propagation and forward-propagation for 2 hidden layers in neural network

My question is about forward and backward propagation for deep neural networks when the number of hidden layers is greater than 1.

I know what to do if I have a single hidden layer. In the single-hidden-layer case, if my input data X_train has n samples with d features (i.e. X_train is an (n, d) matrix and y_train is an (n, 1) vector) and I have h1 hidden units in the first hidden layer, then I compute Z_h1 = (X_train * w_h1) + b_h1, where w_h1 is a randomly initialized weight matrix of shape (d, h1) and b_h1 is a bias vector of shape (h1, 1). I apply the sigmoid activation A_h1 = sigmoid(Z_h1), and both A_h1 and Z_h1 have shape (n, h1). If I have t output units, then I use a weight matrix w_out of shape (h1, t) and a bias b_out of shape (t, 1) to get the output Z_out = (A_h1 * w_out) + b_out. From here I can get A_out = sigmoid(Z_out), which has shape (n, t). If I have a 2nd hidden layer (with h2 units) after the 1st hidden layer and before the output layer, what steps must I add to forward propagation, and which steps should I modify?
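For concreteness, here is a minimal NumPy sketch of the single-hidden-layer forward pass described above. The variable names follow the question, the sizes and data are placeholders, and the biases are stored as (1, h1) / (1, t) row vectors so they broadcast over the n samples:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    n, d, h1, t = 100, 4, 8, 3             # samples, features, hidden units, output units (placeholders)
    X_train = np.random.randn(n, d)        # placeholder input data

    w_h1  = np.random.randn(d, h1) * 0.01  # (d, h1)
    b_h1  = np.zeros((1, h1))              # (1, h1), broadcasts over the n samples
    w_out = np.random.randn(h1, t) * 0.01  # (h1, t)
    b_out = np.zeros((1, t))               # (1, t)

    Z_h1  = X_train @ w_h1 + b_h1          # (n, h1)
    A_h1  = sigmoid(Z_h1)                  # (n, h1)
    Z_out = A_h1 @ w_out + b_out           # (n, t)  -- uses w_out/b_out, not w_h1/b_h1
    A_out = sigmoid(Z_out)                 # (n, t)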

I also have an idea of how to tackle backpropagation for a single-hidden-layer network. For the single-hidden-layer example in the previous paragraph, I know that in the first backpropagation step (output layer -> hidden layer 1) I should do Step1_BP1: Err_out = A_out - y_train_onehot, where y_train_onehot is the one-hot representation of y_train and Err_out has shape (n, t). This is followed by Step2_BP1: delta_w_out = (A_h1)^T * Err_out and delta_b_out = sum(Err_out), where (.)^T denotes the matrix transpose. For the second backpropagation step (hidden layer 1 -> input layer), I do Step1_BP2: sig_deriv_h1 = A_h1 * (1 - A_h1), where sig_deriv_h1 has shape (n, h1). In the next step I do Step2_BP2: Err_h1 = (Err_out * w_out^T) ⊙ sig_deriv_h1, i.e. an element-wise product of two (n, h1) matrices, so Err_h1 has shape (n, h1). In the final step I do Step3_BP2: delta_w_h1 = (X_train)^T * Err_h1 and delta_b_h1 = sum(Err_h1). What backpropagation steps should I add if I have a 2nd hidden layer (with h2 units) after the 1st hidden layer and before the output layer? Should I modify the backpropagation steps for the one-hidden-layer case that I have described here?
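Continuing the sketch above, the single-hidden-layer backpropagation steps from this paragraph might look like the following in NumPy; the labels and the learning rate eta are placeholder assumptions:

    # continuing from the forward-pass sketch above
    y_train = np.random.randint(0, t, size=n)          # placeholder integer labels
    y_train_onehot = np.eye(t)[y_train]                # (n, t)

    # output layer -> hidden layer 1
    Err_out     = A_out - y_train_onehot               # (n, t)
    delta_w_out = A_h1.T @ Err_out                     # (h1, t)
    delta_b_out = Err_out.sum(axis=0, keepdims=True)   # (1, t)

    # hidden layer 1 -> input layer
    sig_deriv_h1 = A_h1 * (1.0 - A_h1)                 # (n, h1)
    Err_h1       = (Err_out @ w_out.T) * sig_deriv_h1  # element-wise product, (n, h1)
    delta_w_h1   = X_train.T @ Err_h1                  # (d, h1)
    delta_b_h1   = Err_h1.sum(axis=0, keepdims=True)   # (1, h1)

    eta = 0.1                                          # learning rate (assumed)
    w_out -= eta * delta_w_out;  b_out -= eta * delta_b_out
    w_h1  -= eta * delta_w_h1;   b_h1  -= eta * delta_b_h1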

● Let X be a matrix of samples with shape (n, d) , where n denotes number of samples, and d denotes number of features.

● Let w_h1 be the weight matrix, of shape (d, h1), and

● Let b_h1 be the bias vector, of shape (1, h1).

You need the following steps for forward and backward propagations:

FORWARD PROPAGATION:

Step 1:

Z_h1 = [ X • w_h1 ] + b_h1

shapes: (n,h1) = (n,d) • (d,h1) + (1,h1)

Here, the symbol • represents matrix multiplication, and h1 denotes the number of hidden units in the first hidden layer.
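As a quick check of the shapes in Step 1, a small NumPy snippet with made-up sizes might look like this:

    import numpy as np

    n, d, h1 = 5, 3, 4                   # made-up sizes
    X    = np.random.randn(n, d)
    w_h1 = np.random.randn(d, h1)
    b_h1 = np.zeros((1, h1))

    Z_h1 = X @ w_h1 + b_h1               # • is matrix multiplication; b_h1 broadcasts over the rows
    assert Z_h1.shape == (n, h1)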

Step 2:

Let Φ() be the activation function. We get:

a_h1 = Φ(Z_h1)

shapes: (n,h1) = Φ( (n,h1) )

Step 3:

Obtain new weights and biases:

w_h2 of shape (h1, h2), and

b_h2 of shape (1, h2).

Step 4:

Z_h2 = [ a_h1 • w_h2 ] + b_h2

shapes: (n,h2) = (n,h1) • (h1,h2) + (1,h2)

Here, h2 is the number of hidden units in the second hidden layer.

Step 5:

a_h2 = Φ(Z_h2)

shapes: (n,h2) = Φ( (n,h2) )

Step 6:

Obtain new weights and biases:

w_out of shape (h2, t), and

b_out of shape (1, t).

Here, t is the number of classes.

Step 7:

Z_out = [ a_h2 • w_out ] + b_out

shapes: (n,t) = (n,h2) • (h2,t) + (1,t)

Step 8:

a_out = Φ(Z_out)

shapes: (n,t) = Φ( (n,t) )
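Putting Steps 1 through 8 together, a minimal NumPy sketch of the whole two-hidden-layer forward pass might look like this (sigmoid is used for Φ; all sizes are placeholder assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    n, d, h1, h2, t = 100, 4, 8, 6, 3             # placeholder sizes

    X = np.random.randn(n, d)

    # parameters: randomly initialized weights, zero biases
    w_h1,  b_h1  = np.random.randn(d,  h1) * 0.01, np.zeros((1, h1))
    w_h2,  b_h2  = np.random.randn(h1, h2) * 0.01, np.zeros((1, h2))
    w_out, b_out = np.random.randn(h2, t)  * 0.01, np.zeros((1, t))

    # Steps 1-2: first hidden layer
    Z_h1 = X @ w_h1 + b_h1        # (n, h1)
    a_h1 = sigmoid(Z_h1)          # (n, h1)

    # Steps 3-5: second hidden layer
    Z_h2 = a_h1 @ w_h2 + b_h2     # (n, h2)
    a_h2 = sigmoid(Z_h2)          # (n, h2)

    # Steps 6-8: output layer
    Z_out = a_h2 @ w_out + b_out  # (n, t)
    a_out = sigmoid(Z_out)        # (n, t)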

BACKWARD PROPAGATION:

Step 1:

Construct the one-hot encoded matrix of the output labels (y_one-hot), with one column per unique class.

Error_out = a_out - y_one-hot

shapes: (n,t) = (n,t) - (n,t)

Step 2:

Δw_out = η ( a_h2^T • Error_out )

shapes: (h2,t) = (h2,n) • (n,t)

Δb_out = η [ ∑_{i=1}^{n} (Error_out,i) ]

shapes: (1,t) = (1,t)

Here η is the learning rate.

w_out = w_out - Δw_out   (weight update)

b_out = b_out - Δb_out   (bias update)
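In NumPy, backward Steps 1-2 could be written roughly as follows, continuing from the forward-pass sketch after Step 8; the integer labels y and the learning rate eta are assumptions:

    # continuing from the forward-pass sketch after Step 8
    y = np.random.randint(0, t, size=n)                   # placeholder integer labels
    y_onehot = np.eye(t)[y]                               # (n, t)
    eta = 0.1                                             # learning rate (assumed)

    Error_out = a_out - y_onehot                          # Step 1, (n, t)

    dw_out = eta * (a_h2.T @ Error_out)                   # (h2, t)
    db_out = eta * Error_out.sum(axis=0, keepdims=True)   # (1, t)

    w_out -= dw_out                                       # weight update
    b_out -= db_out                                       # bias update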

Step 3:

Error_2 = [ Error_out • w_out^T ] ✴ Φ′(a_h2)

shapes: (n,h2) = (n,t) • (t,h2) ✴ (n,h2)

Here, the symbol ✴ denotes element-wise matrix multiplication, and Φ′ denotes the derivative of the activation function. For the sigmoid, the derivative can be written in terms of the activation itself: Φ′(a_h2) = a_h2 ✴ (1 - a_h2).
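In code, Step 3 could look like the following, continuing from the snippet above (note that, if you follow the step order literally, w_out here is the matrix that was already updated in Step 2):

    # Φ′ for the sigmoid, written in terms of the activation itself
    sig_deriv_h2 = a_h2 * (1.0 - a_h2)               # (n, h2)
    Error_2 = (Error_out @ w_out.T) * sig_deriv_h2   # '*' here is element-wise, (n, h2)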

Step 4:

Δw_h2 = η ( a_h1^T • Error_2 )

shapes: (h1,h2) = (h1,n) • (n,h2)

Δb_h2 = η [ ∑_{i=1}^{n} (Error_2,i) ]

shapes: (1,h2) = (1,h2)

w_h2 = w_h2 - Δw_h2   (weight update)

b_h2 = b_h2 - Δb_h2   (bias update)
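Step 4 in NumPy, continuing from the snippets above:

    dw_h2 = eta * (a_h1.T @ Error_2)                  # (h1, h2)
    db_h2 = eta * Error_2.sum(axis=0, keepdims=True)  # (1, h2)

    w_h2 -= dw_h2                                     # weight update
    b_h2 -= db_h2                                     # bias update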

Step 5:

Error_3 = [ Error_2 • w_h2^T ] ✴ Φ′(a_h1)

shapes: (n,h1) = (n,h2) • (h2,h1) ✴ (n,h1)

Step 6:

Δw_h1 = η ( X^T • Error_3 )

shapes: (d,h1) = (d,n) • (n,h1)

Δb_h1 = η [ ∑_{i=1}^{n} (Error_3,i) ]

shapes: (1,h1) = (1,h1)

w_h1 = w_h1 - Δw_h1   (weight update)

b_h1 = b_h1 - Δb_h1   (bias update)
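Putting backward Steps 1 through 6 together, here is a self-contained NumPy sketch (sigmoid activations, placeholder sizes and learning rate) that runs one forward pass and one full backward pass. It computes all three error terms before applying any update, which is the usual convention; following the numbered steps literally (updating w_out in Step 2 and then reusing it in Step 3) also works but gives slightly different numbers.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    n, d, h1, h2, t, eta = 100, 4, 8, 6, 3, 0.1        # placeholder sizes and learning rate
    rng = np.random.default_rng(0)

    X = rng.standard_normal((n, d))
    y_onehot = np.eye(t)[rng.integers(0, t, size=n)]   # (n, t) placeholder labels

    w_h1,  b_h1  = rng.standard_normal((d,  h1)) * 0.01, np.zeros((1, h1))
    w_h2,  b_h2  = rng.standard_normal((h1, h2)) * 0.01, np.zeros((1, h2))
    w_out, b_out = rng.standard_normal((h2, t))  * 0.01, np.zeros((1, t))

    # forward pass (Steps 1-8 above)
    a_h1  = sigmoid(X @ w_h1 + b_h1)                   # (n, h1)
    a_h2  = sigmoid(a_h1 @ w_h2 + b_h2)                # (n, h2)
    a_out = sigmoid(a_h2 @ w_out + b_out)              # (n, t)

    # backward pass (Steps 1-6 above), all errors computed before the updates
    Error_out = a_out - y_onehot                             # (n, t)
    Error_2   = (Error_out @ w_out.T) * a_h2 * (1.0 - a_h2)  # (n, h2)
    Error_3   = (Error_2 @ w_h2.T)    * a_h1 * (1.0 - a_h1)  # (n, h1)

    w_out -= eta * (a_h2.T @ Error_out); b_out -= eta * Error_out.sum(0, keepdims=True)
    w_h2  -= eta * (a_h1.T @ Error_2);   b_h2  -= eta * Error_2.sum(0, keepdims=True)
    w_h1  -= eta * (X.T @ Error_3);      b_h1  -= eta * Error_3.sum(0, keepdims=True)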

For forward propagation, the dimensions of the output from the first hidden layer must be compatible with the input dimensions of the second hidden layer.

As mentioned above, your input has dimension (n,d). The output from hidden layer 1 will have dimension (n,h1). So the weights and bias for the second hidden layer must have shapes (h1,h2) and (1,h2) respectively.

That is, w_h2 will have dimension (h1,h2) and b_h2 will be (1,h2).

For the output layer, w_output will have dimension (h2,t) and b_output will be (1,t), where t is the number of output units.

You have to apply the same dimension matching when computing the gradients in backpropagation.
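A quick way to verify this dimension matching is to assert the shapes layer by layer; a small sketch with made-up sizes:

    import numpy as np

    n, d, h1, h2, t = 10, 4, 8, 6, 3      # made-up sizes

    w_h1, b_h1   = np.zeros((d,  h1)), np.zeros((1, h1))
    w_h2, b_h2   = np.zeros((h1, h2)), np.zeros((1, h2))
    w_out, b_out = np.zeros((h2, t)),  np.zeros((1, t))

    X = np.zeros((n, d))
    out1 = X @ w_h1 + b_h1;      assert out1.shape == (n, h1)
    out2 = out1 @ w_h2 + b_h2;   assert out2.shape == (n, h2)
    out  = out2 @ w_out + b_out; assert out.shape  == (n, t)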
