How to update the learning rate in a two layered multi-layered perceptron?

Given the XOR problem:

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

And a simple

  • two layered Multi-Layered Perceptron (MLP) with
  • sigmoid activations between them and
  • Mean Square Error (MSE) as the loss function/optimization criterion (written out as formulas right after this list)
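
For reference, the forward pass and cost that this list describes can be written as follows (a conventional formulation that matches the code below, with inputs as row vectors):

    h = \sigma(X W_1), \qquad \hat{Y} = \sigma(h W_2), \qquad L_{\text{MSE}} = \sum_i (\hat{y}_i - y_i)^2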

[code]:

def sigmoid(x): # Squashes values into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx): # For backpropagation.
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def mse(predicted, truth):
    return np.sum(np.square(truth - predicted))

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layers and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector. 
output_dim = len(Y.T)
# Initialize weights between the hidden layers and the output layer.
W2 = np.random.random((hidden_dim, output_dim))

And given the stopping criterion as a fixed number of epochs (the number of iterations through X and Y) with a fixed learning rate of 0.3:

# Training hyperparameters.
num_epochs = 10000
learning_rate = 0.3

When I run through the forward-backward propagation and update the weights in each epoch, how should I update the weights?

I tried simply adding the learning rate times the dot product of the backpropagated derivative with the layer outputs, but the model still only updated the weights in one direction, causing all the weights to decay to near zero.
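
(For context, the update rule I am trying to implement is, as I understand it, plain gradient descent on the MSE cost:

    W \leftarrow W - \eta \, \frac{\partial L}{\partial W}

with a fixed learning rate \eta = 0.3.)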

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # How much did we miss in the predictions?
    layer2_error = mse(layer2, Y)

    #print(layer2_error)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_delta = layer2_error * sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
    #print(np.dot(layer0.T, layer1_delta))
    #print(epoch_n, list((layer2)))

    # Log the loss value as we proceed through the epochs.
    losses.append(layer2_error.mean())

How should the weights be updated correctly?

Full code:

from itertools import chain
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)

def sigmoid(x): # Squashes values into the range (0, 1).
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sx):
    # See https://math.stackexchange.com/a/1225116
    return sx * (1 - sx)

# Cost functions.
def mse(predicted, truth):
    return np.sum(np.square(truth - predicted))

X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([[0,1,1,0]]).T

# Define the shape of the weight vector.
num_data, input_dim = X.shape
# Let's set the dimensions for the intermediate layer.
hidden_dim = 5
# Initialize weights between the input layers and the hidden layer.
W1 = np.random.random((input_dim, hidden_dim))

# Define the shape of the output vector. 
output_dim = len(Y.T)
# Initialize weights between the hidden layers and the output layer.
W2 = np.random.random((hidden_dim, output_dim))

# Training hyperparameters.
num_epochs = 10000
learning_rate = 0.3

losses = []

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # How much did we miss in the predictions?
    layer2_error = mse(layer2, Y)

    #print(layer2_error)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_delta = layer2_error * sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)
    #print(np.dot(layer0.T, layer1_delta))
    #print(epoch_n, list((layer2)))

    # Log the loss value as we proceed through the epochs.
    losses.append(layer2_error.mean())

# Visualize the losses
plt.plot(losses)
plt.show()

Am I missing anything in the backpropagation?

Maybe I missed out the derivative from the cost to the second layer?
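
For reference, my understanding of the output-layer term (assuming the squared-error cost above) is

    \delta_2 = \frac{\partial L}{\partial \hat{Y}} \odot \sigma'(z_2), \qquad \frac{\partial L}{\partial W_2} = \text{layer1}^\top \delta_2

where z_2 = \text{layer1} \cdot W_2 and \sigma'(z_2) = \text{layer2}\,(1 - \text{layer2}), so the \partial L / \partial \hat{Y} factor is the piece that seems to be missing in the loop above.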


Edited

I realized I missed the partial derivative from the cost to the second layer, and after adding it:

# Cost functions.
def mse(predicted, truth):
    return 0.5 * np.sum(np.square(predicted - truth)).mean()

def mse_derivative(predicted, truth):
    return predicted - truth
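
(The derivative follows from differentiating the summed form of the cost:

    \frac{\partial}{\partial p_i} \Big[ \tfrac{1}{2} \sum_j (p_j - t_j)^2 \Big] = p_i - t_i

which is exactly what mse_derivative returns.)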

With the updated backpropagation loop across epochs:

for epoch_n in range(num_epochs):
    layer0 = X
    # Forward propagation.

    # Inside the perceptron, Step 2. 
    layer1 = sigmoid(np.dot(layer0, W1))
    layer2 = sigmoid(np.dot(layer1, W2))

    # Back propagation (Y -> layer2)

    # How much did we miss in the predictions?
    cost_error = mse(layer2, Y)
    cost_delta = mse_derivative(layer2, Y)

    #print(layer2_error)
    # In what direction is the target value?
    # Were we really close? If so, don't change too much.
    layer2_error = np.dot(cost_delta, cost_error)
    layer2_delta = layer2_error *  sigmoid_derivative(layer2)

    # Back propagation (layer2 -> layer1)
    # How much did each layer1 value contribute to the layer2 error (according to the weights)?
    layer1_error = np.dot(layer2_delta, W2.T)
    layer1_delta = layer1_error * sigmoid_derivative(layer1)

    # update weights
    W2 += - learning_rate * np.dot(layer1.T, layer2_delta)
    W1 += - learning_rate * np.dot(layer0.T, layer1_delta)

It seemed to train and learn the XOR...

But now the question arises: are layer2_error and layer2_delta computed correctly, i.e. is the following part of the code correct?

# How much did we miss in the predictions?
cost_error = mse(layer2, Y)
cost_delta = mse_derivative(layer2, Y)

#print(layer2_error)
# In what direction is the target value?
# Were we really close? If so, don't change too much.
layer2_error = np.dot(cost_delta, cost_error)
layer2_delta = layer2_error *  sigmoid_derivative(layer2)

Is it correct to take the dot product of cost_delta and cost_error for layer2_error? Or would layer2_error just be equal to cost_delta?

I.e.:

# How much did we miss in the predictions?
cost_error = mse(layer2, Y)
cost_delta = mse_derivative(layer2, Y)

#print(layer2_error)
# In what direction is the target value?
# Were we really close? If so, don't change too much.
layer2_error = cost_delta
layer2_delta = layer2_error *  sigmoid_derivative(layer2)

Answer:

Yes, it is correct to multiply the residuals (cost_error) with the delta values when we update the weights.

However, it doesn't really matter whether you do the dot product or not, since cost_error is a scalar, so a simple multiplication is enough. But we definitely have to multiply by the gradient of the cost function, because that's where we start our backprop (i.e. it's the entry point for the backward pass).
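
To make that concrete, here is a minimal sketch (an illustration with stand-in data reusing the names from the question, not the original training code) showing that the two variants only differ by that scalar factor:

import numpy as np

np.random.seed(0)
layer2 = np.random.random((4, 1))        # stand-in for the network output
Y = np.array([[0, 1, 1, 0]]).T           # XOR targets

def sigmoid_derivative(sx):
    return sx * (1 - sx)

cost_error = 0.5 * np.sum(np.square(layer2 - Y))   # scalar loss value
cost_delta = layer2 - Y                            # gradient of the cost w.r.t. layer2

# Variant 1: also multiply by the scalar loss value.
layer2_delta_scaled = (cost_delta * cost_error) * sigmoid_derivative(layer2)
# Variant 2: use the cost gradient alone as layer2_error.
layer2_delta_plain = cost_delta * sigmoid_derivative(layer2)

# The two deltas differ only by the scalar cost_error, so the update
# direction is identical and only the effective step size changes.
print(np.allclose(layer2_delta_scaled, cost_error * layer2_delta_plain))  # True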

Also, the function below can be simplified:

def mse(predicted, truth):
    return 0.5 * np.sum(np.square(predicted - truth)).mean()

as

def mse(predicted, truth):
    return 0.5 * np.mean(np.square(predicted - truth))
