Dropout: why is my neural network not working?

I am using the normalized MNIST dataset (input features = 784). My network architecture is 784-256-256-10: two hidden layers of 256 neurons each, both using the sigmoid activation function, and a 10-neuron output layer using softmax activation. I am also using the cross-entropy cost function.
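
(For reference, the code below assumes numpy, random.shuffle and a few helper functions that are not shown in the post. A minimal sketch of standard definitions that would be consistent with the rest of the code:)

import numpy as np
from random import shuffle

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

def softmax_prime(z):
    # element-wise derivative (the diagonal of the softmax Jacobian),
    # matching how it is multiplied element-wise in the backprop code below
    s = softmax(z)
    return s * (1.0 - s)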

Weight matrix initialization:

input_size=784
hidden1_size=256
hidden2_size=256
output_size=10
Theta1 = np.random.randn(hidden1_size, input_size)
b1 = np.random.randn(hidden1_size)

Theta2 = np.random.randn(hidden2_size, hidden1_size)
b2 = np.random.randn(hidden2_size)

Theta3 = np.random.randn(output_size, hidden2_size)
b3 = np.random.randn(output_size)

Without dropout, my network works as expected:

epochs = 2000
learning_rate = 0.01
for j in range(epochs):
    # total_train is an array of length 50000
    # Each element of total_train is a tuple of: (a) input vector of length 784
    # and (b) the corresponding one-hot encoded label of length 10
    # Similarly, total_test is an array of length 10000
    shuffle(total_train)
    train = total_train[:1000]
    shuffle(total_test)
    test = total_test[:1000]
    predictions = []
    test_predictions = []
    for i in range(len(train)):
        # Feed forward
        x, t = train[i][0], train[i][1]
        z1 = np.dot(Theta1, x) + b1
        a1 = sigmoid(z1)
        z2 = np.dot(Theta2, a1) + b2
        a2 = sigmoid(z2)
        z3 = np.dot(Theta3, a2) + b3
        y = softmax(z3)
        # Is prediction == target?
        predictions.append(np.argmax(y) == np.argmax(t))

        # Negative log probability cost function
        cost = -t * np.log(y)

        # Backpropagation
        delta3 = (y - t) * softmax_prime(z3)
        dTheta3 = np.outer(delta3, a2)
        db3 = delta3

        delta2 = np.dot(Theta3.T, delta3) * sigmoid_prime(z2)
        dTheta2 = np.outer(delta2, a1)
        db2 = delta2

        delta1 = np.dot(Theta2.T, delta2) * sigmoid_prime(z1)
        dTheta1 = np.outer(delta1, x)
        db1 = delta1

        # Update weights
        Theta1 -= learning_rate * dTheta1
        b1 -= learning_rate * db1
        Theta2 -= learning_rate * dTheta2
        b2 -= learning_rate * db2
        Theta3 -= learning_rate * dTheta3
        b3 -= learning_rate * db3

    if j % 10 == 0:
        m = len(predictions)
        performance = sum(predictions)/m
        print('Epoch:', j, 'Train performance:', performance)

    # Test accuracy on test data
    for i in range(len(test)):
        # Feed forward
        x, t = test[i][0], test[i][1]
        z1 = np.dot(Theta1, x) + b1
        a1 = sigmoid(z1)
        z2 = np.dot(Theta2, a1) + b2
        a2 = sigmoid(z2)
        z3 = np.dot(Theta3, a2) + b3
        y = softmax(z3)
        # Is prediction == target?
        test_predictions.append(np.argmax(y) == np.argmax(t))

    m = len(test_predictions)
    performance = sum(test_predictions)/m
    print('Epoch:', j, 'Test performance:', performance)

Output (every 10 epochs):

Epoch: 0 Train performance: 0.121
Epoch: 0 Test performance: 0.146
Epoch: 10 Train performance: 0.37
Epoch: 10 Test performance: 0.359
Epoch: 20 Train performance: 0.41
Epoch: 20 Test performance: 0.433
Epoch: 30 Train performance: 0.534
Epoch: 30 Test performance: 0.52
Epoch: 40 Train performance: 0.607
Epoch: 40 Test performance: 0.601
Epoch: 50 Train performance: 0.651
Epoch: 50 Test performance: 0.669
Epoch: 60 Train performance: 0.71
Epoch: 60 Test performance: 0.711
Epoch: 70 Train performance: 0.719
Epoch: 70 Test performance: 0.694
Epoch: 80 Train performance: 0.75
Epoch: 80 Test performance: 0.752
Epoch: 90 Train performance: 0.76
Epoch: 90 Test performance: 0.758
Epoch: 100 Train performance: 0.766
Epoch: 100 Test performance: 0.769

But when I introduce the dropout regularization scheme, my network breaks. My code, updated for dropout, is:

dropout_prob = 0.5

# Feed forward
x, t = train[i][0], train[i][1]
z1 = np.dot(Theta1, x) + b1
a1 = sigmoid(z1)
mask1 = np.random.random(len(z1))
mask1 = mask1 < dropout_prob
a1 *= mask1
z2 = np.dot(Theta2, a1) + b2
a2 = sigmoid(z2)
mask2 = np.random.random(len(z2))
mask2 = mask2 < dropout_prob
a2 *= mask2
z3 = np.dot(Theta3, a2) + b3
y = softmax(z3)

# Backpropagation
delta3 = (y - t) * softmax_prime(z3)
dTheta3 = np.outer(delta3, a2)
db3 = delta3 * 1

delta2 = np.dot(Theta3.T, delta3) * sigmoid_prime(z2)
dTheta2 = np.outer(delta2, a1)
db2 = delta2 * 1

delta1 = np.dot(Theta2.T, delta2) * sigmoid_prime(z1)
dTheta1 = np.outer(delta1, x)
db1 = delta1 * 1

Performance stays at around 0.1 (10%).

Any pointers on where I am going wrong would be greatly appreciated.

There is one major problem with your dropout implementation: you are not scaling the activations at test time. Here is a quote from the great CS231n tutorial:

Crucially, note that in the predict function we are not dropping anymore, but we are performing a scaling of both hidden layer outputs by p.

This is important because at test time all neurons see all their inputs, so we want the outputs of neurons at test time to be identical to their expected outputs at training time. For example, in the case of p = 0.5, the neurons must halve their outputs at test time to have the same output as they had during training time (in expectation).

To see this, consider the output x of a neuron (before dropout). With dropout, the expected output from this neuron becomes p·x + (1−p)·0, because the neuron's output is set to zero with probability 1−p. At test time, when we keep the neuron always active, we must adjust x → p·x to keep the same expected output.
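
(To make the expectation argument concrete, here is a quick numerical check that is not part of the original answer: with keep probability p, the average output of a dropped-out unit over many training passes approaches p·x, which is exactly what the test-time scaling x → p·x reproduces.)

import numpy as np

p = 0.5          # keep probability
x = 0.8          # a unit's output before dropout
n = 100000       # simulated training-time passes

masks = np.random.random(n) < p      # keep the unit with probability p
print((masks * x).mean())            # ~0.4, i.e. approximately p * x
print(p * x)                         # 0.4: the test-time scaled output x -> p*x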

It can also be shown that performing this attenuation at test time is related to iterating over all possible binary masks (and therefore all of the exponentially many sub-networks) and computing their ensemble prediction.

The most common solution is to use inverted dropout, which performs the scaling at train time and leaves the forward pass at test time untouched. In your code it would look like this:

mask1 = (mask1 < dropout_prob) / dropout_prob
...
mask2 = (mask2 < dropout_prob) / dropout_prob
...
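
(Applied to the training-time forward pass from the question, the change might look like the sketch below, keeping the original variable names; here dropout_prob plays the role of the keep probability p, and the test-time forward pass stays exactly as in the working version without dropout.)

dropout_prob = 0.5   # acts as the keep probability p

# Feed forward (training time, with inverted dropout)
x, t = train[i][0], train[i][1]
z1 = np.dot(Theta1, x) + b1
a1 = sigmoid(z1)
mask1 = (np.random.random(len(z1)) < dropout_prob) / dropout_prob   # kept units scaled by 1/p
a1 *= mask1
z2 = np.dot(Theta2, a1) + b2
a2 = sigmoid(z2)
mask2 = (np.random.random(len(z2)) < dropout_prob) / dropout_prob
a2 *= mask2
z3 = np.dot(Theta3, a2) + b3
y = softmax(z3)

# At test time no masks and no scaling are applied: the test forward pass
# stays exactly as in the version without dropout.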
