
Why does recognition rate drop after multiple online training epochs?

I am using tensorflow to do image recognition on the MNIST dataset. In each training epoch, I picked 10,000 random images and conducted online training with a batch size of 1. The recognition rate increased for the first few epochs; however, after several epochs it started to drop sharply. (In the first 20 epochs, the recognition rate went up to ~94%. Afterwards, it went from 90% -> 50% -> 40% -> 30% -> 20%.) What is the reason for this?

Also, with a batch size of 1, the performance is worse than when using a batch size of 100 (max recognition rate of 94% vs. 96%). I looked through several references, but there seem to be contradictory results on whether small or large batch sizes achieve better performance. Which would be the case in this situation?

Edit: I also added a figure of the recognition rate of the training dataset and the test dataset: Recognition rate vs. epoch

I have attached a copy of the code below. Thanks for the help!

import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot = True)

#parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 10
batch_size = 1
x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')

#model of neural network
def neural_network_model(data):
    hidden_1_layer = {'weights':tf.Variable(tf.random_normal([784, n_nodes_hl1])               , name='l1_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl1])                    , name='l1_b')}

    hidden_2_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])       , name='l2_w'),
                      'biases' :tf.Variable(tf.random_normal([n_nodes_hl2])                    , name='l2_b')}

    hidden_3_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3])       , name='l3_w'),
                      'biases' :tf.Variable(tf.random_normal([n_nodes_hl3])                    , name='l3_b')}

    output_layer   = {'weights':tf.Variable(tf.random_normal([n_nodes_hl3, n_classes])     , name='lo_w'),
                      'biases' :tf.Variable(tf.random_normal([n_classes])                   , name='lo_b')}

    l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases'])
    l1 = tf.nn.relu(l1)
    l2 = tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['biases'])
    l2 = tf.nn.relu(l2)
    l3 = tf.add(tf.matmul(l2, hidden_3_layer['weights']), hidden_3_layer['biases'])
    l3 = tf.nn.relu(l3)
    output = tf.matmul(l3, output_layer['weights']) + output_layer['biases']
    return output

#train neural network
def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epoches = 100

    # define the accuracy ops once, outside the training loop, so new graph
    # nodes are not added on every epoch
    correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epoches):
            epoch_loss = 0
            # 10,000 online updates per epoch (batch_size = 1)
            for batch in range(10000):
                epoch_x, epoch_y = mnist.train.next_batch(batch_size)
                _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                epoch_loss += c
            print(epoch_loss)
            print('Accuracy_test:', accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))
            print('Accuracy_train:', accuracy.eval({x: mnist.train.images, y: mnist.train.labels}))

train_neural_network(x)

DROPPING ACCURACY

You're over-fitting. This is when the model learns spurious features that are specific to artifacts of the images in the training data, at the expense of the important features. One of the main experimental results of any application is determining the optimal number of training iterations.

For instance, perhaps 80% of the 7's in your training data happen to have a little extra slant to the right near the bottom of the stem, where 4's and 1's do not. After too much training, your model "decides" that the best way to tell a 7 from another digit is that extra slant, despite any other features. As a result, some 1's and 4's now get classed as 7's.

BATCH SIZE

Again, the best batch size is one of the experimental results. Typically, a batch size of 1 is too small: this gives the first few input images too much influence on the early weights in kernel or perceptron training. This is a minor case of over-fitting: one item having undue influence on the model. However, it's significant enough to alter your best results by 2%.

You need to balance the batch size with the other hyper-parameters to find the model's "sweet spot": optimum performance combined with the shortest training time. In my experience, it's been best to increase the batch size until the time per image degraded. The models I've used most (MNIST, CIFAR-10, AlexNet, GoogleNet, ResNet, VGG, etc.) had very little loss of accuracy once we reached a rather minimal batch size; from there, the training speed was usually a matter of choosing the batch size that best used the available RAM.
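As a rough illustration of that kind of sweep, the sketch below times a fixed number of updates at several batch sizes. It assumes the same graph, session, and placeholders (sess, optimizer, cost, x, y, mnist) as in the question's code; the batch sizes and step count are arbitrary choices for illustration, not recommendations.

import time

# hypothetical sweep: time 100 updates at each batch size and report cost per image
for bs in [1, 10, 100, 1000]:
    start = time.time()
    for _ in range(100):
        bx, by = mnist.train.next_batch(bs)
        sess.run([optimizer, cost], feed_dict={x: bx, y: by})
    per_image = (time.time() - start) / (100 * bs)
    print('batch size %4d: %.6f s per image' % (bs, per_image))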

There are a few possibilities, although you'll need to do some experimentation to find out which it is.

Overfitting

Prune did a good job of explaining this. I'll add that the simplest way to avoid overfitting is to just remove 10-15% of the training set and evaluate the recognition rate on this held-out validation set after every few epochs. If you graph the change in recognition rate on both the training and validation sets, you'll eventually reach a point on the graph where the training error keeps going down but the validation error starts going up. Stop training at that point; that's where overfitting is starting in earnest. Note that it's important that there be no overlap between the training/validation/test sets.
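A minimal early-stopping sketch of that idea, assuming it runs inside the same tf.Session block and reuses the accuracy op and placeholders from the question's code; the MNIST reader used there already holds out a validation split as mnist.validation, and the patience value of 3 is an arbitrary illustrative choice.

best_val_acc = 0.0
epochs_without_improvement = 0
patience = 3  # arbitrary illustrative value

for epoch in range(hm_epoches):
    # ... run one epoch of training here, as in the original loop ...
    val_acc = accuracy.eval({x: mnist.validation.images,
                             y: mnist.validation.labels})
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print('Stopping early at epoch', epoch)
            break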

This was more likely before you mentioned that the training error wasn't also decreasing, but it's possible that it's overfitting on a fairly homogeneous part of your training set at the expense of the outliers, or something like this. Try randomizing the order of your training set after each epoch; if it's fitting one section of the set at the expense of the others, this might help.
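A sketch of that re-shuffling, using only numpy and the arrays already loaded by the question's code; it assumes it replaces the inner loop inside the existing session (sess, optimizer, cost, x, y, batch_size, hm_epoches as defined there).

num_train = mnist.train.images.shape[0]
for epoch in range(hm_epoches):
    order = np.random.permutation(num_train)      # new random order each epoch
    images = mnist.train.images[order]
    labels = mnist.train.labels[order]
    for start in range(0, 10000, batch_size):     # same 10,000 samples per epoch
        epoch_x = images[start:start + batch_size]
        epoch_y = labels[start:start + batch_size]
        _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})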

Addendum: The massive instantaneous drop in quality around epoch 20 makes this even less likely; that is not what overfitting looks like.

Numerical Instability

If you get a particularly incorrect input at a point on the activation function with a large gradient, it's possible to end up with a gigantic weight update that screws up everything it's learned thus far. It's common to put a hard limit on the gradient magnitude for this reason. But you're using AdamOptimizer, which has an epsilon parameter for avoiding instability. I haven't read the paper it references, so I don't know exactly how it works, but the fact that it's there makes instability less likely.
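If you want to try that hard limit anyway, a minimal gradient-clipping sketch with the TF 1.x API might look like this; the clip norm of 5.0 is an arbitrary illustrative value, and `cost` is the loss op from the question's code.

optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(cost)
clipped = [(tf.clip_by_norm(g, 5.0), v)            # cap each gradient's norm
           for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)
# then run `train_op` in sess.run(...) in place of the original minimize() op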

Saturated Neurons

Some activation functions have regions with very small gradients, so if you end up with weights such that the function is almost always in that region, you have a tiny gradient and thus can't learn effectively. Sigmoids and Tanh are particularly prone to this since they have flat regions on both sides of the function. ReLUs don't have a flat region on the high end, but do on the low end. Try replacing your activation functions with Softplus; those are similar to ReLU, but with a continuous nonzero gradient.
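In the question's model that swap is just a change of the activation call; a sketch for the first hidden layer (the other layers follow the same pattern):

l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases'])
l1 = tf.nn.softplus(l1)   # instead of tf.nn.relu(l1); same for l2 and l3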
