TensorFlow LSTM: Why does test accuracy become low, but not training accuracy?

Question

I have tried to build an LSTM model with TensorFlow. Training seems to work fine, reaching more than 90% accuracy. The problem that plagues me is that the test accuracy is very low. I thought this was due to overfitting, but attempts such as increasing the training batch or reducing element_size (from 10 to 5) were wasted effort, and applying dropout did not solve it either. I would like some direction on how to improve my code to get a high test accuracy. The following is a summary of my data and parameters:

The input variable is standardized economic time series data.
The output variable is a categorical feature (labels) converted by one-hot encoding.

Sequence_length : 20
Element_size: 5
Hidden_layer : 80
Categories (labels): 30 
Training batch : 924
Test batch : 164
Learning rate : 0.0005 (is it low?)

Here is the code I built:

# Split x_batch and y_batch into training and test sets
train_x,test_x=np.split(x_batch,[int(batch_size*0.85)])
train_y,test_y=np.split(y_batch,[int(batch_size*0.85)])
print('train_x shape: {0} and test_x shape: {1}'.format(train_x.shape,test_x.shape))
print('train_y shape: {0} and test_y shape: {1}'.format(train_y.shape,test_y.shape))

# Create placeholders for inputs and labels
inputs=tf.placeholder(tf.float32,shape=[None,step_time,element_size],name='inputs')
y=tf.placeholder(tf.float32,shape=[None,label_num],name='y')

# TensorFlow built-in functions
with tf.variable_scope('lstm'):
    lstm_cell=tf.contrib.rnn.LSTMCell(hidden_layer,forget_bias=1.0)
    cell_drop=tf.contrib.rnn.DropoutWrapper(lstm_cell, output_keep_prob=0.7)
    outputs,states=tf.nn.dynamic_rnn(cell_drop,inputs,dtype=tf.float32) 
    print('outputs shape: {0}'.format(outputs.shape))

W1={'linear_layer':tf.Variable(tf.truncated_normal([hidden_layer,label_num],mean=0,stddev=.01))}
b1={'linear_layer':tf.Variable(tf.truncated_normal([label_num],mean=0,stddev=.01))}

#Extract the last relevant output and use in a linear layer
final_output=tf.matmul(outputs[:,-1,:],W1['linear_layer'])+b1['linear_layer']

with tf.name_scope('cross_entropy'):
    softmax=tf.nn.softmax_cross_entropy_with_logits(logits=final_output,labels=y)
    cross_entropy=tf.reduce_mean(softmax)

with tf.name_scope('train'):
    train_step=tf.train.AdamOptimizer(learn_rate,0.9).minimize(cross_entropy)

with tf.name_scope('accracy'):
    correct_prediction=tf.equal(tf.argmax(y,1),tf.argmax(final_output,1))
    accuracy=(tf.reduce_mean(tf.cast(correct_prediction,tf.float32)))*100

#Training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())    
    for step in range(5000):
        sess.run(train_step,feed_dict={inputs:train_x,y:train_y})
        if step % 500 == 0:
            acc=sess.run(accuracy,feed_dict={inputs:train_x,y:train_y})
            loss=sess.run(cross_entropy,feed_dict={inputs:train_x,y:train_y})
            print('Inter'+str(step)+',Minibatch loss= '+'{:.6f}'.format(loss)+', Traning Accracy='+'{:.5f}'.format(acc))

# Test
    test_acc=sess.run(accuracy,feed_dict={inputs:test_x,y:test_y})
    print("Test Accuracy is {0}".format(test_acc))

And its result is:

Input Shape: (21760, 5)
Output Shape: (21760, 30)
x_batch shape: (1088, 20, 5)
y_batch shape: (1088, 30)
train_x shape: (924, 20, 5) and test_x shape: (164, 20, 5)
train_y shape: (924, 30) and test_y shape: (164, 30)
outputs shape: (?, 20, 80)
Inter0,Minibatch loss= 3.398923, Traning Accracy=5.30303
Inter500,Minibatch loss= 2.027734, Traning Accracy=38.09524
Inter1000,Minibatch loss= 1.340760, Traning Accracy=61.79654
Inter1500,Minibatch loss= 1.010518, Traning Accracy=72.83550
Inter2000,Minibatch loss= 0.743997, Traning Accracy=79.76190
Inter2500,Minibatch loss= 0.687736, Traning Accracy=79.76190
Inter3000,Minibatch loss= 0.475408, Traning Accracy=85.17316
Inter3500,Minibatch loss= 0.430477, Traning Accracy=87.22944
Inter4000,Minibatch loss= 0.359262, Traning Accracy=89.17749
Inter4500,Minibatch loss= 0.274463, Traning Accracy=90.69264
Test Accuracy is 4.878048419952393

I have never used TensorFlow or an LSTM model before, so this is my first time. I know I am doing something wrong but cannot put my finger on it.

Can someone help?

Answer 1

Before I go into more detail:
I am assuming that you are referring to batch_size when talking about element_size? If I am wrong in that assumption, please correct me here.

As the other answer mentioned, one potential reason could be overfitting, i.e. you are trying "too hard" with your training data. One general way to resolve this is to keep track of the performance on unseen data with held-back validation samples. That is, instead of splitting two ways (train/test), you have a third validation set (usually around the same size as the test data), and check every now and then during training how your model performs on this validation data.

A common observation is the following curve (image not reproduced here). As you can see, the model improves constantly on the training data, but it does so by sacrificing the ability to generalize to unseen data.

Generally, you try to stop training at the point where the error on the validation set is minimal, even if that does not give the best results on your training data. We expect the model to then perform best on the (completely unknown) test set.
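The stopping rule described above can be sketched roughly as follows. This is only an illustration: the function name `early_stopping_step`, the `patience` parameter, and the validation losses are made up for the example and do not come from the question.

```python
def early_stopping_step(val_losses, patience=3):
    """Return the index of the lowest validation loss seen so far, giving up
    once the loss has failed to improve for `patience` consecutive checks."""
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i          # new best checkpoint
        elif i - best_idx >= patience:
            break                 # no improvement for `patience` checks
    return best_idx

# Hypothetical validation losses, one per evaluation (made-up numbers).
val_losses = [3.4, 2.6, 2.1, 1.9, 2.0, 2.3, 2.7, 3.1]
best = early_stopping_step(val_losses)
print("best checkpoint index:", best)
```

In a real training loop, one would typically save the model at each evaluation and restore the checkpoint at the returned index.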

As a quick side note, if you are doing this in TensorFlow (which I am not 100% familiar with): generally, you have to "switch" your model from training to evaluation to get the actual results on your validation set (and not accidentally train on them as well), but you can find plenty of actual implementations of this online.
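One place where this "switch" matters is dropout. In the question's code the DropoutWrapper uses a fixed output_keep_prob=0.7, so units are presumably still dropped when the test accuracy is computed. The NumPy sketch below only illustrates the idea of a training flag; the `dropout` helper is hypothetical and is not TensorFlow's implementation. In TensorFlow 1.x a common approach is to feed the keep probability through a placeholder and set it to 1.0 for evaluation.

```python
import numpy as np

def dropout(x, keep_prob, training, rng):
    """Inverted dropout: random masking while training, identity at evaluation."""
    if not training:
        return x
    mask = rng.random(x.shape) < keep_prob   # keep each unit with prob keep_prob
    return x * mask / keep_prob              # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
h = np.ones((4, 5))                          # stand-in for LSTM outputs
train_out = dropout(h, keep_prob=0.7, training=True, rng=rng)
eval_out = dropout(h, keep_prob=0.7, training=False, rng=rng)
print(train_out)
print(eval_out)
```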

Furthermore, overfitting might be an issue if you have too many neurons! In your case, you have only 800 examples, but already 80 neurons, which is IMO a ratio that is way too high. You could try using fewer neurons and see if that improves the accuracy on your test set, even if that might reduce the accuracy on the training data, too.
In the end, you want to have a compact descriptor of your problem, and not a network that "learns" to recognize every single one of your training instances.
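To make the ratio concrete, here is a rough count of trainable parameters for the architecture in the question. It assumes a standard LSTM cell without peepholes or a projection layer (four gates, each with weights over the concatenated input and hidden state plus a bias), followed by the linear layer. The helper names are made up for this sketch.

```python
def lstm_param_count(input_size, hidden_size):
    # Four gates, each with a weight matrix over [input, hidden] and a bias.
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def dense_param_count(in_features, out_features):
    # Weight matrix plus bias of the final linear layer.
    return in_features * out_features + out_features

element_size, label_num = 5, 30
for hidden in (80, 10, 5):
    total = lstm_param_count(element_size, hidden) + dense_param_count(hidden, label_num)
    print("hidden units:", hidden, "trainable parameters:", total)
```

Under these assumptions, 80 hidden units give 29,950 parameters against 924 training sequences, while 10 hidden units give 970.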

Furthermore, if you actually do work with mini-batches, you could try to reduce the batch size even further. I really like this one tweet from Yann LeCun, so I will just post it here, too ;-)
Joking aside, training with smaller batches can lead to better generalization as well, as absurd as it sounds. Large batches are generally only really helpful if you have a massive training set, or are training on a GPU (since copying data between the GPU and main memory is very costly, and batching reduces the number of such operations), or if you need a long time to reach convergence.
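The question's training loop feeds all 924 training sequences in every step. A minimal sketch of iterating over shuffled mini-batches instead might look like this; the generator name, the batch size of 32, and the zero-filled arrays are assumptions for illustration only.

```python
import numpy as np

def iterate_minibatches(x, y, batch_size, rng):
    """Yield shuffled (x, y) mini-batches that together cover the data once."""
    idx = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        sel = idx[start:start + batch_size]
        yield x[sel], y[sel]

rng = np.random.default_rng(0)
# Placeholder arrays with the shapes reported in the question.
x = np.zeros((924, 20, 5), dtype=np.float32)
y = np.zeros((924, 30), dtype=np.float32)
batches = list(iterate_minibatches(x, y, 32, rng))
print(len(batches), batches[0][0].shape, batches[-1][0].shape)
```

Each yielded pair could then be passed as the feed_dict for one training step.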

Since you are using an LSTM architecture (which, due to its sequential nature, performs similarly on CPUs and GPUs, since there is not much to parallelize), a large batch size will likely not improve your computational performance, but smaller batches might improve the accuracy.

Lastly, and this is why I commented on the other answer initially, we might be completely off with this explanation, and the cause could be something totally different after all.

What many people tend to forget is to do some initial exploratory analysis on the train/test split. If your test set contains only representatives of one class, of which there are barely any in your training data, the results will likely not be good. Similarly, if you only train on 29 of your 30 classes, it will be hard for the network to recognize any sample of the 30th class.

To avoid this, make sure you have a somewhat even split (i.e. sample a certain number of examples for each class in both the test and training sets), and check whether the classes are somewhat evenly distributed.
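A per-class split like the one described above could be sketched as follows. This is a hypothetical helper operating on one-hot labels, and the synthetic labels (30 classes with 20 examples each) are made up. Note that shuffling within each class is an assumption that may not suit every time series setup, for example when neighbouring windows overlap.

```python
import numpy as np

def stratified_split(y_onehot, test_fraction, rng):
    """Split indices so each class contributes about test_fraction to the test set."""
    labels = np.argmax(y_onehot, axis=1)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        members = rng.permutation(np.flatnonzero(labels == c))
        n_test = int(round(len(members) * test_fraction))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return np.array(train_idx), np.array(test_idx)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(30), 20)   # synthetic: 30 classes, 20 examples each
y = np.eye(30)[labels]                  # one-hot encoding
train_idx, test_idx = stratified_split(y, 0.15, rng)
print(len(train_idx), len(test_idx))
```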

Doing so might save you a surprising amount of pain later, and generally helps to improve performance on completely new data as well. Always remember: deep learning doesn't magically solve all the problems you have in predictive analysis, it just gives you a very powerful tool to tackle a specific sub-problem.

Answer 2

I seem to have arrived at an answer thanks to dennlinger's informative answer. First of all, I divided the training data into six sets (x_1, x_2, ..., x_6 and y_1, y_2, ..., y_6), each around the same size as the test data. I'm not sure whether this counts as the third validation set you mentioned, but I tried applying it. What's more, I checked which classes each set does not contain. For example, y_1 does not contain classes No. 11, 16, 21, 22, and 25:

train_y
[]
y_1
[11, 16, 21, 22, 25]
y_2
[11, 14, 16, 23]
y_3
[11, 19, 21, 23]
y_4
[14, 21, 23]
y_5
[16, 21, 22, 23]
y_6
[11, 21, 22, 23]
test_y
[11, 21, 22, 23]
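A check like the one listed above can be done in a few lines on one-hot labels. This is only a sketch: the `missing_classes` helper is hypothetical, it reports zero-based class indices (the numbering used in the listing above is not stated), and the demo labels are made up.

```python
import numpy as np

def missing_classes(y_onehot):
    """Return the indices of classes that never occur in a one-hot label array."""
    counts = y_onehot.sum(axis=0)            # number of examples per class
    return np.flatnonzero(counts == 0).tolist()

# Made-up example: 5 classes, only classes 0, 2 and 4 are present.
y_demo = np.eye(5)[[0, 0, 2, 4]]
print(missing_classes(y_demo))
```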

The first examination (validation) was to train on the x_1/y_1 set and compute the accuracy on the test data. Although I stopped training at each of the steps below, the performance did not improve; the result was almost the same.

Stop at step 1000
Inter500,Minibatch loss= 1.976426, Traning Accracy=46.01227
Test Accuracy is 7.317072868347168
Stop at step1500
Inter1000,Minibatch loss= 1.098709, Traning Accracy=66.25767
Test Accuracy is 4.2682929039001465
Stop at step 2000
Inter1500,Minibatch loss= 0.906059, Traning Accracy=74.23312
Test Accuracy is 6.097560882568359
Stop at step 2500
Inter2000,Minibatch loss= 0.946361, Traning Accracy=76.07362
Test Accuracy is 6.707317352294922

Next, I tried to examine the performance on a few combinations. The results are below:

Train on x_6/y_6 sets and test on test data
Inter2500,Minibatch loss= 0.752621, Traning Accracy=79.77941
Test Accuracy is 78.65853881835938

Train on x_6/y_6 sets and test on x_5/y_5 sets
Inter2500,Minibatch loss= 0.772954, Traning Accracy=78.67647
Test Accuracy is 3.658536434173584

Train on training data and test on x_4/y_4 sets
Inter3000,Minibatch loss= 1.980538, Traning Accracy=41.01731
Test Accuracy is 37.42331314086914

Interestingly, the combination that was trained on the x_6/y_6 set and tested on the test data performed much better than the previous ones: the test accuracy increased to about 78 percent. I assume this is because the classes are identical, meaning y_6 and the test data are missing exactly the same classes (see above), and the two sets have the same size. So, this shows that I have to consider which data sets are suitable and try to validate the LSTM model under various conditions, which is very important.

On the other hand, the other changes, decreasing the number of neurons (from 80 to 10 or 5) and the batch size, didn't improve the performance at all.

Answer 3

If the training accuracy continues to go up but the test accuracy goes down, then you are overfitting. Try running fewer epochs or using a lower learning rate.

Answer 1
1 Accepted 2018-08-20 07:49:37

Answer 2
1 2018-08-28 03:13:12

Answer 3
0
