Neural networks: the mysterious ReLu

Question

I've been building a programming language detector, i.e., a classifier of code snippets, as part of a bigger project. My baseline model is pretty straightforward: tokenize the input, encode the snippets as bag-of-words (or, in this case, bag-of-tokens), and build a simple NN on top of these features.

The input to the NN is a fixed-length array of counters of the most distinctive tokens, such as "def", "self", "function", "->", "const", "#include", etc., which are automatically extracted from the corpus. The idea is that these tokens are pretty unique to programming languages, so even this naive approach should get a high accuracy score.

Input:
  def   1
  for   2
  in    2
  True  1
  ):    3
  ,:    1

  ...

Output: python
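The encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the regex tokenizer, the tiny VOCAB list, and the encode helper are assumptions made for the example, not the tokenizer or vocabulary from the original script.

```python
import re
from collections import Counter

# Hypothetical vocabulary of distinctive tokens (the real one is extracted
# automatically from the corpus and is much larger).
VOCAB = ["def", "self", "function", "->", "const", "#include", "for", "in"]

# Assumed tokenizer: identifier-like words (optionally starting with '#')
# or runs of punctuation.
TOKEN_RE = re.compile(r"[A-Za-z_#]\w*|[^\w\s]+")

def encode(snippet):
    """Return a fixed-length list of token counts, one slot per VOCAB entry."""
    counts = Counter(TOKEN_RE.findall(snippet))
    return [counts[token] for token in VOCAB]

snippet = "def f(xs):\n    for x in xs:\n        return True"
vector = encode(snippet)
print(vector)  # [1, 0, 0, 0, 0, 0, 1, 1]
```

Every snippet maps to a vector of the same length, whatever its size, which is what lets a dense layer consume it directly.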

Setup

I got 99% accuracy pretty quickly and decided that was the sign that it works just as expected. Here's the model (a full runnable script is here):

# Placeholders
x = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name='x')
y = tf.placeholder(shape=[None], dtype=tf.int32, name='y')
training = tf.placeholder_with_default(False, shape=[], name='training')

# One hidden layer with dropout
reg = tf.contrib.layers.l2_regularizer(0.01)
hidden1 = tf.layers.dense(x, units=96, kernel_regularizer=reg, 
                          activation=tf.nn.elu, name='hidden1')
dropout1 = tf.layers.dropout(hidden1, rate=0.2, training=training, name='dropout1')

# Output layer
logits = tf.layers.dense(dropout1, units=classes, kernel_regularizer=reg,
                         activation=tf.nn.relu, name='logits')

# Cross-entropy loss
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))

# Misc reports: accuracy, correct/misclassified samples, etc.
correct_predicted = tf.nn.in_top_k(logits, y, 1, name='in-top-k')
prediction = tf.argmax(logits, axis=1)
wrong_predicted = tf.logical_not(correct_predicted, name='not-in-top-k')
x_misclassified = tf.boolean_mask(x, wrong_predicted, name='misclassified')
accuracy = tf.reduce_mean(tf.cast(correct_predicted, tf.float32), name='accuracy')

The output is pretty encouraging:

iteration=5  loss=2.580  train-acc=0.34277
iteration=10  loss=2.029  train-acc=0.69434
iteration=15  loss=2.054  train-acc=0.92383
iteration=20  loss=1.934  train-acc=0.98926
iteration=25  loss=1.942  train-acc=0.99609
Files.VAL mean accuracy = 0.99121             <-- After just 1 epoch!

iteration=30  loss=1.943  train-acc=0.99414
iteration=35  loss=1.947  train-acc=0.99512
iteration=40  loss=1.946  train-acc=0.99707
iteration=45  loss=1.946  train-acc=0.99609
iteration=50  loss=1.944  train-acc=0.99902
iteration=55  loss=1.946  train-acc=0.99902
Files.VAL mean accuracy = 0.99414

Test accuracy was also around 1.0. Everything looked perfect.

Mysterious ReLu

But then I noticed that I had put activation=tf.nn.relu into the final dense layer (logits), which is clearly a bug: there is no need to discard negative scores before softmax, because they indicate the classes with low probability. A zero threshold will only make these classes artificially more probable, which would be a mistake. Getting rid of it should only make the model more robust and confident in the correct class.
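To make that intuition concrete, here is a small sketch in plain Python (not the TensorFlow model itself) that compares softmax over raw logits with softmax over ReLu-clamped logits. The logit values are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relu(logits):
    """Clamp negative scores to zero, as activation=tf.nn.relu would."""
    return [max(0.0, v) for v in logits]

raw_logits = [2.0, -1.0, -3.0]        # hypothetical scores for 3 classes
p_raw = softmax(raw_logits)           # negative scores stay unlikely
p_relu = softmax(relu(raw_logits))    # clamped scores gain probability mass

print(p_raw)
print(p_relu)
```

With these numbers, the two classes with negative scores become more probable after clamping and the top class becomes less confident, which is the effect described above.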

That's what I thought. So I replaced it with activation=None, ran the model again, and then a surprising thing happened: the performance didn't improve. At all. In fact, it degraded significantly:

iteration=5  loss=5.236  train-acc=0.16602
iteration=10  loss=4.068  train-acc=0.18750
iteration=15  loss=3.110  train-acc=0.37402
iteration=20  loss=5.149  train-acc=0.14844
iteration=25  loss=2.880  train-acc=0.18262
Files.VAL mean accuracy = 0.28711

iteration=30  loss=3.136  train-acc=0.25781
iteration=35  loss=2.916  train-acc=0.22852
iteration=40  loss=2.156  train-acc=0.39062
iteration=45  loss=1.777  train-acc=0.45312
iteration=50  loss=2.726  train-acc=0.33105
Files.VAL mean accuracy = 0.29362

The accuracy got better with training, but never surpassed 91-92%. I changed the activation back and forth several times, varying different parameters (layer size, dropout, regularizer, extra layers, anything) and always had the same outcome: the "wrong" model hit 99% immediately, while the "right" model barely achieved 90% after 50 epochs. According to TensorBoard, there was no big difference in weight distribution: the gradients didn't die out and both models learned normally.

How is this possible? How can the final ReLu make a model so much superior? Especially if this ReLu is a bug?

Answer 1

Prediction distribution

After playing around with it for a while, I decided to visualize the actual prediction distribution for both models:

predicted_distribution = tf.nn.softmax(logits, name='distribution')

Below are the histograms of the distributions and how they evolved over time.

With ReLu (wrong model)

Without ReLu (correct model)

The histogram of the model without ReLu makes sense: most of the probabilities are close to 0. But the histogram of the ReLu model is suspicious: the values seem to concentrate around 0.15 after a few iterations. Printing the actual predictions confirmed this idea:

[0.14286 0.14286 0.14286 0.14286 0.14286 0.14286 0.14286]
[0.14286 0.14286 0.14286 0.14286 0.14286 0.14286 0.14286]

I had 7 classes (for 7 different languages at that moment) and 0.14286 is 1/7. It turns out the "perfect" model learned to output 0 logits, which in turn translated into a uniform prediction.
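A quick check in plain Python (a sketch, independent of the TensorFlow graph) shows what all-zero logits imply for 7 classes: every probability is 1/7, and the cross-entropy of any label is ln 7, about 1.946. That is consistent with the loss plateau around 1.94 in the training log of the ReLu model. The helper names below are made up for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])

num_classes = 7
probs = softmax([0.0] * num_classes)   # what a ReLu layer stuck at zero emits
loss = cross_entropy(probs, label=3)   # the label does not matter here

print([round(p, 5) for p in probs])    # seven copies of 0.14286
print(round(loss, 3))                  # 1.946
```

A loss that settles at ln(number of classes) is a hint that the model is predicting a uniform distribution rather than learning.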

But how can this distribution be reported as 99% accurate?

`tf.nn.in_top_k`

Before diving into tf.nn.in_top_k, I checked an alternative way to compute accuracy:

true_correct = tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64))
alternative_accuracy = tf.reduce_mean(tf.cast(true_correct, tf.float32))

... which performs an honest comparison of the highest predicted class and the ground truth. The result is this:

iteration=2  loss=3.992  train-acc=0.13086  train-alt-acc=0.13086
iteration=4  loss=3.590  train-acc=0.13086  train-alt-acc=0.12207
iteration=6  loss=2.871  train-acc=0.21777  train-alt-acc=0.13672
iteration=8  loss=2.466  train-acc=0.37695  train-alt-acc=0.16211
iteration=10  loss=2.099  train-acc=0.62305  train-alt-acc=0.10742
iteration=12  loss=2.066  train-acc=0.79980  train-alt-acc=0.17090
iteration=14  loss=2.016  train-acc=0.84277  train-alt-acc=0.17285
iteration=16  loss=1.954  train-acc=0.91309  train-alt-acc=0.13574
iteration=18  loss=1.956  train-acc=0.95508  train-alt-acc=0.06445
iteration=20  loss=1.923  train-acc=0.97754  train-alt-acc=0.11328

Indeed, tf.nn.in_top_k with k=1 quickly diverged from the true accuracy and began to report fantasized 99% values. So what does it actually do? Here's what the documentation says about it:

Says whether the targets are in the top K predictions.

This outputs a batch_size bool array, an entry out[i] is true if the prediction for the target class is among the top k predictions among all predictions for example i. Note that the behavior of InTopK differs from the TopK op in its handling of ties; if multiple classes have the same prediction value and straddle the top-k boundary, all of those classes are considered to be in the top k.

That's what it is. If the probabilities are uniform (which actually means "I have no idea"), every class is tied for the top spot, so every prediction is counted as correct. The situation is even worse, because if the logits distribution is almost uniform, softmax may transform it into an exactly uniform distribution, as can be seen in this simple example:

x = tf.constant([0, 1e-8, 1e-8, 1e-9])
tf.nn.softmax(x).eval()
# >>> array([0.25, 0.25, 0.25, 0.25], dtype=float32)

... which means that every nearly uniform prediction may be considered "correct" according to the tf.nn.in_top_k spec.
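The whole effect can be reproduced without TensorFlow. The sketch below is an emulation under stated assumptions, not TensorFlow's implementation: to_float32 rounds through the struct module to mimic float32 arithmetic, in_top_k_with_ties encodes the tie rule quoted from the documentation (a target is in the top k if fewer than k scores are strictly larger), and argmax_is_correct uses Python's first-maximum index as a stand-in for tf.argmax.

```python
import math
import struct

def to_float32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def softmax_float32(logits):
    """Softmax with every intermediate result rounded to float32."""
    exps = [to_float32(math.exp(to_float32(v))) for v in logits]
    total = to_float32(sum(exps))
    return [to_float32(e / total) for e in exps]

def in_top_k_with_ties(scores, target, k=1):
    """Tie-tolerant check: true if fewer than k scores beat the target's."""
    strictly_larger = sum(1 for s in scores if s > scores[target])
    return strictly_larger < k

def argmax_is_correct(scores, target):
    """Strict check: the first maximal index must equal the target."""
    return scores.index(max(scores)) == target

probs = softmax_float32([0.0, 1e-8, 1e-8, 1e-9])
labels = [0, 1, 2, 3]                  # try every class as the "truth"
tie_tolerant = [in_top_k_with_ties(probs, t) for t in labels]
strict = [argmax_is_correct(probs, t) for t in labels]

print(probs)         # [0.25, 0.25, 0.25, 0.25]
print(tie_tolerant)  # [True, True, True, True]
print(strict)        # [True, False, False, False]
```

The differences between the logits are smaller than float32 can resolve near 1.0, so the probabilities tie exactly. The tie-tolerant rule then accepts any label, while the argmax comparison accepts only one.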

Conclusion

tf.nn.in_top_k is a dangerous choice of accuracy measure in TensorFlow, because it may silently swallow wrong predictions and report them as "correct". Instead, you should always use this long but trusted expression:

accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64)), tf.float32))

Answer 1 (accepted, score 8), 2018-02-26 16:35:57
