Neural Network: Mysterious ReLu
I've been building a programming language detector, i.e., a classifier of code snippets, as part of a bigger project. My baseline model is pretty straightforward: tokenize the input, encode the snippets as bag-of-words (or, in this case, bag-of-tokens), and put a simple NN on top of these features.
The input to the NN is a fixed-length array of counters of the most distinctive tokens, such as "def", "self", "function", "->", "const", "#include", etc., that are automatically extracted from the corpus. The idea is that these tokens are pretty unique to particular programming languages, so even this naive approach should get a high accuracy score.
Input:
def 1
for 2
in 2
True 1
): 3
,: 1
...
Output: python
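The encoding described above can be sketched roughly as follows. This is only an illustration: the vocabulary VOCAB, the whitespace tokenizer and the encode helper are assumptions made for the example, not the code from the original script, where the vocabulary is extracted from the corpus.

```python
from collections import Counter

# Hypothetical vocabulary; in the post it is extracted automatically from the corpus.
VOCAB = ["def", "self", "function", "->", "const", "#include", "for", "in"]

def encode(snippet, vocab=VOCAB):
    """Return a fixed-length list of counts, one entry per vocabulary token."""
    # Naive whitespace tokenizer, used here only for illustration.
    counts = Counter(snippet.split())
    # Tokens outside the vocabulary are ignored; missing tokens count as 0.
    return [counts[token] for token in vocab]

snippet = "def total ( self , items ) : for x in items : pass"
features = encode(snippet)
print(features)
```

The vector always has len(VOCAB) entries regardless of the snippet length, which is what lets it feed a dense layer with a fixed input size.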
I got 99% accuracy pretty quickly and took that as a sign that it works just as expected. Here's the model (a full runnable script is here):
import tensorflow as tf  # TensorFlow 1.x API

# vocab_size and classes are defined in the full script.

# Placeholders
x = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name='x')
y = tf.placeholder(shape=[None], dtype=tf.int32, name='y')
training = tf.placeholder_with_default(False, shape=[], name='training')

# One hidden layer with dropout
reg = tf.contrib.layers.l2_regularizer(0.01)
hidden1 = tf.layers.dense(x, units=96, kernel_regularizer=reg,
                          activation=tf.nn.elu, name='hidden1')
dropout1 = tf.layers.dropout(hidden1, rate=0.2, training=training, name='dropout1')

# Output layer
logits = tf.layers.dense(dropout1, units=classes, kernel_regularizer=reg,
                         activation=tf.nn.relu, name='logits')

# Cross-entropy loss
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))

# Misc reports: accuracy, correct/misclassified samples, etc.
correct_predicted = tf.nn.in_top_k(logits, y, 1, name='in-top-k')
prediction = tf.argmax(logits, axis=1)
wrong_predicted = tf.logical_not(correct_predicted, name='not-in-top-k')
x_misclassified = tf.boolean_mask(x, wrong_predicted, name='misclassified')
accuracy = tf.reduce_mean(tf.cast(correct_predicted, tf.float32), name='accuracy')
The output is pretty encouraging:
iteration=5 loss=2.580 train-acc=0.34277
iteration=10 loss=2.029 train-acc=0.69434
iteration=15 loss=2.054 train-acc=0.92383
iteration=20 loss=1.934 train-acc=0.98926
iteration=25 loss=1.942 train-acc=0.99609
Files.VAL mean accuracy = 0.99121 <-- After just 1 epoch!
iteration=30 loss=1.943 train-acc=0.99414
iteration=35 loss=1.947 train-acc=0.99512
iteration=40 loss=1.946 train-acc=0.99707
iteration=45 loss=1.946 train-acc=0.99609
iteration=50 loss=1.944 train-acc=0.99902
iteration=55 loss=1.946 train-acc=0.99902
Files.VAL mean accuracy = 0.99414
Test accuracy was also around 1.0. Everything looked perfect.
But then I noticed that I had put activation=tf.nn.relu into the final dense layer (logits), which is clearly a bug: there is no need to discard negative scores before softmax, because they indicate the classes with low probability. A zero threshold will only make these classes artificially more probable, which would be a mistake. Getting rid of it should only make the model more robust and more confident in the correct class.
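A small NumPy sketch can illustrate why clipping negative logits at zero inflates the low-scoring classes. The logit values are made up for the example, and the softmax helper is a plain reimplementation rather than the TensorFlow op.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, -1.0, -3.0])  # made-up scores for three classes
clipped = np.maximum(logits, 0.0)     # what a final ReLU does to them

p_raw = softmax(logits)    # probabilities from the raw logits
p_relu = softmax(clipped)  # probabilities after the zero threshold

print(p_raw)
print(p_relu)
```

After clipping, the two negative scores both become 0, so their classes receive a larger share of the probability mass and the top class receives a smaller one.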
That's what I thought. So I replaced it with activation=None, ran the model again, and then a surprising thing happened: the performance didn't improve. At all. In fact, it degraded significantly:
iteration=5 loss=5.236 train-acc=0.16602
iteration=10 loss=4.068 train-acc=0.18750
iteration=15 loss=3.110 train-acc=0.37402
iteration=20 loss=5.149 train-acc=0.14844
iteration=25 loss=2.880 train-acc=0.18262
Files.VAL mean accuracy = 0.28711
iteration=30 loss=3.136 train-acc=0.25781
iteration=35 loss=2.916 train-acc=0.22852
iteration=40 loss=2.156 train-acc=0.39062
iteration=45 loss=1.777 train-acc=0.45312
iteration=50 loss=2.726 train-acc=0.33105
Files.VAL mean accuracy = 0.29362
The accuracy got better with further training, but never surpassed 91-92%. I changed the activation back and forth several times, varying different parameters (layer size, dropout, regularizer, extra layers, anything) and always had the same outcome: the "wrong" model hit 99% immediately, while the "right" model barely achieved 90% after 50 epochs. According to TensorBoard, there was no big difference in the weight distributions: the gradients didn't die out and both models learned normally.
How is this possible? How can the final ReLu make a model so much better? Especially if this ReLu is a bug?
After playing around with it for a while, I decided to visualize the actual prediction distribution for both models:
predicted_distribution = tf.nn.softmax(logits, name='distribution')
Below are the histograms of the distributions and how they evolved over time.
[Histogram: with ReLu (wrong model)]
[Histogram: without ReLu (correct model)]
The histogram for the model without ReLu makes sense: most of the probabilities are close to 0. But the histogram of the ReLu model is suspicious: the values seem to concentrate around 0.15 after a few iterations. Printing the actual predictions confirmed this idea:
[0.14286 0.14286 0.14286 0.14286 0.14286 0.14286 0.14286]
[0.14286 0.14286 0.14286 0.14286 0.14286 0.14286 0.14286]
I had 7 classes (for 7 different languages at that moment) and 0.14286 is 1/7. It turns out the "perfect" model learned to output all-zero logits, which in turn translated into a uniform prediction.
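As a quick sanity check, here is a minimal NumPy sketch of what all-zero logits give for 7 classes. The variable names are illustrative and this is not part of the original script.

```python
import numpy as np

num_classes = 7
logits = np.zeros(num_classes)                  # what the "perfect" model outputs
probs = np.exp(logits) / np.exp(logits).sum()   # softmax of equal logits

# Cross-entropy of a uniform prediction is the same for any target class: ln(7).
loss = -np.log(probs[0])

print(probs)
print(round(loss, 3))  # 1.946
```

The value ln(7) ≈ 1.946 appears to match the level at which the loss of the ReLu model plateaus in the training log above, which is consistent with the model producing uniform predictions.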
But how can this distribution be reported as 99% accurate?
tf.nn.in_top_k
Before diving into tf.nn.in_top_k, I checked an alternative way to compute accuracy:
true_correct = tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64))
alternative_accuracy = tf.reduce_mean(tf.cast(true_correct, tf.float32))
... which performs an honest comparison of the highest predicted class and the ground truth. The result is this:
iteration=2 loss=3.992 train-acc=0.13086 train-alt-acc=0.13086
iteration=4 loss=3.590 train-acc=0.13086 train-alt-acc=0.12207
iteration=6 loss=2.871 train-acc=0.21777 train-alt-acc=0.13672
iteration=8 loss=2.466 train-acc=0.37695 train-alt-acc=0.16211
iteration=10 loss=2.099 train-acc=0.62305 train-alt-acc=0.10742
iteration=12 loss=2.066 train-acc=0.79980 train-alt-acc=0.17090
iteration=14 loss=2.016 train-acc=0.84277 train-alt-acc=0.17285
iteration=16 loss=1.954 train-acc=0.91309 train-alt-acc=0.13574
iteration=18 loss=1.956 train-acc=0.95508 train-alt-acc=0.06445
iteration=20 loss=1.923 train-acc=0.97754 train-alt-acc=0.11328
Indeed, tf.nn.in_top_k with k=1 quickly diverged from the true accuracy and began to report imaginary 99% values. So what does it actually do? Here's what the documentation says about it:
Says whether the targets are in the top K predictions.
This outputs a batch_size bool array, an entry out[i] is true if the prediction for the target class is among the top k predictions among all predictions for example i. Note that the behavior of InTopK differs from the TopK op in its handling of ties; if multiple classes have the same prediction value and straddle the top-k boundary, all of those classes are considered to be in the top k.
That's what it is. If the probabilities are uniform (which actually means "I have no idea"), every class is tied for first place, so the target class is always counted as being in the top 1 and every prediction is reported as correct. The situation is even worse, because if the logits distribution is almost uniform, softmax may transform it into an exactly uniform distribution, as can be seen in this simple example:
x = tf.constant([0, 1e-8, 1e-8, 1e-9])
tf.nn.softmax(x).eval()
# >>> array([0.25, 0.25, 0.25, 0.25], dtype=float32)
... which means that every nearly uniform prediction may be considered "correct" according to the tf.nn.in_top_k spec.
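The tie rule quoted from the documentation can be mimicked in NumPy to see its effect on uniform predictions. The function in_top_k_with_ties below is a hypothetical reimplementation of that documented rule, written for illustration; it is not TensorFlow's implementation, and the random targets are made-up data.

```python
import numpy as np

def in_top_k_with_ties(predictions, targets, k):
    """Count the target as a hit if fewer than k classes score strictly higher.

    Classes tied with the target never push it out of the top k, which is
    the tie handling described in the documentation quoted above.
    """
    predictions = np.asarray(predictions)
    targets = np.asarray(targets)
    target_scores = predictions[np.arange(len(targets)), targets]
    strictly_higher = (predictions > target_scores[:, None]).sum(axis=1)
    return strictly_higher < k

rng = np.random.default_rng(0)
targets = rng.integers(0, 7, size=1000)   # made-up labels for 7 classes
uniform = np.full((1000, 7), 1.0 / 7.0)   # "I have no idea" predictions

tie_acc = in_top_k_with_ties(uniform, targets, k=1).mean()
argmax_acc = (uniform.argmax(axis=1) == targets).mean()

print(tie_acc)     # 1.0
print(argmax_acc)  # share of samples whose label happens to be class 0
```

With all scores tied, the tie-tolerant check marks every sample as correct, while the argmax comparison only credits the samples whose label is the first class, because np.argmax returns the first index among equal values.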
tf.nn.in_top_k is a dangerous choice of accuracy measure in TensorFlow, because it may silently swallow wrong predictions and report them as "correct". Instead, you should always use this long but trusted expression:
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64)), tf.float32))