
A Keras model trains well, but predicts the same value

Let's try to make MobileNetV2 locate a bright band on a noisy image. Yes, it is overkill to use a deep convolutional network for such a task, but originally it was intended just as a smoke test to make sure the model works. We will train it on synthetic data:

import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt

SHAPE = (32, 320, 1)
def gen_sample():
    while True:
        data = np.random.normal(0, 1, SHAPE)
        i = np.random.randint(0, SHAPE[1]-8)
        data[:,i:i+8,:] += 4
        yield data.astype(np.float32), np.float32(i)

ds = tf.data.Dataset.from_generator(gen_sample, output_signature=(
    tf.TensorSpec(shape=SHAPE, dtype=tf.float32),
    tf.TensorSpec(shape=(), dtype=tf.float32))).batch(100)

d, i = next(gen_sample())
plt.figure()
plt.imshow(d.squeeze())  # drop the channel axis: imshow cannot render (H, W, 1)
plt.show()

(Sample image)

Now we build and train a model:

model = tf.keras.models.Sequential([
    tf.keras.applications.MobileNetV2(
        input_shape=SHAPE, include_top=False, weights=None, alpha=0.5),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1)
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=tf.keras.optimizers.schedules.ExponentialDecay(
            initial_learning_rate=0.01, decay_steps=1000, decay_rate=0.9)),
    loss='mean_squared_error')
history = model.fit(ds, steps_per_epoch=10, epochs=40)

We use generated data, so we don't need a validation set, do we? So we can just watch how the loss decreases. And it does decrease decently well:

Epoch 1/40
10/10 [==============================] - 27s 2s/step - loss: 15054.8417
Epoch 2/40
10/10 [==============================] - 23s 2s/step - loss: 193.9126
Epoch 3/40
10/10 [==============================] - 24s 2s/step - loss: 76.9586
Epoch 4/40
10/10 [==============================] - 25s 2s/step - loss: 68.8521
...
Epoch 37/40
10/10 [==============================] - 20s 2s/step - loss: 4.5258
Epoch 38/40
10/10 [==============================] - 20s 2s/step - loss: 22.1212
Epoch 39/40
10/10 [==============================] - 20s 2s/step - loss: 28.4854
Epoch 40/40
10/10 [==============================] - 20s 2s/step - loss: 18.0123

Training happened to stop not at the best result, but it should still be reasonable: the answers should lie within about ±8 of the true value. Let's test it:

d, i = list(ds.take(1))[0]
model.evaluate(d, i)
np.stack((model.predict(d).ravel(), i.numpy()), 1)[:10,]
4/4 [==============================] - 0s 32ms/step - loss: 16955.7871
array([[ 66.84666 , 222.      ],
       [ 66.846664,  46.      ],
       [ 66.846664,  71.      ],
       [ 66.84668 , 268.      ],
       [ 66.846664,  86.      ],
       [ 66.84668 , 121.      ],
       [ 66.846664, 301.      ],
       [ 66.84667 , 106.      ],
       [ 66.84665 , 138.      ],
       [ 66.84667 ,  95.      ]], dtype=float32)

Wow? Where does this huge evaluation loss come from? And why does the model keep predicting the same stupid value? Everything was so good during the training!

Actually, in a day or so I realized what was going on, but I offer others the chance to solve this charade and earn some points.

The problem was that a network reasonably functioning in training mode failed to work in inference mode. What might be the cause? There are basically two layer types that work differently in the two modes: dropout and batch normalization. In MobileNetV2 we have only batch normalization. The discrepancy itself is easy to demonstrate; a minimal check (a sketch reusing the model and the batch d from above):
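
# training=True makes the BN layers normalize with this batch's own
# statistics; training=False uses the stored moving averages instead.
# (Calling with training=True also nudges the moving averages slightly.)
pred_train = model(d, training=True).numpy().ravel()
pred_infer = model(d, training=False).numpy().ravel()
print(pred_train[:5])  # varied values, roughly tracking the band positions
print(pred_infer[:5])  # the same nearly constant value for every sample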

So let's consider how batch normalization works. In training mode a BN layer calculates the batch mean and variance and normalizes the data using these batch values. At the same time it remembers the mean and variance as a moving average weighted with a coefficient called momentum:

moving_mean = moving_mean * momentum + mean(batch) * (1 - momentum)
moving_var = moving_var * momentum + var(batch) * (1 - momentum)
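
One update step is easy to observe on a standalone layer (a small sketch; a fresh layer starts from moving_var = 1.0):

bn = tf.keras.layers.BatchNormalization(momentum=0.99)
x = tf.random.normal((100, 8), stddev=0.1 ** 0.5)  # true variance ≈ 0.1
bn(x, training=True)               # one training step
print(bn.moving_variance.numpy())  # ≈ 0.99 * 1.0 + 0.01 * 0.1 ≈ 0.991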

Indeed, this momentum is an important hyperparameter, especially if the true batch statistics are far from the initial values. Suppose the initial variance value is 1.0, the momentum is 0.99 (the default), and the true data variance is 0.1. Then a 10% error (var < 0.11) can only be achieved after 447 batches.

Now the root cause of the problem: in MobileNet all the numerous BN layers have momentum=0.999, which means it will take 4497 batch steps to achieve the same 10% error. When you are training on a very large heterogeneous data set like ImageNet in small batches, this is a 100% reasonable hyperparameter choice. But in this toy example the result is that the BN layers simply fail to learn the true data statistics during the 400 training batches and use completely wrong values during inference!
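
This arithmetic is easy to check: the moving variance decays geometrically from the initial value toward the true one, so the required number of steps follows directly from momentum**n (a short sketch):

import math

# moving_var after n batches = true_var + (init_var - true_var) * momentum**n
# Require 0.1 + 0.9 * momentum**n < 0.11, i.e. n > log(90) / log(1 / momentum)
for momentum in (0.99, 0.999):
    n = math.log(90) / math.log(1 / momentum)
    print(f"momentum={momentum}: {n:.1f} batches")
# momentum=0.99: 447.7 batches
# momentum=0.999: 4497.6 batches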

And the fix is very simple: just change the momenta before model.compile:

# lower the momentum of every BN layer inside the MobileNetV2 sub-model
# so the moving statistics can converge within a few dozen batches
for layer in model.layers[0].layers:
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.momentum = 0.9
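
A quick sanity check (a sketch, assuming the Sequential wrapping from above) confirms that the change took effect:

bn_layers = [l for l in model.layers[0].layers
             if isinstance(l, tf.keras.layers.BatchNormalization)]
assert bn_layers and all(l.momentum == 0.9 for l in bn_layers)

By the same arithmetic as above, with momentum=0.9 the moving statistics reach the 10% error after about 43 batches, comfortably within the 400 training steps.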
