
Tensorflow NaN loss during training with tf.math.is_nan function

I have written a custom loss function that returns a loss of 0 when the ground truth labels (6d vector) are NaN and otherwise returns the mean squared error. Either all 6 features in the label are NaN, or there are no NaNs.

My loss function looks like:

tf.reduce_mean(tf.where(tf.math.is_nan(true_labels), tf.zeros_like(true_labels),
tf.square(tf.subtract(true_labels, predicted_labels))))

where true_labels and predicted_labels have shape (batch_size, 6), and only entire rows of either matrix can be NaN. I get NaN loss values in this case, even though I should be returning 0 for the loss when the ground truth is NaN. I have also tested this issue with a workaround: replacing all the NaN values with a large negative number (-1e4, which is outside the range of my data) during preprocessing, and then testing for those sentinel values in my loss function by using

tf.where(tf.math.less(true_labels, -9999), tf.zeros_like(true_labels),
tf.square(tf.subtract(true_labels, predicted_labels)))

This is a total hack, but works nonetheless. Therefore, I believe the issue is with the tf.math.is_nan function, but I have no idea why it gives me NaN losses. Furthermore, I have tested the loss function outside of training mode on some labels I made artificially, and it does not return NaNs then. Any help is appreciated.
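A minimal standalone version of that test (a sketch with hand-made labels matching the format shown further down) does come back finite:

import numpy as np
import tensorflow as tf

# Hand-made labels: one fully valid row, one row whose regression targets are NaN.
true_labels = tf.constant([[1., 106., 189., 2.64826314, 19., 26.44962941],
                           [0., np.nan, np.nan, np.nan, np.nan, np.nan]])
predicted_labels = tf.zeros_like(true_labels)

loss = tf.reduce_mean(tf.where(tf.math.is_nan(true_labels),
                               tf.zeros_like(true_labels),
                               tf.square(tf.subtract(true_labels, predicted_labels))))
print(loss)  # a finite number: the NaN entries are masked out of the forward pass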

This is my model below. It returns a (batch_size, 6) shaped Tensor. The first column is sigmoid-activated to lie in [0, 1] and is fed into a binary cross-entropy loss function (which I did not include here, but I confirmed that the NaN is not coming from the binary loss). The remaining 5 columns are fed into the custom loss function defined above.

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D,
                                     Dense, Flatten, MaxPool2D, Reshape)


def custom_activation(tensor):
    first_node_sigmoid = tf.nn.sigmoid(tensor[:, :1])
    return tf.concat([first_node_sigmoid, tensor[:, 1:]], axis = 1)


def gen_model():
    IMAGE_SIZE = 200
    CONV_PARAMS = {"kernel_size": 3, "use_bias": False, "padding": "same"}
    CONV_PARAMS2 = {"kernel_size": 5, "use_bias": False, "padding": "same"}

    model = Sequential()
    model.add(
        Reshape((IMAGE_SIZE, IMAGE_SIZE, 1), input_shape=(IMAGE_SIZE, IMAGE_SIZE))
    )
    model.add(Conv2D(16, **CONV_PARAMS))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Conv2D(32, **CONV_PARAMS))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Conv2D(64, **CONV_PARAMS))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Conv2D(64, **CONV_PARAMS))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Conv2D(64, **CONV_PARAMS2))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Conv2D(128, **CONV_PARAMS2))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Conv2D(128, **CONV_PARAMS2))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Flatten())
    model.add(Dense(64))
    model.add(Dense(6))
    model.add(tf.keras.layers.Lambda(custom_activation, name = "final_activation_layer"))
    return model

Here is an example of what the ground truth label looks like when the first feature is True (1):

[  1.         106.         189.           2.64826314  19.          26.44962941]

When the first feature is False (0), the label is

[0, nan, nan, nan, nan, nan]

Edit: Added details of model and label examples

Update:

After some debugging with tf.print statements, I found that my predicted_labels are coming out as all NaN values. This issue does not occur when I use the 'hack' described above, so I don't think it is an issue with my data. I also checked that none of my images contain any NaNs after preprocessing when used as input to the network. Somehow, with the loss function described above, I get NaNs in my predicted values, but I have no idea why. I have tried lowering the learning rate and batch size, but this does not help.
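A plausible explanation (a known pitfall with tf.where, not spelled out in the thread) is that tf.where only masks the forward value: the gradient of the unselected branch is still computed, and the zeroed upstream gradient gets multiplied by the NaN coming out of tf.square(true_labels - predicted_labels). Since 0 * NaN is NaN, the weight updates become NaN and the predictions follow. A small sketch that reproduces this:

import tensorflow as tf

true_labels = tf.constant([[0., float("nan"), float("nan")]])
predicted_labels = tf.Variable([[0.5, 0.5, 0.5]])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(
        tf.where(tf.math.is_nan(true_labels),
                 tf.zeros_like(true_labels),
                 tf.square(true_labels - predicted_labels)))

print(loss)                                   # finite: the forward pass is masked
print(tape.gradient(loss, predicted_labels))  # contains NaN: the backward pass is not

This would also explain why the -1e4 sentinel hack works (no NaN ever appears anywhere in the graph), and it is exactly what the answer below avoids by replacing the NaNs before the subtraction.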

Maybe something like the following could work for you. All nan elements are first converted to 0, while the remaining elements stay the same. For example, [0, np.nan, np.nan, np.nan, np.nan, np.nan] results in [0, 0, 0, 0, 0, 0], while [1., 106., 189., 2.64826314, 19., 26.44962941] remains untouched. Afterwards, your loss is only calculated for non-zero values. If true_labels are zero, then you just return 0.

import tensorflow as tf
import numpy as np

def custom_loss(true_labels, predicted_labels):

   true_labels = tf.where(tf.math.is_nan(true_labels), tf.zeros_like(true_labels), true_labels)
   loss = tf.reduce_mean(
       tf.where(tf.equal(true_labels, 0.0), true_labels,
       tf.square(tf.subtract(true_labels, predicted_labels))))
   return loss
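# Because the NaNs in true_labels are replaced *before* the subtraction, no NaN
# ever reaches the squared-difference term, so (as long as the predictions are
# finite) neither the forward pass nor the gradients flowing back through
# tf.where can produce one.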

def custom_activation(tensor):
    first_node_sigmoid = tf.nn.sigmoid(tensor[:, :1])
    return tf.concat([first_node_sigmoid, tensor[:, 1:]], axis = 1)


def gen_model():
    IMAGE_SIZE = 200
    CONV_PARAMS = {"kernel_size": 3, "use_bias": False, "padding": "same"}
    CONV_PARAMS2 = {"kernel_size": 5, "use_bias": False, "padding": "same"}

    model = tf.keras.Sequential()
    model.add(
        tf.keras.layers.Reshape((IMAGE_SIZE, IMAGE_SIZE, 1), input_shape=(IMAGE_SIZE, IMAGE_SIZE))
    )
    model.add(tf.keras.layers.Conv2D(16, **CONV_PARAMS))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Conv2D(32, **CONV_PARAMS))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Conv2D(64, **CONV_PARAMS))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Conv2D(64, **CONV_PARAMS))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Conv2D(64, **CONV_PARAMS2))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Conv2D(128, **CONV_PARAMS2))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Conv2D(128, **CONV_PARAMS2))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(64))
    model.add(tf.keras.layers.Dense(6))
    model.add(tf.keras.layers.Lambda(custom_activation, name = "final_activation_layer"))
    return model

Y_train = tf.constant([[1., 106., 189., 2.64826314, 19., 26.44962941], 
                       [0, np.nan, np.nan, np.nan, np.nan, np.nan]])
model = gen_model()
model.compile(loss=custom_loss, optimizer=tf.keras.optimizers.Adam())
model.fit(tf.random.normal((2, 200, 200)), Y_train, epochs=4)
Epoch 1/4
1/1 [==============================] - 1s 1s/step - loss: 4112.9380
Epoch 2/4
1/1 [==============================] - 0s 30ms/step - loss: 947.3030
Epoch 3/4
1/1 [==============================] - 0s 25ms/step - loss: 25.8993
Epoch 4/4
1/1 [==============================] - 0s 24ms/step - loss: 217.2151
<keras.callbacks.History at 0x7f8490b8db90>
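As a quick sanity check, something like the following (reusing custom_loss and Y_train from above) should also show that the gradients stay finite even for the all-NaN row:

preds = tf.Variable([[0.5] * 6, [0.5] * 6])
with tf.GradientTape() as tape:
    loss = custom_loss(Y_train, preds)
print(tape.gradient(loss, preds))  # finite everywhere, no NaNs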
