
Tensorflow NaN loss during training with tf.math.is_nan function

I have written a custom loss function that returns a loss of 0 when the ground truth labels (6d vector) are NaN and otherwise returns the mean squared error. Either all 6 features in the label are NaN, or there are no NaNs.

My loss function looks like:

tf.reduce_mean(tf.where(tf.math.is_nan(true_labels), tf.zeros_like(true_labels),
tf.square(tf.subtract(true_labels, predicted_labels))))

where true_labels and predicted_labels have shape (batch_size, 6), and only entire rows of either matrix can be NaN. I get NaN loss values in this case, even though I should be returning 0 for the loss when the ground truth is NaN. I have also tested this issue with a workaround: replacing all the NaN values with a large negative number (-1e4, which is outside the range of my data) during preprocessing, and then testing for those sentinel values in my loss function by using

tf.where(tf.math.less(true_labels, -9999), tf.zeros_like(true_labels),
tf.square(tf.subtract(true_labels, predicted_labels)))

This is a total hack, but works nonetheless. Therefore, I believe the issue is with the tf.math.is_nan function, but I have no idea why it gives me NaN losses. Furthermore, I have tested the loss function outside of training mode on some labels I made artificially, and it does not return NaNs then. Any help is appreciated.
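A minimal standalone version of that test (a sketch with hand-made labels matching the format shown further down) does come back finite:

import numpy as np
import tensorflow as tf

# Hand-made labels: one fully valid row, one row whose regression targets are NaN.
true_labels = tf.constant([[1., 106., 189., 2.64826314, 19., 26.44962941],
                           [0., np.nan, np.nan, np.nan, np.nan, np.nan]])
predicted_labels = tf.zeros_like(true_labels)

loss = tf.reduce_mean(tf.where(tf.math.is_nan(true_labels),
                               tf.zeros_like(true_labels),
                               tf.square(tf.subtract(true_labels, predicted_labels))))
print(loss)  # a finite number: the NaN entries are masked out of the forward pass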

This is my model below. It returns a (batch_size, 6) shaped Tensor. The first column is sigmoid-activated to lie in [0, 1] and is fed into a binary cross-entropy loss function (which I did not include here, but I confirmed that the NaN is not coming from the binary loss). The remaining 5 columns are fed into the custom loss function defined above.

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D,
                                     Dense, Flatten, MaxPool2D, Reshape)


def custom_activation(tensor):
    first_node_sigmoid = tf.nn.sigmoid(tensor[:, :1])
    return tf.concat([first_node_sigmoid, tensor[:, 1:]], axis = 1)


def gen_model():
    IMAGE_SIZE = 200
    CONV_PARAMS = {"kernel_size": 3, "use_bias": False, "padding": "same"}
    CONV_PARAMS2 = {"kernel_size": 5, "use_bias": False, "padding": "same"}

    model = Sequential()
    model.add(
        Reshape((IMAGE_SIZE, IMAGE_SIZE, 1), input_shape=(IMAGE_SIZE, IMAGE_SIZE))
    )
    model.add(Conv2D(16, **CONV_PARAMS))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Conv2D(32, **CONV_PARAMS))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Conv2D(64, **CONV_PARAMS))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Conv2D(64, **CONV_PARAMS))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Conv2D(64, **CONV_PARAMS2))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Conv2D(128, **CONV_PARAMS2))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Conv2D(128, **CONV_PARAMS2))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Flatten())
    model.add(Dense(64))
    model.add(Dense(6))
    model.add(tf.keras.layers.Lambda(custom_activation, name = "final_activation_layer"))
    return model

Here is an example of what the ground truth label looks like when the first feature is True (1):

[  1.         106.         189.           2.64826314  19.          26.44962941]

When the first feature is False (0), the label is

[0, nan, nan, nan, nan, nan]

Edit: Added details of model and label examples

Update:

After some debugging with tf.print statements, I found that my predicted_labels are coming out as all NaN values. This issue does not occur when I use the 'hack' described above, so I don't think it is an issue with my data. I also checked that none of my images contain any NaNs after preprocessing when used as input to the network. Somehow, with the loss function described above, I get NaNs in my predicted values, but I have no idea why. I have tried lowering the learning rate and batch size, but this does not help.
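A plausible explanation (a known pitfall with tf.where, not spelled out in the thread) is that tf.where only masks the forward value: the gradient of the unselected branch is still computed, and the zeroed upstream gradient gets multiplied by the NaN coming out of tf.square(true_labels - predicted_labels). Since 0 * NaN is NaN, the weight updates become NaN and the predictions follow. A small sketch that reproduces this:

import tensorflow as tf

true_labels = tf.constant([[0., float("nan"), float("nan")]])
predicted_labels = tf.Variable([[0.5, 0.5, 0.5]])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(
        tf.where(tf.math.is_nan(true_labels),
                 tf.zeros_like(true_labels),
                 tf.square(true_labels - predicted_labels)))

print(loss)                                   # finite: the forward pass is masked
print(tape.gradient(loss, predicted_labels))  # contains NaN: the backward pass is not

This would also explain why the -1e4 sentinel hack works (no NaN ever appears anywhere in the graph), and it is exactly what the answer below avoids by replacing the NaNs before the subtraction.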

Maybe something like the following could work for you. All nan elements are first converted to 0, while the remaining elements stay the same. For example, [0, np.nan, np.nan, np.nan, np.nan, np.nan] results in [0, 0, 0, 0, 0, 0], while [1., 106., 189., 2.64826314, 19., 26.44962941] remains untouched. Afterwards, your loss is only calculated for non-zero values. If true_labels are zero, then you just return 0.

import tensorflow as tf
import numpy as np

def custom_loss(true_labels, predicted_labels):

   true_labels = tf.where(tf.math.is_nan(true_labels), tf.zeros_like(true_labels), true_labels)
   loss = tf.reduce_mean(
       tf.where(tf.equal(true_labels, 0.0), true_labels,
       tf.square(tf.subtract(true_labels, predicted_labels))))
   return loss
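# Because the NaNs in true_labels are replaced *before* the subtraction, no NaN
# ever reaches the squared-difference term, so (as long as the predictions are
# finite) neither the forward pass nor the gradients flowing back through
# tf.where can produce one.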

def custom_activation(tensor):
    first_node_sigmoid = tf.nn.sigmoid(tensor[:, :1])
    return tf.concat([first_node_sigmoid, tensor[:, 1:]], axis = 1)


def gen_model():
    IMAGE_SIZE = 200
    CONV_PARAMS = {"kernel_size": 3, "use_bias": False, "padding": "same"}
    CONV_PARAMS2 = {"kernel_size": 5, "use_bias": False, "padding": "same"}

    model = tf.keras.Sequential()
    model.add(
        tf.keras.layers.Reshape((IMAGE_SIZE, IMAGE_SIZE, 1), input_shape=(IMAGE_SIZE, IMAGE_SIZE))
    )
    model.add(tf.keras.layers.Conv2D(16, **CONV_PARAMS))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Conv2D(32, **CONV_PARAMS))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Conv2D(64, **CONV_PARAMS))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Conv2D(64, **CONV_PARAMS))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Conv2D(64, **CONV_PARAMS2))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Conv2D(128, **CONV_PARAMS2))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Conv2D(128, **CONV_PARAMS2))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(64))
    model.add(tf.keras.layers.Dense(6))
    model.add(tf.keras.layers.Lambda(custom_activation, name = "final_activation_layer"))
    return model

Y_train = tf.constant([[1., 106., 189., 2.64826314, 19., 26.44962941], 
                       [0, np.nan, np.nan, np.nan, np.nan, np.nan]])
model = gen_model()
model.compile(loss=custom_loss, optimizer=tf.keras.optimizers.Adam())
model.fit(tf.random.normal((2, 200, 200)), Y_train, epochs=4)
Epoch 1/4
1/1 [==============================] - 1s 1s/step - loss: 4112.9380
Epoch 2/4
1/1 [==============================] - 0s 30ms/step - loss: 947.3030
Epoch 3/4
1/1 [==============================] - 0s 25ms/step - loss: 25.8993
Epoch 4/4
1/1 [==============================] - 0s 24ms/step - loss: 217.2151
<keras.callbacks.History at 0x7f8490b8db90>
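As a quick sanity check, something like the following (reusing custom_loss and Y_train from above) should also show that the gradients stay finite even for the all-NaN row:

preds = tf.Variable([[0.5] * 6, [0.5] * 6])
with tf.GradientTape() as tape:
    loss = custom_loss(Y_train, preds)
print(tape.gradient(loss, preds))  # finite everywhere, no NaNs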
