
Weights of CNN model go to really small values and then NaN

I am not able to understand why the weights of the following model keep getting smaller and smaller during training until they become NaN.

The model is the following:

# Imports assumed by the code below (Keras 2.x with the TensorFlow 1.x backend,
# since the code relies on `tensor.shape[i].value`).
import numpy as np
from keras import backend as K
from keras.layers import (Activation, Convolution1D, Dense, Dropout, Embedding,
                          GlobalMaxPooling1D, Input, Lambda, Reshape)
from keras.models import Model
from keras.optimizers import Adam


def initialize_embedding_matrix(embedding_matrix):
    # Embedding layer initialized with the pre-trained vectors; trainable so they
    # get updated during training.
    embedding_layer = Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=embedding_matrix.shape[1],
        weights=[embedding_matrix],
        trainable=True)
    return embedding_layer

def get_divisor(x):
    return K.sqrt(K.sum(K.square(x), axis=-1))


def similarity(a, b):
    numerator = K.sum(a * b, axis=-1)
    denominator = get_divisor(a) * get_divisor(b)
    denominator = K.maximum(denominator, K.epsilon())
    return numerator / denominator


def max_margin_loss(positive, negative):
    loss_matrix = K.maximum(0.0, 1.0 + negative - Reshape((1,))(positive))
    loss = K.sum(loss_matrix, axis=-1, keepdims=True)
    return loss


def warp_loss(X):
    z, positive_entity, negatives_entities = X
    # Cosine similarity between the predicted vector z and the positive entity.
    positiveSim = Lambda(lambda x: similarity(x[0], x[1]),
                         output_shape=(1,), name="positive_sim")([z, positive_entity])
    z_reshaped = Reshape((1, z.shape[1].value))(z)
    # Cosine similarity between z and every negative entity.
    # `negatives_titles` is a global array holding the negative candidates (defined elsewhere).
    negativeSim = Lambda(lambda x: similarity(x[0], x[1]),
                         output_shape=(negatives_titles.shape[1].value, 1,),
                         name="negative_sim")([z_reshaped, negatives_entities])
    loss = Lambda(lambda x: max_margin_loss(x[0], x[1]),
                  output_shape=(1,), name="max_margin")([positiveSim, negativeSim])
    return loss

def mean_loss(y_true, y_pred):
    return K.mean(y_pred - 0 * y_true)

def build_nn_model():
    wl, tl = load_vector_lookups()
    embedded_layer_1 = initialize_embedding_matrix(wl)
    embedded_layer_2 = initialize_embedding_matrix(tl)

    sequence_input_1 = Input(shape=(_NUMBER_OF_LENGTH,), dtype='int32',name="text")
    sequence_input_positive = Input(shape=(1,), dtype='int32', name="positive")
    sequence_input_negatives = Input(shape=(10,), dtype='int32', name="negatives")

    embedded_sequences_1 = embedded_layer_1(sequence_input_1)
    embedded_sequences_positive = Reshape((tl.shape[1],))(embedded_layer_2(sequence_input_positive))
    embedded_sequences_negatives = embedded_layer_2(sequence_input_negatives)

    conv_step1 = Convolution1D(
        filters=1000,
        kernel_size=5,
        activation="tanh",
        name="conv_layer_mp",
        padding="valid")(embedded_sequences_1)

    conv_step2 = GlobalMaxPooling1D(name="max_pool_mp")(conv_step1)
    conv_step3 = Activation("tanh")(conv_step2)
    conv_step4 = Dropout(0.2, name="dropout_mp")(conv_step3)
    z = Dense(wl.shape[1], name="predicted_vec")(conv_step4) # activation="linear"

    loss = warp_loss([z, embedded_sequences_positive, embedded_sequences_negatives])
    model = Model(
        inputs=[sequence_input_1, sequence_input_positive, sequence_input_negatives],
        outputs=[loss]
        )
    model.compile(loss=mean_loss, optimizer=Adam())
    return model

model = build_nn_model()
# load_x_y() and the train/validation split are defined elsewhere.
x_train, y_real_train, y_fake_train = load_x_y()
X_train = {
    'text': x_train,
    'positive': y_real_train,
    'negatives': y_fake_train
}

model.fit(x=X_train, y=np.ones(len(x_train)), batch_size=10, shuffle=True, validation_split=0.1, epochs=10)

To describe the model a bit:

  • I have two pre-trained embeddings (wl, tl) and I initialize the Keras embeddings with these values.
  • There are 3 inputs. sequence_input_1 takes integers as input (indexes of words, e.g. [42, 32, .., 4]); sequence.pad_sequences(X, maxlen=_NUMBER_OF_LENGTH) is applied to them so they have a fixed length. sequence_input_positive is the integer index of the positive output, and sequence_input_negatives holds N random negative outputs (10 in the code above) for each example.
  • max_margin_loss measures the difference between cosine_similarity(positive_example, sequence_input_1) and cosine_similarity(negative_example[i], sequence_input_1), and the Adam optimizer is used to minimize the loss (a small numeric check of this computation is sketched after this list).
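
As a sanity check of what the loss above computes, here is a minimal NumPy sketch of the same cosine-similarity/max-margin computation on toy vectors (all values are made up for illustration):

import numpy as np

def cosine_sim(a, b):
    # Same formula as similarity(): dot product over the product of L2 norms,
    # with a small floor on the denominator.
    return np.dot(a, b) / max(np.linalg.norm(a) * np.linalg.norm(b), 1e-7)

# Toy predicted vector z, one positive embedding and two negative embeddings.
z = np.array([0.2, 0.1, 0.4])
positive = np.array([0.3, 0.0, 0.5])
negatives = [np.array([-0.1, 0.4, 0.0]), np.array([0.0, -0.2, 0.3])]

positive_sim = cosine_sim(z, positive)
negative_sims = np.array([cosine_sim(z, n) for n in negatives])

# Same hinge as max_margin_loss: penalize every negative whose similarity comes
# within a margin of 1.0 of the positive similarity.
loss = np.sum(np.maximum(0.0, 1.0 + negative_sims - positive_sim))
print(positive_sim, negative_sims, loss)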

While training this model, even with only 20 data points, the weights in the Convolution1D and Dense layers go to NaN. If I add more data points the embedding weights go to NaN too. I can observe that as the model runs the weights get smaller and smaller until they become NaN. Something noticeable as well is that the loss does not go to NaN: when the weights reach NaN, the loss goes to zero.

I am unable to find what is going wrong.

This is what I have tried until now:

  • I have seen that people use stochastic gradient descent when a hinge loss is used. Using the SGD optimizer did not change the behaviour here.
  • Changed the batch size. No change in behaviour.
  • Checked that the input data has no nan values.
  • Normalized the input matrix (pre-trained data) for the embeddings with np.linalg.norm (a sketch of these checks is shown after this list).
  • Transformed the pre-trained matrix from float64 to float32.
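
For reference, this is roughly the kind of sanity check and normalization I mean; wl and tl are the pre-trained matrices returned by load_vector_lookups() above:

import numpy as np

wl, tl = load_vector_lookups()  # pre-trained embedding matrices (loader defined elsewhere)

for name, matrix in [("wl", wl), ("tl", tl)]:
    print(name, "contains NaN:", np.isnan(matrix).any())
    print(name, "contains inf:", np.isinf(matrix).any())

# L2-normalize every row; rows with (near-)zero norm are left untouched so the
# normalization itself does not divide by zero.
norms = np.linalg.norm(wl, axis=1, keepdims=True)
wl = np.where(norms > 1e-8, wl / np.maximum(norms, 1e-8), wl).astype(np.float32)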

Do you see anything strange in the architecture of the model? If not: I am unable to find a way to debug the architecture in order to understand why the weights keep getting smaller and smaller until they reach NaN. Are there some steps people follow when they notice this kind of behaviour?
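
One generic way to watch for this (not specific to this model) would be to log the weight norms every few batches and stop as soon as a NaN shows up. A minimal sketch with standard Keras callbacks, reusing the model and X_train built above:

import numpy as np
from keras import backend as K
from keras.callbacks import LambdaCallback, TerminateOnNaN

def log_weight_norms(batch, logs):
    # Print the L2 norm of every trainable weight so shrinking weights become visible early.
    if batch % 10 == 0:
        for w in model.trainable_weights:
            values = K.get_value(w)
            print(w.name, "norm:", np.linalg.norm(values), "has NaN:", np.isnan(values).any())

model.fit(x=X_train,
          y=np.ones(len(x_train)),
          batch_size=10,
          epochs=10,
          callbacks=[LambdaCallback(on_batch_end=log_weight_norms), TerminateOnNaN()])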

Edit:

By using trainable=False in the Embeddings this behaviour of nan weights is NOT observed, and the training seems to run smoothly. However, I want the embeddings to be trainable. So why does this behaviour appear when the embeddings are trainable?

Edit 2:

Using trainable=True and initializing the weights uniformly at random with embeddings_initializer='uniform', the training is smooth. So the cause is my word embeddings. I have checked my pre-trained word embeddings and there are no NaN values. I have also normalized them in case that was causing it, but no luck. I cannot think of anything else that would explain why these specific weights give this behaviour.

Edit 3:

It seems that what was causing this is that a lot of rows in one of the embeddings trained with gensim were all zeros. For example:

[0.2, 0.1, .. 0.3],
[0.0, 0.0, .. 0.0],
[0.0, 0.0, .. 0.0],
[0.0, 0.0, .. 0.0],
[0.2, 0.1, .. 0.1]

It was not so easy to find this because the dimensionality of the embeddings is really big.
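
In case it helps someone with the same symptom, a quick way to spot such rows in a large pre-trained matrix (embedding_matrix stands for the gensim-exported array):

import numpy as np

# Indexes of rows where every entry is exactly zero.
zero_rows = np.where(~embedding_matrix.any(axis=1))[0]
print("number of all-zero rows:", len(zero_rows))
print("first few row indexes:", zero_rows[:20])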

I am leaving this question open in case someone comes across something similar or wants to answer the question asked above: "Are there some steps people follow when they notice this kind of behaviour?"

By your edits, it got a little easier to find the problem.

Those zeros passed unchanged to the warp_loss function. The part that went through the convolution remained unchanged at first, because any filter multiplied by zero results in zero, and the default bias initializer is also 'zeros'. The same idea applies to the Dense layer (filters * 0 = 0 and bias initializer = 'zeros').

That reached this line: return numerator / denominator and caused a division by zero.

It's a common practice I've seen in a lot of code to add K.epsilon() to avoid this:

return numerator / (denominator + K.epsilon())
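
A tiny NumPy illustration of why an all-zero embedding row makes this blow up, and how the epsilon in the denominator avoids it (the values are made up):

import numpy as np

epsilon = 1e-7                           # analogous to K.epsilon()
z = np.array([0.2, 0.1, 0.4])            # predicted vector
entity = np.zeros(3)                     # an all-zero embedding row

numerator = np.sum(z * entity)                               # 0.0
denominator = np.linalg.norm(z) * np.linalg.norm(entity)     # 0.0

with np.errstate(invalid="ignore"):
    print(numerator / denominator)               # nan -> propagates into the gradients
print(numerator / (denominator + epsilon))       # 0.0 -> well-defined similarity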
