Weights of CNN model go to really small values and then NaN
I cannot understand why the weights of the following model keep getting smaller during training until they become NaN.

The model is the following:

# Imports are assumed; the original post did not show them.
import numpy as np
from keras import backend as K
from keras.layers import (Activation, Convolution1D, Dense, Dropout, Embedding,
                          GlobalMaxPooling1D, Input, Lambda, Reshape)
from keras.models import Model
from keras.optimizers import Adam


def initialize_embedding_matrix(embedding_matrix):
    embedding_layer = Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=embedding_matrix.shape[1],
        weights=[embedding_matrix],
        trainable=True)
    return embedding_layer


def get_divisor(x):
    # Euclidean norm along the last axis
    return K.sqrt(K.sum(K.square(x), axis=-1))


def similarity(a, b):
    # Cosine similarity between a and b
    numerator = K.sum(a * b, axis=-1)
    denominator = get_divisor(a) * get_divisor(b)
    denominator = K.maximum(denominator, K.epsilon())
    return numerator / denominator


def max_margin_loss(positive, negative):
    loss_matrix = K.maximum(0.0, 1.0 + negative - Reshape((1,))(positive))
    loss = K.sum(loss_matrix, axis=-1, keepdims=True)
    return loss


def warp_loss(X):
    z, positive_entity, negatives_entities = X
    positiveSim = Lambda(lambda x: similarity(x[0], x[1]), output_shape=(1,), name="positive_sim")([z, positive_entity])
    z_reshaped = Reshape((1, z.shape[1].value))(z)
    negativeSim = Lambda(lambda x: similarity(x[0], x[1]), output_shape=(negatives_entities.shape[1].value, 1,), name="negative_sim")([z_reshaped, negatives_entities])
    loss = Lambda(lambda x: max_margin_loss(x[0], x[1]), output_shape=(1,), name="max_margin")([positiveSim, negativeSim])
    return loss


def mean_loss(y_true, y_pred):
    # The model output already is the loss; y_true is ignored
    return K.mean(y_pred - 0 * y_true)


def build_nn_model():
    wl, tl = load_vector_lookups()
    embedded_layer_1 = initialize_embedding_matrix(wl)
    embedded_layer_2 = initialize_embedding_matrix(tl)
    sequence_input_1 = Input(shape=(_NUMBER_OF_LENGTH,), dtype='int32', name="text")
    sequence_input_positive = Input(shape=(1,), dtype='int32', name="positive")
    sequence_input_negatives = Input(shape=(10,), dtype='int32', name="negatives")
    embedded_sequences_1 = embedded_layer_1(sequence_input_1)
    embedded_sequences_positive = Reshape((tl.shape[1],))(embedded_layer_2(sequence_input_positive))
    embedded_sequences_negatives = embedded_layer_2(sequence_input_negatives)
    conv_step1 = Convolution1D(
        filters=1000,
        kernel_size=5,
        activation="tanh",
        name="conv_layer_mp",
        padding="valid")(embedded_sequences_1)
    conv_step2 = GlobalMaxPooling1D(name="max_pool_mp")(conv_step1)
    conv_step3 = Activation("tanh")(conv_step2)
    conv_step4 = Dropout(0.2, name="dropout_mp")(conv_step3)
    z = Dense(wl.shape[1], name="predicted_vec")(conv_step4)  # activation="linear"
    loss = warp_loss([z, embedded_sequences_positive, embedded_sequences_negatives])
    model = Model(
        inputs=[sequence_input_1, sequence_input_positive, sequence_input_negatives],
        outputs=[loss]
    )
    model.compile(loss=mean_loss, optimizer=Adam())
    return model


model = build_nn_model()
x, y_real, y_fake = load_x_y()
X_train = {
    'text': x_train,
    'positive': y_real_train,
    'negatives': y_fake_train
}
model.fit(x=X_train, y=np.ones(len(x_train)), batch_size=10, shuffle=True, validation_split=0.1, epochs=10)

To describe the model a bit:

- I have two pre-trained embedding matrices (wl, tl), and I initialize the Keras embeddings with these values.
- sequence_input_1 takes integers as input (indexes of words, e.g. [42, 32 .., 4]). sequence.pad_sequences(X, maxlen=_NUMBER_OF_LENGTH) is applied to them to get a fixed length. sequence_input_positive is an integer for the positive output, and sequence_input_negatives holds N random negative outputs (10 in the code above) for each example.
- The loss is computed from the difference between cosinus_similarity(positive_example, sequence_input_1) and cosinus_similarity(negative_example[i], sequence_input_1), and the Adam optimizer is used to minimize it.
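
The max-margin loss described above can be sketched in plain NumPy to make the arithmetic explicit. This is only an illustration of the formula used in max_margin_loss, not the Keras code itself; the sample similarity values are made up.

```python
import numpy as np


def max_margin_loss(positive_sim, negative_sims, margin=1.0):
    """Hinge loss summed over negatives: sum_i max(0, margin + neg_i - pos)."""
    # positive_sim has shape (batch,), negative_sims has shape (batch, n_negatives)
    hinge = np.maximum(0.0, margin + negative_sims - positive_sim[:, None])
    return np.sum(hinge, axis=-1)


# Made-up cosine similarities for a single example with three negatives
positive_sim = np.array([0.9])
negative_sims = np.array([[0.1, 0.95, -0.3]])

loss = max_margin_loss(positive_sim, negative_sims)
print(loss)
```

Each negative contributes only when its similarity comes within the margin of the positive similarity, so the third negative above contributes nothing.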

While training this model, even with only 20 data points, the weights in the Convolution1D and Dense layers go to NaN. If I add more data points, the embedding weights go to NaN too. I can observe that as the model runs the weights get smaller and smaller until they become NaN. It is also noticeable that the loss does not go to NaN: when the weights reach NaN, the loss goes to zero.

I am unable to find what is going wrong.

This is what I tried until now:

- The SGD optimizer didn't change anything in the behavior here.
- Checking for nan values.
- Normalizing the input matrices (pre-trained data) for the embeddings with np.linalg.norm.
- Converting from float64 to float32.

Do you see anything strange in the architecture of the model? If not: I am unable to find a way to debug the architecture in order to understand why the weights get smaller and smaller until they reach NaN. Are there steps people use when they notice this kind of behaviour?

Edit:

By using trainable=False in the Embeddings, this behaviour of nan weights is NOT observed, and the training seems to have smooth results. However, I want the embeddings to be trainable. So why does this happen when the embeddings are trainable?

Edit2:

Using trainable=True and uniformly randomly initializing the weights with embeddings_initializer='uniform', the training is smooth. So the cause is my word embeddings. I have checked my pre-trained word embeddings and there are no NaN values. I have also normalized them in case this was causing it, but no luck. I can't think of any other reason why these specific weights give this behaviour.

Edit3:

It seems that what caused this was that a lot of rows in one of the embeddings trained in gensim were all zeros. For example:

[0.2, 0.1, .. 0.3],
[0.0, 0.0, .. 0.0],
[0.0, 0.0, .. 0.0],
[0.0, 0.0, .. 0.0],
[0.2, 0.1, .. 0.1]

It was not easy to find, because the dimension of the embeddings was really big.
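
A check along these lines can surface such rows quickly even in a large matrix. It is a minimal sketch that assumes the embedding matrix is a 2-D NumPy array; the helper name find_zero_rows and the sample matrix are hypothetical.

```python
import numpy as np


def find_zero_rows(embedding_matrix):
    """Return the indices of rows whose entries are all exactly zero."""
    # any(axis=1) is False only for rows with no non-zero entry
    return np.where(~embedding_matrix.any(axis=1))[0]


# Small stand-in for a pre-trained embedding matrix
embedding_matrix = np.array([
    [0.2, 0.1, 0.3],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.2, 0.1, 0.1],
])

zero_rows = find_zero_rows(embedding_matrix)
all_finite = bool(np.isfinite(embedding_matrix).all())
print(zero_rows, all_finite)
```

Running the same check on both lookup matrices before building the Embedding layers would show how many indices map to an all-zero vector.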
I am leaving this question open in case someone runs into something similar or wants to answer the question asked above about the steps people use when they notice this kind of behaviour.

With your edits, it became a little easier to find the problem.

Those zeros passed unchanged to the warp_loss function. The part that went through the convolution remained unchanged at first, because any filter multiplied by zero results in zero, and the default bias initializer is also 'zeros'. The same idea applies to the dense layer (filters * 0 = 0 and bias initializer = 'zeros').
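
The point about zeros surviving a layer can be illustrated with a small NumPy sketch. It stands in for a dense layer with a zero-initialized bias followed by tanh; the shapes and random weights are arbitrary choices for the example, not values from the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.zeros((1, 4))               # an all-zero input row
kernel = rng.normal(size=(4, 3))   # arbitrary non-zero weights
bias = np.zeros(3)                 # mirrors a 'zeros' bias initializer

# Dense layer followed by tanh: zero input times any kernel stays zero
out = np.tanh(x @ kernel + bias)
print(out)
```

Because tanh(0) is 0, the activation does not change this either; the output only moves away from zero once the bias has been updated.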

That reached the line return numerator / denominator and caused an error (division by zero).

A common practice I've seen in a lot of code is to add K.epsilon() to avoid this:

return numerator / (denominator + K.epsilon())
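
As a rough illustration of the effect of the added epsilon, here is a NumPy analogue of the similarity function. It is a sketch rather than the Keras code: EPS is an assumed stand-in for K.epsilon(), and the vectors are made up.

```python
import numpy as np

EPS = 1e-7  # assumption: same order of magnitude as the default K.epsilon()


def norm(x):
    """Euclidean norm along the last axis (NumPy analogue of get_divisor)."""
    return np.sqrt(np.sum(np.square(x), axis=-1))


def similarity(a, b, eps=EPS):
    """Cosine similarity with eps added to the denominator."""
    numerator = np.sum(a * b, axis=-1)
    denominator = norm(a) * norm(b)
    return numerator / (denominator + eps)


z = np.array([[0.3, -0.2, 0.5]])
zero_row = np.zeros((1, 3))

sim_zero = similarity(z, zero_row)  # against an all-zero embedding row
sim_self = similarity(z, z)         # against itself, close to 1

print(sim_zero, sim_self)
```

This only covers the forward pass. The code in the question already clamps the denominator with K.maximum, so it may be worth checking the backward pass as well: the derivative of a square root at exactly zero is infinite, and an upstream gradient of zero multiplied by infinity gives NaN. One possible variant, offered as a suggestion rather than a confirmed fix, is to move the epsilon inside the square root, e.g. K.sqrt(K.sum(K.square(x), axis=-1) + K.epsilon()).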