Keras: using mask_zero with padded sequences versus single-sequence non-padded training
I'm building an LSTM model in Keras to classify entities in sentences. I'm experimenting with two approaches: zero-padded sequences combined with the mask_zero parameter, or a generator that trains the model on one sentence (or batches of same-length sentences) at a time so that no padding is needed.
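To make the first approach concrete, here is a minimal sketch of what zero padding does (in practice you would use Keras's `pad_sequences` helper rather than this hand-rolled version). Note that word indices start at 1 so that 0 stays reserved for padding, which is what `mask_zero` later tells Keras to skip; the sentences below are hypothetical token-index lists:

```python
import numpy as np

# Hypothetical tokenized sentences of different lengths; index 0 is reserved
# for padding so mask_zero can later identify the padded positions.
sentences = [[4, 10, 2], [7, 1], [3, 8, 6, 5, 9]]
max_len = max(len(s) for s in sentences)

# Right-pad every sentence with zeros up to the length of the longest one.
padded_x = np.zeros((len(sentences), max_len), dtype=int)
for row, seq in enumerate(sentences):
    padded_x[row, :len(seq)] = seq
```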
If I define my model like this:
model = Sequential()
model.add(Embedding(input_dim=vocab_size+1, output_dim=200, mask_zero=True,
                    weights=[pretrained_weights], trainable=True))
model.add(Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(target_size, activation='softmax')))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Can I expect the padded sequences with the mask_zero parameter to perform similarly to feeding the model non-padded sequences one sentence at a time? Essentially:
model.fit(padded_x, padded_y, batch_size=128, epochs=n_epochs,
          validation_split=0.1, verbose=1)
or
def iter_sentences():
    while True:
        for i in range(len(train_x)):
            yield np.array([train_x[i]]), to_categorical([train_y[i]], num_classes=target_size)

model.fit_generator(iter_sentences(), steps_per_epoch=less_steps, epochs=way_more_epochs, verbose=1)
I'm just not sure whether one method is generally preferred over the other, or what exact effect the mask_zero parameter has on the model.
Note: there are slight differences in the model initialization depending on which training method I use; I've left those out for brevity.
The biggest differences will be performance and training stability; otherwise, padding and then masking is equivalent to processing one sentence at a time.
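Conceptually, `mask_zero=True` makes the Embedding layer emit a boolean mask that is True wherever the input index is non-zero; mask-aware downstream layers (like LSTM) skip the masked timesteps, and they are excluded from the loss. A rough sketch of that idea, with hypothetical per-timestep loss values:

```python
import numpy as np

# Two padded sentences; zeros are padding positions.
batch = np.array([[4, 10, 2, 0, 0],
                  [7, 1, 0, 0, 0]])

# This is the mask the Embedding layer derives: True for real tokens only.
mask = batch != 0

# Hypothetical per-timestep losses; the large 9.9 values sit on padding
# positions and contribute nothing once the mask is applied.
losses = np.array([[0.2, 0.1, 0.4, 9.9, 9.9],
                   [0.3, 0.5, 9.9, 9.9, 9.9]])
masked_mean = losses[mask].mean()  # averaged over real tokens only
```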
So to answer the question: no, you can't expect them to perform similarly.
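As a middle ground between the two approaches, the batching of same-length sentences that the question mentions can be sketched by bucketing the training set by length, so each batch shares one length and needs no padding at all (`train_x` here is a hypothetical list of tokenized sentences):

```python
from collections import defaultdict

def length_buckets(sequences):
    """Group sequences by length so each bucket can form a padding-free batch."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)
    return buckets

# Hypothetical tokenized sentences: three of length 2, one of length 3.
train_x = [[1, 2], [3, 4], [5, 6, 7], [8, 9]]
buckets = length_buckets(train_x)
```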