
Keras: good result with MLP but bad with Bidirectional LSTM

I trained two neural networks with Keras: an MLP and a Bidirectional LSTM.

My task is to predict the word order in a sentence, so for each word the neural network has to output a real number. When a sentence with N words is processed, the N real numbers in the output are ranked in order to obtain integers representing the word positions.
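
That ranking step amounts to a double argsort. A minimal sketch of the idea, assuming the per-word scores for one sentence are collected in a NumPy array (the names are illustrative, not from my actual code):

import numpy as np

# hypothetical scores predicted for a 5-word sentence
scores = np.array([0.7, -1.2, 0.1, 2.3, 0.4])

# rank the real-valued outputs: the smallest score gets position 0,
# the next smallest gets position 1, and so on
positions = np.argsort(np.argsort(scores))
print(positions)  # [3 0 1 4 2]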

I'm using the same dataset and the same preprocessing for both models. The only difference is that for the LSTM I added padding to get sequences of the same length.

In the prediction phase with the LSTM, I exclude the predictions created from the padding vectors, since I masked them in the training phase.
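
As a rough illustration of how I drop the padded positions (a simplified sketch, not my actual code; it assumes an all-zeros feature vector marks padding, matching Masking(mask_value=0.)):

import numpy as np

# hypothetical padded batch: (batch, timesteps, features)
# and the corresponding per-timestep predictions: (batch, timesteps, 1)
padded_x = np.random.rand(2, 6, 4)
padded_x[:, 4:, :] = 0.0                      # last two timesteps are padding
preds = np.random.rand(2, 6, 1)

# a timestep is real if its feature vector is not all zeros
real_mask = np.any(padded_x != 0.0, axis=-1)  # (batch, timesteps)

# keep only the predictions for real timesteps, sentence by sentence
kept = [preds[i, real_mask[i], 0] for i in range(preds.shape[0])]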

MLP architecture:

# import assumed by this snippet (not shown in the question)
from tensorflow import keras

mlp = keras.models.Sequential()

# add input layer
mlp.add(
    keras.layers.Dense(
        units=training_dataset.shape[1],
        input_shape = (training_dataset.shape[1],),
        kernel_initializer=keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None),
        activation='relu')
    )

# add hidden layer
mlp.add(
    keras.layers.Dense(
        units=training_dataset.shape[1] + 10,
        input_shape = (training_dataset.shape[1] + 10,),
        kernel_initializer=keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None),
        bias_initializer='zeros',
        activation='relu')
    )

# add output layer
mlp.add(
    keras.layers.Dense(
        units=1,
        input_shape = (1, ),
        kernel_initializer=keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None),
        bias_initializer='zeros',
        activation='linear')
    )

Bidirectional LSTM architecture:

# imports assumed by this snippet (not shown in the question)
import tensorflow as tf
from tensorflow.keras.layers import Masking, Bidirectional, LSTM, Dropout, Dense

model = tf.keras.Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
# the input_shape here is redundant: the Masking layer already defines it
model.add(Bidirectional(LSTM(units=20, return_sequences=True), input_shape=(timesteps, features)))
model.add(Dropout(0.2))
model.add(Dense(1, activation='linear'))

The task should be much better suited to an LSTM, which ought to capture dependencies between words well.

However, with the MLP I achieve good results, while with the LSTM the results are very bad.

Since I'm a beginner, could someone tell me what is wrong with my LSTM architecture? I'm going out of my mind.

Thanks in advance.

For this problem, I am actually not surprised that the MLP performs better.

The architecture of an LSTM, bidirectional or not, assumes that location is very important to the structure: words next to each other are more likely to be related than words farther away.

But for your problem you have removed that locality and are trying to restore it. For that task, an MLP which has global information can do a better job at the sorting.

That said, I think there is still something to be done to improve the LSTM model.


One thing you can do is ensure that the complexity of each model is similar. You can do this easily with count_params:

mlp.count_params()
model.count_params()

If I had to guess, your LSTM is much smaller. There are only 20 units, which seems small for an NLP problem. I used 512 for a product classification problem to process character-level information (a vocabulary of size 128, an embedding of size 50). Word-level models trained on bigger data sets, like AWD-LSTM, get into the thousands of units.

So you probably want to increase that number. You can get an apples-to-apples comparison between the two models by increasing the number of units in the LSTM until the parameter counts are similar. But you don't have to stop there; you can keep increasing the size until you start to overfit or your training starts taking too long.
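
As a rough sketch of that comparison (assuming mlp, timesteps and features from your question are in scope; the helper name is just illustrative), rebuild the bidirectional model with more and more units until its parameter count reaches the MLP's:

import tensorflow as tf
from tensorflow.keras.layers import Masking, Bidirectional, LSTM, Dropout, Dense

def build_bilstm(units, timesteps, features):
    # rebuild the bidirectional LSTM with a given number of units
    return tf.keras.Sequential([
        Masking(mask_value=0., input_shape=(timesteps, features)),
        Bidirectional(LSTM(units=units, return_sequences=True)),
        Dropout(0.2),
        Dense(1, activation='linear'),
    ])

target = mlp.count_params()              # parameter budget set by the MLP
for units in (20, 64, 128, 256, 512):
    candidate = build_bilstm(units, timesteps, features)
    print(units, candidate.count_params())
    if candidate.count_params() >= target:
        break                            # roughly apples-to-apples now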
