How does an LSTM deal with variable-length sequences?
I found the following piece of code in Chapter 7, Section 1 of Deep Learning with Python:
from keras.models import Model
from keras import layers
from keras import Input
text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500
# Our text input is a variable-length sequence of integers.
# Note that we can optionally name our inputs!
text_input = Input(shape=(None,), dtype='int32', name='text')
# Which we embed into a sequence of vectors of size 64
embedded_text = layers.Embedding(text_vocabulary_size, 64)(text_input)
# Which we encode into a single vector via an LSTM
encoded_text = layers.LSTM(32)(embedded_text)
# Same process (with different layer instances) for the question
question_input = Input(shape=(None,), dtype='int32', name='question')
embedded_question = layers.Embedding(question_vocabulary_size, 32)(question_input)
encoded_question = layers.LSTM(16)(embedded_question)
# We then concatenate the encoded question and encoded text
concatenated = layers.concatenate([encoded_text, encoded_question], axis=-1)
# And we add a softmax classifier on top
answer = layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated)
# At model instantiation, we specify the two inputs and the output:
model = Model([text_input, question_input], answer)
model.compile(optimizer='rmsprop',
loss='categorical_crossentropy',
metrics=['acc'])
As you can see, this model's inputs don't carry the raw data's shape information, so after the Embedding layer, the input to the LSTM (i.e. the output of the Embedding) is a variable-length sequence.
So I want to know:
- How does the LSTM deal with its units when the input length varies?
- How do I train and predict on variable-length sequences in practice?
Additional information: to explain what lstm_unit is (I don't know what to call it, so I'll just show an image):
[image: diagram of an LSTM unit]
The provided recurrent layers inherit from a base implementation, keras.layers.Recurrent, which includes the option return_sequences, which defaults to False. What this means is that by default, recurrent layers will consume variable-length inputs and ultimately produce only the layer's output at the final sequential step. As a result, there is no problem using None to specify a variable-length input sequence dimension.
However, if you wanted the layer to return the full sequence of outputs, i.e. the tensor of outputs for each step of the input sequence, then you'd have to further deal with the variable size of that output.
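For example, here is a quick sketch of the two modes (the variable names are just illustrative; the shapes in the comments follow from the Keras defaults):

from keras import layers, Input

x = Input(shape=(None,), dtype='int32')                # variable-length token ids
e = layers.Embedding(10000, 64)(x)                     # shape: (batch, None, 64)
last_step = layers.LSTM(32)(e)                         # return_sequences=False (default): (batch, 32)
all_steps = layers.LSTM(32, return_sequences=True)(e)  # full sequence: (batch, None, 32)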
You could do this by having the next layer accept a variable-sized input as well, and punt on the problem until later in your network, when eventually you either must calculate a loss function from some variable-length thing, or else calculate some fixed-length representation before continuing on to later layers, depending on your model.
Or you could do it by requiring fixed-length sequences, possibly padding the end of the sequences with special sentinel values that merely indicate an empty sequence item, purely to pad out the length.
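As a sketch of the padding approach (the toy data here is made up; pad_sequences is the standard Keras helper):

from keras.preprocessing.sequence import pad_sequences

sequences = [[34, 27, 5], [12, 7], [91, 3, 3, 8]]  # toy variable-length data
padded = pad_sequences(sequences, maxlen=4, padding='post', value=0)
# array([[34, 27,  5,  0],
#        [12,  7,  0,  0],
#        [91,  3,  3,  8]])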
Separately, the Embedding layer is a very special layer that is built to handle variable-length inputs as well. The output will have a different embedding vector for each token of the input sequence, so the shape will be (batch size, sequence length, embedding dimension). Since the next layer is an LSTM, this is no problem ... it will happily consume variable-length sequences as well.
But as is mentioned in the documentation on Embedding:
input_length: Length of input sequences, when it is constant.
This argument is required if you are going to connect
`Flatten` then `Dense` layers upstream
(without it, the shape of the dense outputs cannot be computed).
If you want to go directly from Embedding to a non-variable-length representation, then you must supply the fixed sequence length as part of the layer.
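For instance, a minimal sketch of that fixed-length route (the sizes here are arbitrary):

from keras.models import Model
from keras import layers, Input

x = Input(shape=(20,), dtype='int32')                # fixed length of 20 tokens
e = layers.Embedding(10000, 64, input_length=20)(x)  # shape: (batch, 20, 64)
f = layers.Flatten()(e)                              # only possible because input_length is known
out = layers.Dense(10, activation='softmax')(f)
model = Model(x, out)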
Finally, note that when you express the dimensionality of the LSTM layer, such as LSTM(32), you are describing the dimensionality of the output space of that layer.
# example sequence of input, e.g. batch size is 1.
[
[34],
[27],
...
]
--> # feed into embedding layer
[
[64-d representation of token 34 ...],
[64-d representation of token 27 ...],
...
]
--> # feed into LSTM layer
[32-d output vector of the final sequence step of LSTM]
In order to avoid the inefficiency of a batch size of 1, one tactic is to sort your input training data by the sequence length of each example, and then group examples with a common sequence length into batches, such as with a custom Keras DataGenerator.
This has the advantage of allowing large batch sizes, especially if your model may need something like batch normalization or involves GPU-intensive training, and even just for the benefit of a less noisy estimate of the gradient for batch updates. But it still lets you work with an input training data set that has different sequence lengths for different examples.
More importantly though, it also has the big advantage that you do not have to manage any padding to ensure common sequence lengths in the input.
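A minimal sketch of that bucketing tactic (the helper below is hypothetical, not a Keras API; it assumes the sequences are lists of integers and the labels align with them):

import numpy as np
from collections import defaultdict

def batches_by_length(sequences, labels, batch_size=32):
    # Hypothetical helper: bucket examples by sequence length, then chunk each bucket.
    buckets = defaultdict(list)
    for seq, lab in zip(sequences, labels):
        buckets[len(seq)].append((seq, lab))
    for same_length in buckets.values():
        for i in range(0, len(same_length), batch_size):
            chunk = same_length[i:i + batch_size]
            x = np.array([s for s, _ in chunk])  # all rows share one length: no padding needed
            y = np.array([l for _, l in chunk])
            yield x, y

# e.g. inside a manual training loop:
# for x, y in batches_by_length(train_seqs, train_labels):
#     model.train_on_batch(x, y)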
How does it deal with units?
Units are totally independent of length, so there is nothing special being done. Length only increases the number of "recurrent steps", but the recurrent steps use the same cells over and over. The number of cells is fixed and defined by the user:
[image: diagram showing a fixed number of LSTM cells unrolled over time]
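For example (a standalone sketch, not tied to the question's model):

from keras.models import Model
from keras import layers, Input

inp = Input(shape=(None, 64))   # any sequence length, 64 features per step
out = layers.LSTM(32)(inp)      # 32 cells, reused at every recurrent step
model = Model(inp, out)
# Predicting on a (1, 5, 64) batch or a (1, 50, 64) batch both yields shape (1, 32):
# the same cells run, just for a different number of steps.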
How to deal with variable length?
Using train_on_batch and predict_on_batch inside a manual loop is the easiest approach.
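A minimal sketch of that manual loop (model, train_seqs, train_labels, and test_seq are placeholders; each batch holds a single example, so every example keeps its own length):

import numpy as np

for seq, label in zip(train_seqs, train_labels):
    x = np.array(seq)[None, :]    # shape (1, length): a batch of one
    y = np.array(label)[None, ...]
    model.train_on_batch(x, y)

pred = model.predict_on_batch(np.array(test_seq)[None, :])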
Alternatively, pad the sequences and set mask_zero=True in the embedding layers, so that the padded zeros are ignored by the layers downstream.
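A sketch of the masking route (note that index 0 must be reserved for padding, so real tokens start at 1):

from keras.models import Model
from keras import layers, Input
from keras.preprocessing.sequence import pad_sequences

inp = Input(shape=(None,), dtype='int32')
emb = layers.Embedding(10000, 64, mask_zero=True)(inp)  # timesteps with value 0 are masked
out = layers.LSTM(32)(emb)                              # the LSTM skips the masked steps
model = Model(inp, out)

x = pad_sequences([[34, 27, 5], [12, 7]], padding='post')  # zeros pad the shorter row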