Size of input and output layers in Keras implementation of an RNN Language Model

As part of my thesis, I am trying to build a recurrent neural network language model.

From theory, I know that the input layer should be a one-hot vector layer with as many neurons as there are words in our vocabulary, followed by an embedding layer, which in Keras apparently translates to a single Embedding layer in a Sequential model. I also know that the output layer should be the size of our vocabulary, so that each output value maps one-to-one to a vocabulary word.
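For reference, here is a minimal sketch of that textbook architecture (my own illustration, assuming tensorflow.keras; the vocabulary size, embedding dimension and layer widths are placeholder values):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 10000   # number of words in the vocabulary (placeholder)

model = Sequential([
    # The Embedding layer stands in for the explicit one-hot input layer:
    # it maps each integer word index to a dense vector.
    Embedding(input_dim=vocab_size, output_dim=100),
    LSTM(128),
    # One output unit per vocabulary word, softmax over the whole vocabulary.
    Dense(vocab_size, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```

This is the version without any +1; the question below is whether input_dim and the Dense layer size should instead be vocab_size + 1.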

However, in both the Keras documentation for the Embedding layer ( https://keras.io/layers/embeddings/ ) and in this article ( https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/#comment-533252 ), the vocabulary size is arbitrarily augmented by one for both the input and the output layers. Jason gives an explanation that this is due to the implementation of the Embedding layer in Keras, but that does not explain why we would also use one extra neuron in the output layer. I am at the point of wanting to order the possible next words based on their probabilities, and I end up with one probability too many that I do not know which word to map to.
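To make the mapping problem concrete, this is roughly the ranking step I am stuck on (illustrative values only; index_word stands in for tokenizer.index_word):

```python
import numpy as np

# Softmax output of the model for one input sequence: vocab_size + 1 values,
# here with vocab_size = 4 (illustrative numbers only).
probs = np.array([0.01, 0.50, 0.30, 0.15, 0.04])

# Maps word indices back to words, as produced by Keras's Tokenizer.
index_word = {1: 'the', 2: 'sat', 3: 'cat', 4: 'dog'}

for idx in np.argsort(probs)[::-1]:   # indices sorted by probability, descending
    print(idx, index_word.get(idx), probs[idx])
    # index_word.get(0) returns None: this is the probability that has
    # no vocabulary word to map to.
```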

Does anyone know the correct way of achieving the desired result? Did Jason simply forget to subtract one from the output layer, while the Embedding layer needs the +1 only for implementation reasons (I mean, it is stated in the official API)?

Any help on the subject would be appreciated (why is the Keras API documentation so laconic?).

Edit:

This post, "Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2?", made me think that Jason does in fact have it wrong and that the vocabulary size should not be incremented by one when our word indices are 0, 1, ..., n-1.

However, when using Keras's Tokenizer, our word indices are 1, 2, ..., n (see the sketch after the list below). In this case, which is the correct approach:

  1. Set mask_zero=True, to treat 0 differently, as a 0 (integer) index is never input to the Embedding layer, and keep the vocabulary size equal to the number of vocabulary words (n)?

  2. Set mask_zero=True but augment the vocabulary size by one?

  3. Not set mask_zero=True and keep the vocabulary size equal to the number of vocabulary words?
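Here is the small sketch referred to above (my own illustration, not from the linked posts, assuming the Tokenizer from tensorflow.keras.preprocessing.text in tf.keras 2.x). It only demonstrates the index ranges; it does not settle the mask_zero question:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["the cat sat", "the dog sat"])   # toy corpus

print(tokenizer.word_index)     # e.g. {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4}
n = len(tokenizer.word_index)   # n vocabulary words, indices 1 .. n; 0 is never assigned

# Per the Keras documentation, input_dim is "maximum integer index + 1":
# the layer is a lookup table over indices 0 .. input_dim - 1, so accepting
# index n requires input_dim = n + 1.
embedding = Embedding(input_dim=n + 1, output_dim=8)

# Had the indices been 0, 1, ..., n-1 instead, input_dim = n would suffice,
# which is the point made in the post linked in the edit above.
```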

Answer:

The reason we add +1 is that during testing, or in production, we may encounter an unseen word (one that is out of our vocabulary). It is common to reserve a generic token for these unknown words, which is why we add an OOV word up front that stands in for all out-of-vocabulary words. Check this issue on GitHub, which explains it in detail:

https://github.com/keras-team/keras/issues/3110#issuecomment-345153450
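As a small sketch of that point (my own illustration, again assuming the legacy Tokenizer from tf.keras 2.x): passing an oov_token reserves one extra index for unknown words, which is where the extra vocabulary slot goes, and unseen words at prediction time map to it instead of being silently dropped.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(oov_token="<unk>")          # "<unk>" is an arbitrary choice
tokenizer.fit_on_texts(["the cat sat", "the dog sat"])

print(tokenizer.word_index)
# e.g. {'<unk>': 1, 'the': 2, 'sat': 3, 'cat': 4, 'dog': 5}
# -> 4 real words plus 1 OOV token, hence a vocabulary size of n + 1.

# A word never seen during fitting maps to the <unk> index instead of
# being dropped from the sequence:
print(tokenizer.texts_to_sequences(["the bird sat"]))
# e.g. [[2, 1, 3]]
```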
