

Sequence to sequence learning for language translation, what about unseen words?

Sequence to sequence learning is a powerful mechanism for language translation, especially when used locally in a context-specific case.

I am following this PyTorch tutorial for the task.

However, the tutorial does not split the data into training and testing sets. You might think it's not a big deal: just split it up, use one chunk for training and the other for testing. But it is not that simple.

Essentially, the tutorial creates the indices of the seen words while loading the dataset. The indices are simply stored in a dictionary. This happens before anything reaches the encoder RNN; it is just a simple conversion task from words to numbers.
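For reference, the indexing step in the tutorial looks roughly like this (a simplified sketch of its `Lang` class; `SOS`/`EOS` are the tutorial's start- and end-of-sentence tokens):

```python
class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}               # word -> integer index
        self.word2count = {}               # word -> occurrence count
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2                   # count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        # Assign the next free index to any word not seen before
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
```

Every word gets an index only if it appears while the data is being loaded, which is exactly why a random split causes trouble.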

If the data is split up at random, one of the keywords may not appear in any sentence from the training set, and so may not have an index at all. If it shows up at test time, what should be done?

Extend the dictionary?

The performance of sequence to sequence models depends strongly on the number of unique words in the vocabulary. Each unique word has to be encountered a number of times in the training set so that the model can learn its correct usage. Words that appear only a few times cannot be used by the model, as it cannot learn enough information about them. In practice, the size of the dictionary is usually reduced by replacing the rare words with a special "UNK" token. Therefore, if a new word occurs during testing, it can be assumed to be rare (since it never appeared in the training set) and replaced with "UNK".
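A minimal sketch of this idea, assuming tokenized sentences and a hypothetical frequency cutoff `min_count` (the names `build_vocab` and `encode` are illustrative, not part of the tutorial):

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Build a word-to-index map; rare words share a single UNK index."""
    counts = Counter(word for sent in sentences for word in sent)
    word2index = {"SOS": 0, "EOS": 1, "UNK": 2}
    for word, count in counts.items():
        if count >= min_count:          # keep only words seen often enough
            word2index[word] = len(word2index)
    return word2index

def encode(sentence, word2index):
    """Convert words to indices; anything unseen or rare falls back to UNK."""
    unk = word2index["UNK"]
    return [word2index.get(word, unk) for word in sentence]

# Build the vocabulary from the *training* split only, then encode both
# splits with it, so test-time unseen words become UNK automatically.
train = [["i", "am", "cold"], ["i", "am", "very", "cold"], ["i", "am", "happy"]]
vocab = build_vocab(train, min_count=2)
print(encode(["i", "am", "freezing"], vocab))  # "freezing" -> UNK index
```

Since the model sees UNK many times during training (once for every rare word), it learns a sensible representation for it, and the same fallback handles genuinely unseen test words for free.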
