
pre-trained Word2Vec with LSTM, predict next word in sentence

I have a corpus of text. As a preprocessing step I vectorized all of the text using gensim Word2Vec. I don't understand what exactly I'm doing wrong. As a starting point I used this discussion (and good tutorial): predict next word. Code: Source code.
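Roughly, the vectorization step looks like this (a minimal sketch with made-up toy sentences; parameter names follow gensim 4.x, where older versions use size instead of vector_size):

from gensim.models import Word2Vec

# toy stand-in for the real corpus: a list of token lists, one per line
sentences = [['neural', 'network'], ['it', 'is', 'very', 'achievable']]
w2v = Word2Vec(sentences, vector_size=100, min_count=1)
weights = w2v.wv.vectors  # embedding matrix, e.g. for a Keras Embedding layer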

As input I have lines of sentences. I want to take each line, take word[0] of that line -> predict word[1]. Then, using word[0] and word[1], predict word[2], and so on to the end of the line.
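In other words, the loop I am trying to get is roughly this (a sketch only; predict_word is a hypothetical stand-in for the trained model):

def predict_word(context_words):
    # hypothetical stand-in for the trained LSTM: a real model would map
    # the context to a probability distribution over the vocabulary
    return '<predicted>'

sentence = 'it is very achievable'
words = sentence.split()
context = [words[0]]                 # start from word[0]
for _ in range(len(words) - 1):      # predict up to the end of the line
    context.append(predict_word(context))
print(context)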

In that tutorial, a fixed number of words is predicted each time. What I do instead:

def on_epoch_end(epoch, _):
    print('\nGenerating text after epoch: %d' % epoch)
    for sentence in inpt:  # inpt holds the input lines
        word_first = sentence.split()[0]
        # NOTE: len(sentence) is the sentence's length in characters, not in words
        sample = generate_next(word_first, len(sentence))
        print('%s... -> %s' % (word_first, sample))

I take the first word and use it to generate all the following ones. As the second parameter I pass the length of the sentence (not num_generated=10 as in the tutorial). But this doesn't help at all: every time I get a predicted sequence of words with what looks to me like a random length.

What am I doing wrong, and how can I fix it?

My testing script:

texts = [
    'neural network',
    'this',
    'it is very',
]
for text in texts:
    print('%s... -> %s' % (text, generate_next(text, num_generated=5)))

The output:

neural network... -> neural network that making isometry adopted riskaverting
this... -> this dropout formalize locally secondly spectrogram
it is very... -> it is very achievable machinery our past possibly

You can see that the output's length is num_generated plus the input's length.

I guess you are expecting all outputs to have length num_generated. But that is not how generate_next works: the function generates num_generated words and appends them to the original input.
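A toy stand-in makes this easy to check (a sketch only; it picks random words instead of querying the real LSTM, and the vocabulary here is made up):

import random

VOCAB = ['dropout', 'formalize', 'locally', 'secondly', 'spectrogram']

def generate_next(text, num_generated=10):
    # toy stand-in: the real function predicts with the LSTM,
    # but the length behavior is the same
    words = text.split()
    for _ in range(num_generated):          # exactly num_generated new words...
        words.append(random.choice(VOCAB))  # ...appended to the original input
    return ' '.join(words)

print(generate_next('it is very', num_generated=5))  # prints 3 + 5 = 8 words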

If you want the output to have a fixed length, try:

generate_next(text, num_generated=5-len(text.split()))
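With the toy generate_next above, for example, each line then comes out to exactly 5 words:

for text in texts:
    n = 5 - len(text.split())  # words still needed to reach 5 in total
    print('%s... -> %s' % (text, generate_next(text, num_generated=n)))
# every printed line now has exactly 5 words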

