
Pre-trained Word2Vec with LSTM: predict the next word in a sentence

I have a corpus of text. As a preprocessing step, I vectorized all of the text using gensim Word2Vec. I don't understand what exactly I am doing wrong. As a base I took this discussion (and good tutorial): predict next word. Code: Source code.

As input I have lines of sentences. I want to take each line and use word[0] of that line to predict word[1], then use word[0] and word[1] to predict word[2], and so on to the end of the line.
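To make the intended scheme concrete, here is a minimal sketch of how one sentence breaks down into (prefix, next word) pairs. The helper name prefix_target_pairs is hypothetical and is not part of the tutorial's code; it only illustrates the incremental prediction described above, without any model.

```python
def prefix_target_pairs(sentence):
    """Split a sentence into (prefix words, next word) pairs.

    Hypothetical helper: each pair holds the words seen so far and
    the word that should be predicted from them.
    """
    words = sentence.split()
    # Prefix words[:i] is used to predict words[i], for i = 1 .. len(words) - 1.
    return [(words[:i], words[i]) for i in range(1, len(words))]


pairs = prefix_target_pairs('it is very good')
for prefix, target in pairs:
    print('%s -> %s' % (' '.join(prefix), target))
```

A sentence of n words yields n - 1 pairs, so a one-word line produces nothing to predict.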

The tutorial predicts a fixed number of words each time. What I do:

def on_epoch_end(epoch, _):
    print('\nGenerating text after epoch: %d' % epoch)
    for sentence in inpt:
        word_first=sentence.split()[0]
        sample = generate_next(word_first, len(sentence))
        print('%s... -> %s' % (word_first, sample))

I take the first word and use it to generate all the following ones. As the second parameter I pass the length of the sentence (rather than num_generated=10 as in the tutorial). But it doesn't help at all: every time, the predicted sequence of words has a (seemingly) random length.

What am I doing wrong, and how can I fix it?

My testing script:

texts = [
    'neural network',
    'this',
    'it is very',
]
for text in texts:
    print('%s... -> %s' % (text, generate_next(text, num_generated=5)))

The output:

neural network... -> neural network that making isometry adopted riskaverting
this... -> this dropout formalize locally secondly spectrogram
it is very... -> it is very achievable machinery our past possibly

You can see that the length of each output, in words, is num_generated plus the length of the input.

I guess you are expecting every output to have a length of num_generated. But this is not how generate_next works: the function generates num_generated new words and appends them to the original input.

If you want the output to have a fixed total length (here, 5 words), subtract the number of input words:

generate_next(text, num_generated=5-len(text.split()))
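The same arithmetic can be wrapped in a small helper. This is only a sketch: words_to_generate, complete_to_length and fake_generate_next are hypothetical names, and fake_generate_next is a stand-in that mimics the behaviour seen in the output above (input words plus num_generated new words), not the tutorial's model-backed function. The count is clamped at zero so that an input longer than the target does not produce a negative num_generated.

```python
def words_to_generate(seed_text, total_words):
    """How many new words are needed so that seed + generated has total_words words."""
    return max(total_words - len(seed_text.split()), 0)


def complete_to_length(seed_text, total_words, generate_next):
    """Call generate_next so that the result has total_words words in total."""
    n = words_to_generate(seed_text, total_words)
    if n == 0:
        return seed_text  # the seed is already long enough
    return generate_next(seed_text, num_generated=n)


def fake_generate_next(text, num_generated=10):
    """Stand-in for the real generate_next: appends placeholder words."""
    return ' '.join(text.split() + ['<w>'] * num_generated)


for text in ['neural network', 'this', 'it is very']:
    print('%s... -> %s' % (text, complete_to_length(text, 5, fake_generate_next)))

# Note: len() on a string counts characters, not words.
sentence = 'it is very simple to check'
print(len(sentence))          # 26 characters
print(len(sentence.split()))  # 6 words
```

The last two lines matter for the on_epoch_end callback in the question: len(sentence) there is a character count. Assuming the goal is to reproduce a sentence of the same length from its first word, the second argument would be len(sentence.split()) - 1.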
