I have a corpus of text. As a preprocessing step, I vectorized all the text using gensim Word2Vec. I don't understand what exactly I'm doing wrong. As a base I used this discussion (and good tutorial): predict next word. Code: Source code.
As input I have lines of sentences. I want to take each line, take word[0] of that line and predict word[1]. Then, using word[0] and word[1], predict word[2], and so on to the end of the line.
In the tutorial, a fixed number of words is predicted each time. Here is what I do:
def on_epoch_end(epoch, _):
    print('\nGenerating text after epoch: %d' % epoch)
    for sentence in inpt:
        word_first = sentence.split()[0]
        # note: len(sentence) is the character count of the line, not the word count
        sample = generate_next(word_first, len(sentence))
        print('%s... -> %s' % (word_first, sample))
I take the first word and use it to generate all the following words. As the second parameter I pass the length of the sentence (not num_generated=10 as in the tutorial). But that doesn't help at all: every time the predicted sequence of words has a (in my opinion) random length.
What am I doing wrong and how do I fix it?
My testing script:
texts = [
    'neural network',
    'this',
    'it is very',
]

for text in texts:
    print('%s... -> %s' % (text, generate_next(text, num_generated=5)))
The output:
neural network... -> neural network that making isometry adopted riskaverting
this... -> this dropout formalize locally secondly spectrogram
it is very... -> it is very achievable machinery our past possibly
You can see that the output's length is num_generated plus the input's length.
I guess you are expecting all outputs to have a length of num_generated. But that is not how generate_next works: the function generates num_generated words and appends them to the original input.
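To make the appending behavior concrete, here is a minimal self-contained sketch of a generate_next with the same shape as the tutorial's (the random word choice is an assumption standing in for the model's actual softmax prediction):

```python
import random

def generate_next(text, num_generated=10):
    """Sketch: start from the seed text, then repeatedly 'predict'
    one more word and append it to the sequence."""
    words = text.split()
    for _ in range(num_generated):
        # Stand-in for sampling the next word from the trained model's
        # predicted probability distribution over the vocabulary.
        words.append(random.choice(['the', 'network', 'data', 'model']))
    return ' '.join(words)
```

Whatever the model predicts, the returned string always contains the seed words plus exactly num_generated new words, which is why the outputs above have varying total lengths.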
If you want the output to have a fixed total length, try:
generate_next(text, num_generated=5-len(text.split()))
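A quick self-contained illustration of why subtracting the seed length fixes the total length (the dummy generate_next here just appends placeholder words, standing in for the real model's predictions):

```python
def generate_next(text, num_generated=10):
    # Dummy stand-in: appends placeholder words to the seed, just as the
    # real function appends model-predicted words to the original input.
    words = text.split()
    words += ['w%d' % i for i in range(num_generated)]
    return ' '.join(words)

target_len = 5
for text in ['neural network', 'this', 'it is very']:
    out = generate_next(text, num_generated=target_len - len(text.split()))
    print('%s... -> %s  (%d words)' % (text, out, len(out.split())))
```

Every output now contains exactly target_len words regardless of how many words the seed text had.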