I am running a Doc2Vec model for text similarity, but my code does not produce the expected result.
it = LabeledLineSentence(datafiles, labels1)
model = gensim.models.Doc2Vec(vector_size=300, min_count=0, alpha=0.025, min_alpha=0.025)
model.build_vocab(it)
#training of model
for epoch in range(100):
    print('iteration ' + str(epoch + 1))
    model.train(it, total_examples=model.corpus_count,
                epochs=model.epochs)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
#saving the created model
model.save('doc2vec.model')
print ("model saved")
#loading the model
d2v_model = gensim.models.doc2vec.Doc2Vec.load('doc2vec.model')
#start testing
seed_text = "consider illegal immoral plagiarism do various"
tokens1 = seed_text.lower().split()
vector1 = d2v_model.infer_vector(tokens1)
#to get most similar document with similarity scores using document-index
most_similar = d2v_model.docvecs.most_similar(positive = [vector1] )
# output_sentences(most_similar)
print(u'%s %s: %s\n' % ("Most", most_similar[0][1], data[int(most_similar[0][0])]))
It outputs:
Most 0.14691241085529327: M
Why does it print only M instead of the document text from data? What does that mean, and what can I do to solve the problem? Regards
You're using a version of LabeledLineSentence that doesn't match the code that used to be in Gensim: your version takes an extra labels1 argument. So it's non-standard, and you should show its code, or explain which online example you based your code on. Similarly, it's not clear what the values, or indirect contents, of datafiles and labels1 might be.
The M in the output is the result of your code data[int(most_similar[0][0])]. Your code doesn't show what data is, but perhaps it's a string, and the character M is in whatever position int(most_similar[0][0]) evaluates to.
(The value of most_similar[0][0] should be the document-tag that's most similar to your inferred text-vector, which might be an int or string, depending on how you prepared your training data in the unshown LabeledLineSentence code. There must have been a document in the training set with that as a tag.)
The number 0.14691241085529327 is the amount of similarity. That's not very much, so your probe inferred text isn't very similar to any training document. (Perhaps that's indicative of some other problem.)
Your code also shows a few bad practices:

- Calling train() more than once, and manipulating the non-default min_alpha yourself - see the answer to "My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?" for more details.
- min_count=0 - almost always a bad idea; Doc2Vec and similar algorithms benefit from ignoring rare words.
- vector_size=300 - this would only be appropriate with some very large training corpus, the kind you'd most likely use a much-larger-than-default min_count on, and attempt only after gaining success with smaller experiments.

I suggest you not trust or use whatever online article motivated this code, and instead start from examples inside the Gensim docs, gradually building them towards your need.
Other generic good steps: check the corpus iterable it you've created. For example, if you run:
first_item = next(iter(it))
print('tags: %s\nwords: %s' % (first_item.tags, first_item.words))
Does it print the 1st document you intended to use as training material, with the right words and tags? If not, you've got problems in your data source.