Determine most similar phrase with word2vec

I wrote a Python script that trains a doc2vec model and infers vectors for test documents.

My problem: when I try to find the most similar phrase for a given phrase, for example "the world", I only get a list of the most similar words. It doesn't show a list of the most similar phrases.

Am I missing something in my code?

#python example to infer document vectors from trained doc2vec model
import gensim.models as g
import codecs

#parameters
model="toy_data/model.bin"
test_docs="toy_data/test_docs.txt"
output_file="toy_data/test_vectors.txt"

#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000

#load model
m = g.Doc2Vec.load(model)
test_docs = [ x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines() ]

#infer test vectors
output = open(output_file, "w")
for d in test_docs:
    output.write( " ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n" )
output.flush()
output.close()


m.most_similar('the word'.split())

I get this list:

[('refutations', 0.9990279078483582),
 ('volume', 0.9989271759986877),
 ('italic', 0.9988381266593933),
 ('syllogisms', 0.998751699924469),
 ('power', 0.9987285137176514),
 ('alibamu', 0.9985184669494629),
 ("''", 0.99847412109375),
 ('roman', 0.9984466433525085),
 ('soil', 0.9984269738197327),
 ('plants', 0.9984176754951477)]

The Doc2Vec model stores its doc-vectors, for later lookup or search, in the property .docvecs. To get doc-vector results, call most_similar() on that property. If your Doc2Vec instance is held in a variable d2v_model, and doc_id holds one of the doc-tags known from training, that might be:

d2v_model.docvecs.most_similar(doc_id)

If you were inferring a vector for a new document, and looking up training docs similar to that inferred vector, your code might be like:

new_dv = d2v_model.infer_vector('some new document'.split())
d2v_model.docvecs.most_similar(positive=[new_dv])
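
Conceptually, that lookup ranks the stored doc-vectors by cosine similarity to the query vector. The following is a minimal pure-Python sketch of the idea, not gensim's actual implementation; the helper most_similar_docs, the doc-tags, and the toy 3-dimensional vectors are all hypothetical and stand in for a trained model's data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_docs(query_vec, doc_vectors, topn=10):
    """Return (tag, similarity) pairs, sorted from most to least similar.

    doc_vectors maps a doc-tag to its vector (hypothetical stand-in
    for the doc-vectors a trained model would hold).
    """
    scored = [(tag, cosine(query_vec, vec)) for tag, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy data: made-up doc-vectors keyed by made-up doc-tags.
doc_vectors = {
    "doc_0": [1.0, 0.0, 0.0],
    "doc_1": [0.0, 1.0, 0.0],
    "doc_2": [0.7, 0.7, 0.0],
}

# Stand-in for a vector returned by infer_vector() for a new document.
new_dv = [0.9, 0.1, 0.0]

ranked = most_similar_docs(new_dv, doc_vectors, topn=2)
print(ranked)
```

The result has the same shape as the list in the question, but the entries are doc-tags rather than words, because the search runs over doc-vectors.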

(The Doc2Vec model class is derived from the very similar Word2Vec class, and thus inherits a most_similar() that by default consults only the internal word-vectors. Depending on the Doc2Vec mode, those word-vectors may be useful or may just be random, so it's best to call either d2v_model.wv.most_similar() or d2v_model.docvecs.most_similar() explicitly, to be clear about which set of vectors you are searching.)

Basic Doc2Vec examples, such as the doc2vec-lee.ipynb notebook installed with gensim in the docs/notebooks directory, are a useful reference.
