
gensim Doc2Vec word not in vocabulary

I am training a gensim Doc2Vec model on a text file, 'full_texts.txt', that contains ~1600 documents. Once the model is trained, I want to use its similarity methods on words and sentences.

However, since this is my first time using gensim, I can't get it to work. When I look up similar words as shown below, I get an error saying the word doesn't exist in the vocabulary. My other question is: how do I check similarity between entire documents? I have read many related questions, like this one, and looked through the documentation, but I'm still not sure what I am doing wrong.

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedLineDocument

# One document per line; tokens are split on whitespace.
tagdocs = TaggedLineDocument('full_texts.txt')
d2v_mod = Doc2Vec(min_count=3, vector_size=200, workers=2, window=5, epochs=30, dm=0, dbow_words=1, seed=42)
d2v_mod.build_vocab(tagdocs)
# Reuse the epochs value set on the model rather than passing a different number here.
d2v_mod.train(tagdocs, total_examples=d2v_mod.corpus_count, epochs=d2v_mod.epochs)

d2v_mod.wv.similar_by_word('overdraft',topn=10)
KeyError: "word 'overdraft' not in vocabulary"

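Are you sure 'overdraft' appears at least min_count=3 times in your corpus? (For example, what does grep -c " overdraft " full_texts.txt return? Note that grep -c counts matching lines rather than total occurrences, and the surrounding spaces miss a match at the very start or end of a line.)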
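As a rough cross-check on the grep count, the frequency can also be computed in Python. This is a minimal sketch, assuming whitespace-separated tokens with one document per line (the input layout TaggedLineDocument expects); count_token is a hypothetical helper name, not part of gensim, and the sample lines are made up for illustration.

```python
from collections import Counter


def count_token(lines, token):
    """Count exact whitespace-delimited occurrences of `token` in an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts[token]


# Small in-memory sample standing in for the corpus file (illustrative data only).
sample_lines = [
    "the bank charged an overdraft fee",
    "overdraft limits and overdraft protection",
    "Overdraft. overdrafts are different tokens",
]
sample_count = count_token(sample_lines, "overdraft")
print(sample_count)
```

For the real corpus, the same function could be given the open file object for full_texts.txt in place of sample_lines. Matching is exact, so 'Overdraft.' and 'overdrafts' count as separate tokens; if the total for 'overdraft' is below min_count, the word is dropped when the vocabulary is built.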

(Note also that 1600 docs is a very small corpus for Doc2Vec purposes; published work typically uses at least tens of thousands of docs, and often millions.)

In general, if concerned about getting basic functionality working, good ideas are to:

  • follow trustworthy examples - the gensim docs/notebooks directory includes several Jupyter/IPython notebooks demonstrating doc2vec functionality, including the minimal intro doc2vec-lee.ipynb, also viewable online (but it's best to run locally so you can tinker with specifics to learn)

  • enable logging at the INFO level, and watch the output closely to make sure the various reported progress steps, including counts of words/docs and training durations, indicate everything is working sensibly

  • probe the resulting model for expected behavior. For example, is an expected word present in the learned vocabulary? 'overdrafts' in d2v_mod.wv. How many document tags were learned? len(d2v_mod.docvecs) (in gensim 4.x this attribute is named d2v_mod.dv), and so on.
