
Gensim Doc2Vec Most_Similar

I'm having trouble with the most_similar method in Gensim's Doc2Vec model. When I run most_similar, I only get similarity scores for the first 10 tagged documents (their tags are always 0-9). In the code below I have topn=5, but I've also tried topn=len(documents) and I still only get similarities for those first 10 documents.

Tagged documents:

import gensim
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
taggeddoc = []

for index, wod in enumerate(model_data):
    tokens = tokenizer.tokenize(wod)
    td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(tokens))).split(), str(index))
    taggeddoc.append(td)

documents=taggeddoc

Instantiate the model:

model=gensim.models.Doc2Vec(documents, dm=0, dbow_words=1, iter=1, alpha=0.025, min_alpha=0.025, min_count=10)

Train the model:

for epoch in range(100):
    if epoch % 10 == 0:
        print("Training epoch {}".format(epoch))
    model.train(documents, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.002
    model.min_alpha = model.alpha

Problem is here (I think):

new = model_data[100].split()
new_vector = model.infer_vector(new)
sims = model.docvecs.most_similar([new_vector], topn=5)
print(sims)

Output:

[('3', 0.3732905089855194), ('1', 0.36121609807014465), ('7', 0.35790640115737915), ('9', 0.3569292724132538), ('2', 0.3521473705768585)]

The length of documents is the same before and after training the model, so I'm not sure why it only returns similarities for the first 10 documents.

Side question: in anyone's experience, is it better to use Word2Vec or Doc2Vec if the input documents are very short (~50 words) and there are >2,000 of them? Thanks for the help!

The second argument to TaggedDocument(), tags, should be a list of tags, not a single string.

By supplying single strings of simple integers like '109', each string is interpreted as the list-of-tags ['1', '0', '9'] - and thus across your whole corpus, only 10 unique tags, the digits 0-9, will ever be encountered/trained.

Make it a single-tag list, like [str(index)], and you'll get results more like what you expect.

Regarding your side question: both Word2Vec and Doc2Vec work best on large corpora with millions of words of training data. A mere 2,000 documents at no more than 50 words each, giving at most 100,000 training words, is very, very small for these algorithms. You might be able to eke out some slight results by using a much smaller vector size and many more training passes, but that's not the kind of dataset/problem on which these algorithms work well.

Separately, your training code is totally wrong.

  • If you supply documents to the Doc2Vec initialization, it will do all of its needed vocabulary-discovery and training passes automatically – don't call train() again.

  • And if for some reason you don't provide documents at initialization, you should typically then call both build_vocab() and train() each exactly once.

  • Almost no one should be changing min_alpha or calling train() more than once in an explicit loop: you are almost certain to do it wrong, as here, where you decrement the effective alpha from 0.025 by 0.002 over 100 loops, winding up with a nonsensical negative learning rate of -0.175. Don't do this, and if you copied this approach from what seemed to be a credible online source, please let that source know their code is confused.
