简体   繁体   中英

How to get similar words from a custom input dictionary of word to vectors in gensim

I am working on a document similarity problem. For each document, I retrieve the vectors for each of its words (from a pre-trained word embedding model) and average them to get the document vector. I end up having a dictionary (say, my_dict) that maps each document in my collection to its vector.

I want to feed this dictionary to gensim and for each document, get other documents in 'my_dict' that are closer to it. How could I do that?

You might want to consider rephrasing your question (from the title, you are looking for word similarity, from the description I gather you want document similarity) and adding a little more detail in the description. Without more detailed info about what you want and what you have tried, it is difficult to help you achieve what you want, because you could want to do a whole bunch of different things. That being said, I think I can help you out generally, even without know what you want gensim to do. gensim is quite powerful, and offers lots of different functionality.

Assuming your dictionary is already in gensim format, you can load it like this:

from gensim import corpora
dictionary = corpora.Dictionary.load('my_dict.dict')

There - now you can use it with gensim, and run analyses and model to your heart's desire. For similarities between words you can play around with such pre-made functions as gensim.word2vec.most_similar('word_one', 'word_two') etc.

For document similarity with a trained LDA model, see this stackoverflow question .

For a more detailed explanation, see this gensim tutorial which uses cosine similartiy as a measure of similarity between documents.

gensim has a bunch of premade functionality which do not require LDA, for example gensim.similarities.MatrixSimilarity from similarities.docsim , I would recommend looking at the documentation and examples.

Also, in order to avoid a bunch of pitfalls: Is there a specific reason to average the vectors by yourself (or even averaging them at all)? You do not need to do this (gensim has a few more sophisticated methods that achieve a mapping of documents to vectors for you, like models.doc2vec ), and might lose valuable information.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM