How to get similar words from a custom input dictionary of word to vectors in gensim

I am working on a document similarity problem. For each document, I retrieve the vectors for each of its words (from a pre-trained word embedding model) and average them to get the document vector. I end up with a dictionary (say, my_dict) that maps each document in my collection to its vector.
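To make the setup concrete, here is a minimal sketch of how such a my_dict could be built. The tiny word_vectors table stands in for the pre-trained embedding model, and the document names and tokens are made up for illustration; none of them come from my actual data.

```python
import numpy as np

# Hypothetical stand-in for a pre-trained embedding model: word -> vector.
word_vectors = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.8, 0.2, 0.0]),
    "car": np.array([0.0, 1.0, 0.0]),
    "road": np.array([0.0, 0.8, 0.2]),
}

# Hypothetical tokenized documents: document name -> list of words.
documents = {
    "doc_pets": ["cat", "dog"],
    "doc_traffic": ["car", "road"],
    "doc_mixed": ["dog", "car", "unknownword"],
}


def average_vector(tokens, vectors):
    """Return the mean vector of all known tokens, or None if none is known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)


# my_dict maps each document name to its averaged vector.
my_dict = {
    name: average_vector(tokens, word_vectors)
    for name, tokens in documents.items()
}
```

Words missing from the embedding table are simply skipped here, which is one possible choice among several.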

I want to feed this dictionary to gensim and, for each document, get the other documents in my_dict that are closest to it. How could I do that?

You might want to consider rephrasing your question (from the title, you are looking for word similarity; from the description, I gather you want document similarity) and adding a little more detail to the description. Without more detailed information about what you want and what you have tried, it is difficult to help you, because you could want to do a whole bunch of different things. That being said, I think I can help you out generally, even without knowing exactly what you want gensim to do. gensim is quite powerful and offers lots of different functionality.

Assuming your dictionary is already saved in gensim format, you can load it like this:

from gensim import corpora
dictionary = corpora.Dictionary.load('my_dict.dict')

There, now you can use it with gensim and run analyses and models to your heart's desire. For similarities between words, you can play around with pre-made functions such as the most_similar method of a trained Word2Vec model, for example model.wv.most_similar('word_one').

For document similarity with a trained LDA model, see this Stack Overflow question.

For a more detailed explanation, see this gensim tutorial, which uses cosine similarity as a measure of similarity between documents.
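For illustration only, cosine similarity between the document vectors can also be computed directly with NumPy, without any gensim index. This is a rough sketch with hypothetical document names and vectors, not the tutorial's code.

```python
import numpy as np

# Hypothetical document vectors standing in for my_dict.
my_dict = {
    "doc_pets": np.array([0.9, 0.1, 0.0]),
    "doc_traffic": np.array([0.0, 0.9, 0.1]),
    "doc_mixed": np.array([0.4, 0.6, 0.0]),
}

names = list(my_dict)
matrix = np.vstack([my_dict[name] for name in names])

# Scale every row to unit length; the dot product of two unit vectors
# is their cosine similarity.
unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
similarities = unit @ unit.T


def closest_documents(name, topn=2):
    """Return the topn other documents ranked by cosine similarity to `name`."""
    row = similarities[names.index(name)]
    order = np.argsort(-row)  # indices sorted by descending similarity
    ranked = [(names[i], float(row[i])) for i in order if names[i] != name]
    return ranked[:topn]
```

Calling closest_documents("doc_mixed") returns (name, similarity) pairs for the other documents, most similar first. The full similarity matrix grows quadratically with the number of documents, so this only suits small collections.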

gensim has a bunch of pre-made functionality that does not require LDA, for example gensim.similarities.MatrixSimilarity from similarities.docsim. I would recommend looking at the documentation and examples.

Also, to avoid a bunch of pitfalls: is there a specific reason to average the vectors yourself (or even to average them at all)? You do not need to do this (gensim has a few more sophisticated methods that map documents to vectors for you, such as models.doc2vec), and averaging might lose valuable information.
