How to get the nearest documents for a word in gensim in python

I am using the doc2vec model as follows to construct my document vectors.

import json
from collections import namedtuple

from gensim.models import doc2vec

dataset = json.load(open(input_file))

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')

for description in dataset:
    tags = [description[0]]
    words = description[1]
    docs.append(analyzedDocument(words, tags))

model = doc2vec.Doc2Vec(docs, vector_size=100, window=10, min_count=1, workers=4, epochs=20)

I have seen that gensim doc2vec also includes word vectors. Suppose I have a word vector created for the word deep learning. My question is: is it possible to get the documents nearest to the deep learning word vector in gensim in python?

I am happy to provide more details if needed.

Some Doc2Vec modes will co-train doc-vectors and word-vectors in the "same space". Then, if you have a word-vector for 'deep_learning', you can ask for documents near that vector, and the results may be useful for you. For example:

# In gensim >= 4.0, model.docvecs was renamed to model.dv.
similar_docs = d2v_model.docvecs.most_similar(
                   positive=[d2v_model.wv['deep_learning']]
               )

But:

  • that's only going to be as good as how well your model learned 'deep_learning' as a word meaning what you think it means

  • a training set of known-good documents fitting the category 'deep_learning' (and other categories) could work better, whether you hand-curate those or bootstrap from other sources (such as the Wikipedia category 'Deep Learning' or other curated/search-result sets that you trust)

  • reducing a category to a single summary point (one vector) may not be as good as having a variety of examples, many points, that all fit the category. (Relevant docs may not form a neat sphere around a summary point, but rather populate exotically-shaped regions of the high-dimensional doc-vector space.) If you have many good examples of each category, you could train a classifier to label, or rank relative to the trained categories, any further uncategorized docs.
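(A hypothetical sketch of the classifier idea from the last bullet, using scikit-learn. The `doc_vectors` and `labels` here are random placeholders standing in for vectors pulled from the trained model and your curated category labels; the ranking step scores new, uncategorized docs against one category.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholders: in practice doc_vectors would come from the trained
# Doc2Vec model, and labels from a hand-curated training set.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(40, 100))   # 40 docs, 100-dim vectors
labels = np.array([0, 1] * 20)             # e.g. 1 = 'deep_learning', 0 = not

clf = LogisticRegression(max_iter=1000).fit(doc_vectors, labels)

# Rank new, uncategorized docs by predicted probability of the category.
new_vecs = rng.normal(size=(5, 100))
scores = clf.predict_proba(new_vecs)[:, 1]
ranked = np.argsort(-scores)               # indices, best match first
```

With many categories you would fit a multi-class classifier the same way, or one binary classifier per category.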
