How to get document vectors for a given topic in gensim

I have about 9000 documents and I am using Gensim's doc2vec to embed them. My code is as follows:

import json
from collections import namedtuple

from gensim.models import doc2vec

# input_file is assumed to be defined elsewhere as the path to the JSON dataset
with open(input_file) as f:
    dataset = json.load(f)

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')

for description in dataset:
    tags = [description[0]]  # one tag per document
    words = description[1]   # Doc2Vec expects a list of tokens here
    docs.append(analyzedDocument(words, tags))

model = doc2vec.Doc2Vec(docs, vector_size=100, window=10, min_count=1, workers=4, epochs=20)

I would like to get all the documents related to the topic "deep learning", i.e. the documents whose content is mainly about deep learning. Is it possible to do this with the doc2vec model in gensim?

I am happy to provide more details if needed.

If there was a document in your training set that was a great example of "deep learning" (say, docs[17]), then after successful training you could ask for documents similar to that example document, and that could be roughly what you need. For example:

sims = model.docvecs.most_similar(docs[17].tags[0])

sims would then hold a ranked list of the 10 documents most similar to the target document, each entry pairing a document tag with its similarity score. (In gensim 4.x the same collection is exposed as model.dv; docvecs is the older name.)
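To make the "ranked, scored list" more concrete, here is a minimal sketch in plain Python of the kind of ranking most_similar performs: cosine similarity between the target document's vector and every other document vector, sorted from highest to lowest. This is not gensim's implementation. The helper names and the tiny 3-dimensional vectors are made up for illustration and stand in for the 100-dimensional vectors a trained model would hold.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(doc_vectors, target_tag, topn=10):
    """Rank every other document by cosine similarity to the target."""
    target = doc_vectors[target_tag]
    scored = [
        (tag, cosine(target, vec))
        for tag, vec in doc_vectors.items()
        if tag != target_tag  # the query document itself is skipped
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical vectors; a real model would supply these after training.
doc_vectors = {
    "dl_example": [0.9, 0.1, 0.0],
    "dl_other": [0.8, 0.2, 0.1],
    "cooking": [0.0, 0.1, 0.9],
    "stats": [0.5, 0.5, 0.2],
}

sims = most_similar(doc_vectors, "dl_example", topn=2)
for tag, score in sims:
    print(tag, round(score, 3))
```

With real data, the quality of the result depends on how well docs[17] represents the topic, so it may be worth trying a few different example documents and comparing the lists they return.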
