Fixed-size topic vectors in gensim LDA topic modelling for finding similar texts

Question

I use gensim LDA topic modelling to find the topics of each document and to check the similarity between documents by comparing the resulting topic vectors. Each document is assigned a different number of matching topics, so comparing the vectors by cosine similarity does not work, because it requires vectors of the same length.

This is the related code:

from gensim import models

lda_model_bow = models.LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=3, passes=1, random_state=47)

#---------------Calculating and Viewing the topics----------------------------
# Convert each filtered text to a bag-of-words vector
vec_bows = [dictionary.doc2bow(filtered_text.split()) for filtered_text in filtered_texts]

# Infer the topic distribution of each document
vec_lda_topics = [lda_model_bow[vec_bow] for vec_bow in vec_bows]

for doc_id, vec_lda_topic in enumerate(vec_lda_topics):
    print('document ', doc_id, 'topics: ', vec_lda_topic)

The output vectors are:

document  0 topics:  [(1, 0.25697246), (2, 0.08026043), (3, 0.65391296)]
document  1 topics:  [(2, 0.93666667)]
document  2 topics:  [(2, 0.07910537), (3, 0.20132676)]
.....

As you can see, each vector has a different length, so it is not possible to compute cosine similarity between them.

I would like the output to be:

document  0 topics:  [(1, 0.25697246), (2, 0.08026043), (3, 0.65391296)]
document  1 topics:  [(1, 0.0), (2, 0.93666667), (3, 0.0)]
document  2 topics:  [(1, 0.0), (2, 0.07910537), (3, 0.20132676)]
.....

Any ideas how to do it? Thanks.

Answer 1

I have used gensim for topic modeling before and have not faced this issue. Ideally, if you pass num_topics=3, it returns the top 3 topics with the highest probability for each document. You should then be able to generate the cosine similarity matrix by doing something like this:

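from gensim import models, similarities

lda_model_bow = models.LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=3, passes=1, random_state=47)
# Apply the model to the whole corpus, then index the topic vectors for cosine similarity
vec_lda_topics = lda_model_bow[bow_corpus]
sim_matrix = similarities.MatrixSimilarity(vec_lda_topics)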

But if, for some reason, you are getting an unequal number of topics, you can assume a zero probability for the remaining topics and include them in your vector when you calculate similarity.
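A minimal sketch of that zero-padding idea, in plain Python so it runs without gensim. The helper names pad_topics and cosine_similarity are hypothetical, and the sample values are the ones from the question with topic IDs renumbered to start at 0, which is an assumption about how the model numbers its topics. gensim's own matutils.sparse2full(doc, length) performs the same padding, and matutils.cossim accepts the sparse vectors directly.

```python
import math

def pad_topics(sparse_vec, num_topics):
    """Expand [(topic_id, prob), ...] into a dense list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_vec:
        dense[topic_id] = prob  # topics missing from sparse_vec stay at 0.0
    return dense

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Values from the question; topic IDs shifted to 0-based (assumption).
doc_topics = [
    [(0, 0.25697246), (1, 0.08026043), (2, 0.65391296)],
    [(1, 0.93666667)],
    [(1, 0.07910537), (2, 0.20132676)],
]

dense_vectors = [pad_topics(vec, num_topics=3) for vec in doc_topics]
for doc_id, vec in enumerate(dense_vectors):
    print('document', doc_id, 'dense topics:', vec)

print('similarity(0, 2):', cosine_similarity(dense_vectors[0], dense_vectors[2]))
```

After padding, every vector has length 3, so any pair of documents can be compared.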

P.S.: If you could provide a sample of your input documents, it would be easier to reproduce your output and look into it.

Answer 2

So, as panktijk said in the comments, and as also noted in this topic, the solution is to change minimum_probability from its default value of 0.01 to 0.0.
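One way to apply this, sketched under assumptions: the helper name full_topic_distribution is hypothetical, and lda_model is assumed to be a trained gensim LdaModel such as lda_model_bow from the question. The sketch relies on get_document_topics accepting a per-call minimum_probability argument; the same threshold can also be set once by passing minimum_probability=0.0 to the models.LdaModel constructor.

```python
def full_topic_distribution(lda_model, bow):
    """Return (topic_id, probability) pairs without the default 0.01 cut-off.

    lda_model is assumed to be a trained gensim LdaModel and bow a
    bag-of-words vector such as dictionary.doc2bow(tokens).
    """
    # minimum_probability=0.0 overrides the model-level threshold for this
    # call, so low-probability topics are no longer filtered out.
    topics = lda_model.get_document_topics(bow, minimum_probability=0.0)
    # Sort by topic id so position i always refers to topic i.
    return sorted(topics, key=lambda pair: pair[0])
```

With the names from the question, this would be used as vec_lda_topics = [full_topic_distribution(lda_model_bow, vec_bow) for vec_bow in vec_bows], after which the vectors should all have the same length.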

Answer 1
1 2018-11-21 17:18:47

Answer 2
0 Accepted 2018-11-21 19:02:42
