How can I get the complete topic distribution for a document using gensim LDA?

Question

When I train my LDA model like this:

import multiprocessing

from gensim import corpora
from gensim.models import LdaMulticore

dictionary = corpora.Dictionary(data)  # data: a list of tokenized documents
corpus = [dictionary.doc2bow(doc) for doc in data]
num_cores = multiprocessing.cpu_count()
num_topics = 50
lda = LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary,
                   workers=num_cores, alpha=1e-5, eta=5e-1)

I want to get the full topic distribution over all num_topics for every document. That is, in this particular case, I want each document to have all 50 topics contributing to its distribution, and I want to be able to access the contribution of each of the 50 topics. This is what LDA should output if it adheres strictly to the mathematics of LDA. However, gensim only outputs topics whose probability exceeds a certain threshold, as shown here. For example, if I try

lda[corpus[89]]
>>> [(2, 0.38951721864890398), (9, 0.15438596408262636), (37, 0.45607443684895665)]

this shows only the 3 topics that contribute most to document 89. I have tried the solution in the link above, but it does not work for me. I still get the same output:

theta, _ = lda.inference(corpus)
theta /= theta.sum(axis=1)[:, None]

This produces the same output, i.e. only 2 or 3 topics per document.

My question is: how do I change this threshold so that I can access the FULL topic distribution for each document, no matter how insignificant a topic's contribution to that document is? The reason I want the full distribution is so that I can perform a KL similarity search between the documents' distributions.

Thanks in advance.

Answer 1

It doesn't seem that anyone has replied yet, so I'll try to answer this as best I can given the gensim documentation.

It seems you need to set the minimum_probability parameter to 0.0 when constructing the model to get the desired results:

lda = LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary, workers=num_cores, alpha=1e-5, eta=5e-1,
              minimum_probability=0.0)

lda[corpus[233]]
>>> [(0, 5.8821799358842424e-07),
 (1, 5.8821799358842424e-07),
 (2, 5.8821799358842424e-07),
 (3, 5.8821799358842424e-07),
 (4, 5.8821799358842424e-07),
 (5, 5.8821799358842424e-07),
 (6, 5.8821799358842424e-07),
 (7, 5.8821799358842424e-07),
 (8, 5.8821799358842424e-07),
 (9, 5.8821799358842424e-07),
 (10, 5.8821799358842424e-07),
 (11, 5.8821799358842424e-07),
 (12, 5.8821799358842424e-07),
 (13, 5.8821799358842424e-07),
 (14, 5.8821799358842424e-07),
 (15, 5.8821799358842424e-07),
 (16, 5.8821799358842424e-07),
 (17, 5.8821799358842424e-07),
 (18, 5.8821799358842424e-07),
 (19, 5.8821799358842424e-07),
 (20, 5.8821799358842424e-07),
 (21, 5.8821799358842424e-07),
 (22, 5.8821799358842424e-07),
 (23, 5.8821799358842424e-07),
 (24, 5.8821799358842424e-07),
 (25, 5.8821799358842424e-07),
 (26, 5.8821799358842424e-07),
 (27, 0.99997117731831464),
 (28, 5.8821799358842424e-07),
 (29, 5.8821799358842424e-07),
 (30, 5.8821799358842424e-07),
 (31, 5.8821799358842424e-07),
 (32, 5.8821799358842424e-07),
 (33, 5.8821799358842424e-07),
 (34, 5.8821799358842424e-07),
 (35, 5.8821799358842424e-07),
 (36, 5.8821799358842424e-07),
 (37, 5.8821799358842424e-07),
 (38, 5.8821799358842424e-07),
 (39, 5.8821799358842424e-07),
 (40, 5.8821799358842424e-07),
 (41, 5.8821799358842424e-07),
 (42, 5.8821799358842424e-07),
 (43, 5.8821799358842424e-07),
 (44, 5.8821799358842424e-07),
 (45, 5.8821799358842424e-07),
 (46, 5.8821799358842424e-07),
 (47, 5.8821799358842424e-07),
 (48, 5.8821799358842424e-07),
 (49, 5.8821799358842424e-07)]
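
If some topics are still missing from the returned list (for example with the default threshold, or because a particular gensim version clamps the threshold to a tiny positive value), the sparse list of (topic_id, probability) pairs can be expanded into a fixed-length vector. This is only a minimal sketch in plain Python; the helper name to_dense is made up for illustration and is not part of gensim, and the sample pairs are the ones shown in the question.

```python
def to_dense(topic_pairs, num_topics):
    """Expand [(topic_id, prob), ...] into a list of length num_topics.

    Hypothetical helper (not a gensim function): topics that are absent
    from the sparse list are filled in with probability 0.0.
    """
    dense = [0.0] * num_topics
    for topic_id, prob in topic_pairs:
        dense[topic_id] = float(prob)
    return dense


# Sparse output for document 89, copied from the question (50 topics).
sparse = [(2, 0.38951721864890398),
          (9, 0.15438596408262636),
          (37, 0.45607443684895665)]

dense = to_dense(sparse, num_topics=50)
print(len(dense))          # one entry per topic
print(dense[2], dense[3])  # a reported topic and a missing one
```

The dense vector always has one entry per topic, which makes documents directly comparable position by position.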

Answer 2

In case it helps someone else:

After training your LDA model, if you want to get all topics of a document without a lower threshold cutting any of them off, set minimum_probability to 0 when calling the get_document_topics method.

ldaModel.get_document_topics(bagOfWordOfADocument, minimum_probability=0.0)
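
Since the goal in the question is a KL similarity search, here is a rough sketch of the KL divergence between two documents' topic distributions. It assumes each distribution is already a dense list of probabilities of the same length (one entry per topic). The function name kl_divergence, the smoothing constant eps, and the two toy distributions are assumptions made for this example, not something from gensim or from the question.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two dense probability lists of equal length.

    eps is an assumed smoothing constant: it is added to every entry
    before renormalizing so that zero probabilities do not cause log(0)
    or division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    p_smooth = [x + eps for x in p]
    q_smooth = [x + eps for x in q]
    p_total = sum(p_smooth)
    q_total = sum(q_smooth)
    return sum(
        (a / p_total) * math.log((a / p_total) / (b / q_total))
        for a, b in zip(p_smooth, q_smooth)
    )


# Toy 3-topic distributions, made up for illustration.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]

print(kl_divergence(doc_a, doc_a))  # identical distributions: essentially 0
print(kl_divergence(doc_a, doc_b))  # differing distributions: positive
```

Note that KL divergence is not symmetric in general, so KL(p || q) and KL(q || p) can differ; for a search you may want to fix one direction or use a symmetric variant.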


Answer 1: 7, accepted, 2017-07-27 17:47:10

Answer 2: 2, 2018-10-19 13:42:25
