
Extracting Topic distribution from gensim LDA model

I created an LDA model for some text files using the gensim package in Python. I want to get the topic distributions for the learned model. Is there any method in gensim's LdaModel class, or some other way, to get the topic distributions from the model? For example, I use the coherence model to find the model with the best coherence value over 1 to 5 topics. After getting the best model, I use the get_document_topics method (thanks kenhbs) to get the topic distribution of the document that was used for creating the model.

import gensim
from gensim import corpora

# doc_terms: the tokenised document (a list of token strings)
id2word = corpora.Dictionary([doc_terms])
bow = id2word.doc2bow(doc_terms)

max_coherence = -1
best_lda_model = None

for num_topics in range(1, 6):
    lda_model = gensim.models.ldamodel.LdaModel(corpus=[bow], num_topics=num_topics, id2word=id2word)
    coherence_model = gensim.models.CoherenceModel(model=lda_model, texts=[doc_terms], dictionary=id2word)
    coherence_value = coherence_model.get_coherence()
    if coherence_value > max_coherence:
        max_coherence = coherence_value
        best_lda_model = lda_model

The best model has 4 topics:

print(best_lda_model.num_topics)

4

But when I use get_document_topics, I get fewer than 4 values in the document's distribution.

topic_distrs = best_lda_model.get_document_topics(bow)

print(len(topic_distrs))

3

My question is: for the best LDA model with 4 topics (chosen via the coherence model), why does get_document_topics return fewer topics for the same document? And why do some topics have a very small probability (less than 1e-8)?
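For context, gensim's get_document_topics drops topics whose probability falls below the model's minimum_probability cutoff (0.01 by default in recent versions), which is the likely reason fewer than num_topics entries come back. A minimal sketch of that filtering, with made-up probabilities rather than real gensim output:

```python
# Hypothetical full topic distribution for one document: (topic_id, probability).
full_dist = [(0, 0.62), (1, 0.20), (2, 0.17), (3, 0.01 - 1e-9)]

# get_document_topics applies a cutoff along these lines (default
# minimum_probability=0.01), so near-zero topics are silently dropped.
minimum_probability = 0.01
filtered = [(t, p) for t, p in full_dist if p >= minimum_probability]

print(len(filtered))  # 3 -- only 3 of the 4 topics survive the cutoff
```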

From the documentation, you can use two methods for this.

If you are aiming to get the main terms in a specific topic, use get_topic_terms:

from gensim.models.ldamodel import LdaModel

K = 10
lda = LdaModel(some_corpus, num_topics=K)

lda.get_topic_terms(5, topn=10)
# Or for all topics
for i in range(K):
    lda.get_topic_terms(i, topn=10)

You can also print the entire underlying np.ndarray (called either beta or phi in standard LDA papers; dimensions are (K, V) or (V, K)).

phi = lda.get_topics()
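To sketch what get_topics() returns (using a toy NumPy matrix rather than a real gensim model): each of the K rows is a probability distribution over the V vocabulary terms, so each row sums to 1.

```python
import numpy as np

# Toy stand-in for lda.get_topics(): K topics x V vocabulary terms.
K, V = 3, 5
rng = np.random.default_rng(0)
phi = rng.random((K, V))
phi = phi / phi.sum(axis=1, keepdims=True)  # normalise each row to a distribution

print(phi.shape)        # (3, 5)
print(phi.sum(axis=1))  # each row sums to 1
```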

Edit: From the link I included in the original answer: if you are looking for a document's topic distribution, use

res = lda.get_document_topics(bow)

As can be read from the documentation, the resulting object contains the following three lists:

  • list of (int, float) – Topic distribution for the whole document. Each element in the list is a pair of a topic's id and the probability that was assigned to it.

  • list of (int, list of (int, float)), optional – Most probable topics per word. Each element in the list is a pair of a word's id and a list of topics sorted by their relevance to this word. Only returned if per_word_topics was set to True.

  • list of (int, list of float), optional – Phi relevance values, multiplied by the feature length, for each word-topic combination. Each element in the list is a pair of a word's id and a list of the phi values between this word and each topic. Only returned if per_word_topics was set to True.

Now,

tops, probs = zip(*res)

probs will contain K (for you, 4) probabilities, provided no topics were filtered out by the minimum_probability cutoff. Some may be (near) zero, but they should sum to 1.
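When some topics are filtered out, a common pattern for getting a fixed-length vector anyway is to fill the missing topic ids with zero. A sketch with made-up values (topic 1 pretend-filtered out):

```python
# Hypothetical sparse output of get_document_topics: topic 1 was dropped.
K = 4
res = [(0, 0.62), (2, 0.2), (3, 0.17)]

# Expand to a dense length-K vector, defaulting missing topics to 0.0.
dense = [dict(res).get(topic_id, 0.0) for topic_id in range(K)]

print(dense)  # [0.62, 0.0, 0.2, 0.17]
```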

You can use the minimum_probability parameter and set it to a very small value, for example 0.000001.

topic_vector = [x[1] for x in ldamodel.get_document_topics(new_doc_bow, minimum_probability=0.0, per_word_topics=False)]

Just type:

import pandas as pd

pd.DataFrame(lda_model.get_document_topics(doc_term_matrix))
