
Extracting Topic distribution from gensim LDA model

I created an LDA model for some text files using the gensim package in Python. I want to get the topic distributions for the learned model. Is there any method in gensim's LdaModel class, or some other way, to get the topic distributions from the model? For example, I use the coherence model to find the model with the best coherence value over 1 to 5 topics. After getting the best model, I use the get_document_topics method (thanks kenhbs) to get the topic distribution of the document that was used for creating the model.

import gensim
from gensim import corpora

# doc_terms: the tokenised document (a list of token strings)
id2word = corpora.Dictionary([doc_terms])
bow = id2word.doc2bow(doc_terms)

max_coherence = -1
best_lda_model = None

for num_topics in range(1, 6):
    lda_model = gensim.models.ldamodel.LdaModel(corpus=[bow], num_topics=num_topics, id2word=id2word)
    coherence_model = gensim.models.CoherenceModel(model=lda_model, texts=[doc_terms], dictionary=id2word)
    coherence_value = coherence_model.get_coherence()
    if coherence_value > max_coherence:
        max_coherence = coherence_value
        best_lda_model = lda_model

The best model has 4 topics:

print(best_lda_model.num_topics)

4

But when I use get_document_topics, I get fewer than 4 values in the document's distribution.

topic_distrs = best_lda_model.get_document_topics(bow)

print(len(topic_distrs))

3

My question is: for the best LDA model with 4 topics (chosen via the coherence model), why does get_document_topics return fewer topics for the same document? And why do some topics have a very small probability (less than 1e-8)?
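For context, gensim's get_document_topics drops topics whose probability falls below the model's minimum_probability cutoff (0.01 by default in recent versions), which is the likely reason fewer than num_topics entries come back. A minimal sketch of that filtering, with made-up probabilities rather than real gensim output:

```python
# Hypothetical full topic distribution for one document: (topic_id, probability).
full_dist = [(0, 0.62), (1, 0.20), (2, 0.17), (3, 0.01 - 1e-9)]

# get_document_topics applies a cutoff along these lines (default
# minimum_probability=0.01), so near-zero topics are silently dropped.
minimum_probability = 0.01
filtered = [(t, p) for t, p in full_dist if p >= minimum_probability]

print(len(filtered))  # 3 -- only 3 of the 4 topics survive the cutoff
```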

From the documentation, you can use two methods for this.

If you are aiming to get the main terms in a specific topic, use get_topic_terms:

from gensim.models.ldamodel import LdaModel

K = 10
lda = LdaModel(some_corpus, num_topics=K)

lda.get_topic_terms(5, topn=10)
# Or for all topics
for i in range(K):
    lda.get_topic_terms(i, topn=10)

You can also print the entire underlying np.ndarray (called either beta or phi in standard LDA papers; dimensions are (K, V) or (V, K)).

phi = lda.get_topics()
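To sketch what get_topics() returns (using a toy NumPy matrix rather than a real gensim model): each of the K rows is a probability distribution over the V vocabulary terms, so each row sums to 1.

```python
import numpy as np

# Toy stand-in for lda.get_topics(): K topics x V vocabulary terms.
K, V = 3, 5
rng = np.random.default_rng(0)
phi = rng.random((K, V))
phi = phi / phi.sum(axis=1, keepdims=True)  # normalise each row to a distribution

print(phi.shape)        # (3, 5)
print(phi.sum(axis=1))  # each row sums to 1
```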

Edit: From the link I included in the original answer: if you are looking for a document's topic distribution, use

res = lda.get_document_topics(bow)

As can be read from the documentation, the resulting object contains the following three lists:

  • list of (int, float) – Topic distribution for the whole document. Each element in the list is a pair of a topic's id and the probability that was assigned to it.

  • list of (int, list of (int, float)), optional – Most probable topics per word. Each element in the list is a pair of a word's id and a list of topics sorted by their relevance to this word. Only returned if per_word_topics was set to True.

  • list of (int, list of float), optional – Phi relevance values, multiplied by the feature length, for each word-topic combination. Each element in the list is a pair of a word's id and a list of the phi values between this word and each topic. Only returned if per_word_topics was set to True.

Now,

tops, probs = zip(*res)

probs will contain K (for you, 4) probabilities, provided no topics were filtered out by the minimum_probability cutoff. Some may be (near) zero, but they should sum to 1.
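When some topics are filtered out, a common pattern for getting a fixed-length vector anyway is to fill the missing topic ids with zero. A sketch with made-up values (topic 1 pretend-filtered out):

```python
# Hypothetical sparse output of get_document_topics: topic 1 was dropped.
K = 4
res = [(0, 0.62), (2, 0.2), (3, 0.17)]

# Expand to a dense length-K vector, defaulting missing topics to 0.0.
dense = [dict(res).get(topic_id, 0.0) for topic_id in range(K)]

print(dense)  # [0.62, 0.0, 0.2, 0.17]
```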

You can use the minimum_probability parameter and set it to a very small value, for example 0.000001.

topic_vector = [x[1] for x in ldamodel.get_document_topics(new_doc_bow, minimum_probability=0.0, per_word_topics=False)]

Just type:

import pandas as pd

pd.DataFrame(lda_model.get_document_topics(doc_term_matrix))
