简体   繁体   English

在Gensim LDA中记录主题分布

[英]Document topical distribution in Gensim LDA

I've derived a LDA topic model using a toy corpus as follows: 我使用玩具语料库得出了一个LDA主题模型,如下所示:

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = corpora.Dictionary(texts)

id2word = {}
for word in dictionary.token2id:    
    id2word[dictionary.token2id[word]] = word

I found that when I use a small number of topics to derive the model, Gensim yields a full report of topical distribution over all potential topics for a test document. 我发现当我使用少量主题来推导模型时,Gensim会生成一份关于测试文档所有潜在主题的主题分布的完整报告。 Eg: 例如:

test_lda = LdaModel(corpus,num_topics=5, id2word=id2word)
test_lda[dictionary.doc2bow('human system')]

Out[314]: [(0, 0.59751626959781134),
(1, 0.10001902477790173),
(2, 0.10001375856907335),
(3, 0.10005453508763221),
(4, 0.10239641196758137)]

However when I use a large number of topics, the report is no longer complete: 但是,当我使用大量主题时,报告不再完整:

test_lda = LdaModel(corpus,num_topics=100, id2word=id2word)

test_lda[dictionary.doc2bow('human system')]
Out[315]: [(73, 0.50499999999997613)]

It seems to me that topics with a probability less than some threshold (I observed 0.01 to be more specific) are omitted form the output. 在我看来,输出中省略了概率小于某个阈值的主题(我观察到0.01更具体)。

I'm wondering if this behaviour is due to some aesthetic considerations? 我想知道这种行为是否是由于某些美学考虑因素造成的? And how can I get the distribution of the probability mass residual over all other topics? 如何在所有其他主题上获得概率质量残差的分布?

Thank you for your kind answer! 谢谢你的回答!

Read the source and it turns out that topics with probabilities smaller than a threshold are ignored. 阅读 ,结果发现概率小于阈值的主题被忽略。 This threshold is with a default value of 0.01. 此阈值的默认值为0.01。

I realise this is an old question but in case someone stumbles upon it, here is a solution (the issue has actually been fixed in the current development branch with a minimum_probability parameter to LdaModel but maybe you're running an older version of gensim). 我意识到这是一个老问题,但万一有人偶然发现它,这里有一个解决方案(这个问题实际上已经在当前开发分支中修复了一个带有minimum_probability参数的LdaModel但也许你正在运行旧版本的gensim)。

define a new function (this is just copied from the source) 定义一个新函数(这只是从源代码复制)

def get_doc_topics(lda, bow):
    gamma, _ = lda.inference([bow])
    topic_dist = gamma[0] / sum(gamma[0])  # normalize distribution
    return [(topicid, topicvalue) for topicid, topicvalue in enumerate(topic_dist)]

the above function does not filter the output topics based on the probability but will output all of them. 上述函数不会根据概率过滤输出主题,但会输出所有这些主题。 If you don't need the (topic_id, value) tuples but just values, just return the topic_dist instead of the list comprehension (it'll be much faster as well). 如果你不需要(topic_id, value)元组而只需要值,只需返回topic_dist而不是list comprehension(它也会更快)。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM