
Extract Word Saliency from Gensim LDA or pyLDAvis

Original post: 2021-10-15 01:46:19. Tags: gensim, lda, topic-modeling, pyldavis

I see that pyLDAvis visualizes each word's saliency under each topic.

But is there a way to extract each word's saliency under each topic? Or how can I calculate each word's saliency directly using Gensim LDA?

In the end, I want a pandas DataFrame in which each row represents one word, each column represents one topic, and each value is the word's saliency under the corresponding topic.

Many thanks in advance.

1 Answer

Gensim's LDA implementation has no out-of-the-box support for this particular 'saliency' calculation from Chuang et al. (2012).

Still, I suspect the model's .get_term_topics() and/or .get_topic_terms() methods provide the supporting data needed to implement that calculation. In particular, one or the other of those methods might provide the P(w | t) term, but a deeper read of the paper would be required to know for sure. (I suspect the P(t) term might require a separate survey of the training data.)
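For reference, Chuang et al. (2012) define saliency(w) = P(w) * sum over t of P(t | w) * log(P(t | w) / P(t)). Note that, as defined there, this gives one value per word rather than one value per word and topic. Below is a minimal pure-Python sketch of that formula, not a Gensim or pyLDAvis API. The function name term_saliency and its two inputs are hypothetical: topic_term holds P(w | t) rows and topic_prob holds an estimate of P(t), both of which you would have to obtain yourself. The sketch also assumes P(w) is the marginal of P(w | t) * P(t); a tool may instead use empirical corpus term frequencies.

```python
import math


def term_saliency(topic_term, topic_prob):
    """Per-term saliency, following the formula in Chuang et al. (2012).

    topic_term: list of rows, topic_term[t][w] = P(w | t)   (assumed input)
    topic_prob: list, topic_prob[t] = P(t)                  (assumed input)
    """
    num_terms = len(topic_term[0])
    saliency = []
    for w in range(num_terms):
        # Joint P(w, t) = P(w | t) * P(t); the marginal P(w) sums over topics.
        joint = [row[w] * p_t for row, p_t in zip(topic_term, topic_prob)]
        p_w = sum(joint)
        distinctiveness = 0.0
        for j, p_t in zip(joint, topic_prob):
            if j > 0.0:  # terms with zero probability contribute nothing
                p_t_given_w = j / p_w
                distinctiveness += p_t_given_w * math.log(p_t_given_w / p_t)
        saliency.append(p_w * distinctiveness)
    return saliency


# Toy data: 2 topics, 3 terms. Term 1 is equally likely under both topics.
topic_term = [[0.8, 0.1, 0.1],
              [0.1, 0.1, 0.8]]
topic_prob = [0.5, 0.5]
scores = term_saliency(topic_term, topic_prob)
```

In the toy data, the middle term is spread evenly across topics, so its distinctiveness (and therefore its saliency) is zero, while the two topic-specific terms score higher.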

From the class docs:

https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_term_topics

Returns: The relevant topics represented as pairs of their ID and their assigned probability, sorted by relevance to the given word.

https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_topic_terms

Returns: Word ID - probability pairs for the most relevant words generated by the topic.
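As a rough sketch of how .get_topic_terms() output could be arranged into the word-by-topic layout the question asks for, the helper below collects the per-topic probabilities into one row per word. The name topic_term_table is hypothetical. It assumes the model exposes num_topics, an id2word mapping that supports len() and lookup by word ID, and get_topic_terms(topicid, topn=...). The values are the probabilities that method returns, not saliency scores.

```python
def topic_term_table(model, topn=None):
    """Map each word to a list of probabilities, one entry per topic.

    `model` is assumed to behave like a trained LdaModel: it needs
    `num_topics`, `id2word`, and `get_topic_terms(topicid, topn=...)`.
    """
    num_topics = model.num_topics
    if topn is None:
        # Ask for every word in the vocabulary rather than only the top few.
        topn = len(model.id2word)
    table = {}
    for topic_id in range(num_topics):
        for word_id, prob in model.get_topic_terms(topic_id, topn=topn):
            word = model.id2word[word_id]
            # Words not returned for a topic keep the default value 0.0.
            row = table.setdefault(word, [0.0] * num_topics)
            row[topic_id] = float(prob)
    return table
```

If pandas is available, pd.DataFrame.from_dict(topic_term_table(lda), orient="index") should give one row per word and one column per topic, where lda stands for your own trained model.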

I hadn't come across this particular 'saliency' calculation before, but if it is popular among LDA users, or of potential general use, and you figure out how to calculate it, it would likely be a welcome contribution to the Gensim project, especially if it can be a simple extra convenience method on LdaModel.