
gensim LDA: How can I generate topics with different words for each topic?

I'm using the LDA algorithm from the gensim package to find topics in a given text.

I've been asked to make the resulting topics contain different words from one another, e.g. if topic A has the word 'monkey' in it, then no other topic should include the word 'monkey' in its list.

My thoughts so far: run it multiple times, and each time add the words from the previous run to the stop words list.

My concerns: A) I'm not even sure that this is algorithmically or logically the right thing to do. B) I hope there's a built-in way to do it that I'm not aware of. C) This is a large database, and it takes about 20 minutes to run LDA each time (using the multi-core version).

Question: Is there a better way to do it?

Hope to get some help,

Thanks.

LDA gives, for each topic and each word, the probability that the topic generates that word. You can try assigning each word to exactly one topic by taking, over all topics, the one with the maximum probability of generating that word. In other words, if topic A generates "monkey" with probability 0.01 and topic B generates "monkey" with probability 0.02, then you can assign the word "monkey" to topic B.
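The max-over-topics assignment described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `assign_words_to_topics`, its `top_n` parameter, and the toy probability matrix are hypothetical and not part of gensim. With a trained model, the topic-by-word matrix would presumably come from gensim's `LdaModel.get_topics()`, and the vocabulary from the model's dictionary.

```python
def assign_words_to_topics(topic_word, vocab, top_n=10):
    """Give each word to the single topic that generates it with the
    highest probability, then list each topic's top_n words.

    topic_word: one row per topic, one column per word (probabilities).
    vocab: the word for each column, in the same order.
    """
    num_topics = len(topic_word)
    exclusive = {t: [] for t in range(num_topics)}
    for w, word in enumerate(vocab):
        # Topic with the largest probability for this word
        # (ties go to the lowest topic id).
        best = max(range(num_topics), key=lambda t: topic_word[t][w])
        exclusive[best].append((word, topic_word[best][w]))
    # Within each topic, keep the top_n words by probability.
    return {
        t: [word for word, _ in
            sorted(pairs, key=lambda p: p[1], reverse=True)[:top_n]]
        for t, pairs in exclusive.items()
    }


# Toy data (made up for illustration): 2 topics, 4 words.
vocab = ["monkey", "banana", "river", "money"]
topic_word = [
    [0.01, 0.50, 0.40, 0.09],  # topic 0 ("A")
    [0.02, 0.08, 0.10, 0.80],  # topic 1 ("B")
]
topics = assign_words_to_topics(topic_word, vocab)
print(topics)
```

Here "monkey" ends up only in topic 1, because 0.02 > 0.01. Note that a topic can end up with very few words (or none) if other topics dominate its vocabulary, and the whole step is post-processing on one trained model, so LDA does not have to be rerun.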

I think what you want to do is logically incorrect. Take, for example, a word like "bank", which has two different meanings ("river bank" or "money bank") depending on the context. When you intentionally remove the word from one topic's words, you will probably lose that topic's meaning (especially when the probability of that word is high). Take a look at this:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb

I think the only remaining option (if it is even rational to do) is to use the probabilities of words in topics.
