
LDA: topic model gensim gives same set of topics

Why am I getting the same set of topic words in my gensim LDA model? I used the parameters below, and I have checked that there are no duplicate documents in my corpus.

import gensim

lda_model = gensim.models.ldamodel.LdaModel(corpus=MY_CORPUS,
                                            id2word=WORD_AND_ID,
                                            num_topics=4,
                                            minimum_probability=minimum_probability,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',  # alternatives: 'symmetric', 'asymmetric'
                                            per_word_topics=True)

Results

[
(0, '0.004*lily + 0.01*rose + 0.00*jasmine'),
(1, '0.005*geometry + 0.07*algebra + 0.01*calculation'),
(2, '0.003*painting + 0.001*brush + 0.01*colors'),
(3, '0.005*geometry + 0.07*algebra + 0.01*calculation')
]

Notice: topics #1 and #3 are identical.

Each topic likely contains a large number of words, each with a different weight. When a topic is displayed (e.g. using lda_model.show_topics()), you only see the few words with the largest weights. This does not mean the topics are identical across the rest of the vocabulary.

You can control the number of displayed words to inspect the remaining weights:

 lda_model.show_topics(num_topics=4, num_words=10, log=False, formatted=True)

and increase the num_words parameter to include even more words.
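For example, here is a minimal sketch that compares the full word distributions of topics #1 and #3 directly, instead of only the displayed top words. It assumes lda_model is the fitted model from the question; get_topics() returns the full topic-term probability matrix.

 import numpy as np

 # Full topic-term matrix, shape (num_topics, vocab_size)
 topic_word = lda_model.get_topics()

 # L1 distance between topics 1 and 3; 0.0 would mean truly identical topics
 diff = np.abs(topic_word[1] - topic_word[3]).sum()
 print(f"L1 distance between topic 1 and topic 3: {diff:.6f}")

 # Show more words per topic to see where the distributions diverge
 for topic_id, words in lda_model.show_topics(num_topics=4, num_words=30,
                                              formatted=True):
     print(topic_id, words)

If the distance is (near) zero, the two topics really did converge to the same distribution and the model parameters are worth revisiting.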

Now, there is also a possibility that:

  • the number of topics should be different (e.g. 3),
  • or minimum_probability should be smaller (what value do you use?),
  • or the number of passes larger,
  • chunksize smaller,
  • the corpus larger (what is its size?) or stripped of stop words (did you do that?).

I encourage you to experiment with different values of these parameters to check whether any combination works better.
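One common way to run such an experiment is to train one model per candidate topic count and score each with topic coherence. A rough sketch, assuming MY_CORPUS and WORD_AND_ID from the question (with WORD_AND_ID being a gensim Dictionary) plus the tokenized documents in a hypothetical MY_TEXTS list:

 from gensim.models import LdaModel, CoherenceModel

 # MY_TEXTS is hypothetical: the tokenized documents, e.g. [['rose', 'lily'], ...]
 for k in (2, 3, 4, 5):
     model = LdaModel(corpus=MY_CORPUS,
                      id2word=WORD_AND_ID,
                      num_topics=k,
                      random_state=100,
                      passes=20,       # more passes than the original 10
                      chunksize=50,    # smaller chunks
                      alpha='auto')
     coherence = CoherenceModel(model=model,
                                texts=MY_TEXTS,
                                dictionary=WORD_AND_ID,
                                coherence='c_v').get_coherence()
     print(f"num_topics={k}: coherence={coherence:.4f}")

A higher coherence score is a useful (though not definitive) signal that the topic count and training settings fit the corpus better.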
