
Calculating optimal number of topics for topic modeling (LDA)

I am doing topic modeling with LDA. I ran my commands to find the optimal number of topics, and the output was as follows. It looks a bit different from any other plot I have seen. Do you think it is okay, or would it be better to use an algorithm other than LDA? It is worth mentioning that when I run my commands to visualize the topic keywords for 10 topics, the plot shows 2 main topics, and the others overlap strongly. Is there a valid range for coherence?

Many thanks for sharing your comments, as I am a beginner in topic modeling.

[image: output plot]

Shameless self-promotion: I suggest you use the OCTIS library: https://github.com/mind-Lab/octis It allows you to run different topic models and optimize their hyperparameters (including the number of topics) in order to select the best result.

There might be many reasons why you get those results, but here are some hints and observations:

  • Make sure that you've preprocessed the text appropriately. This usually includes removing punctuation and numbers, removing stopwords and words that are too frequent or rare, and (optionally) lemmatizing the text. Preprocessing depends on the language and the domain of the texts.
  • LDA is a probabilistic model, which means that if you re-train it with the same hyperparameters, you will get different results each time. A good practice is to run the model with the same number of topics multiple times and then average the topic coherence.
  • There are a lot of topic models, and LDA usually works fine. The choice of topic model depends on the data you have. For example, if you are working with tweets (i.e. short texts), I wouldn't recommend LDA because it cannot handle sparse texts well.
  • Check how you set the hyperparameters. They may have a huge impact on the performance of the topic model.
  • The range for coherence (I assume you used NPMI, which is the most well-known) is between -1 and 1, but values very close to the upper and lower bounds are quite rare.
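To see why NPMI is bounded in [-1, 1], here is a tiny self-contained illustration (pure Python; the probabilities are made-up numbers, not estimates from a real corpus):

```python
import math

def npmi(p_xy, p_x, p_y):
    """Normalized pointwise mutual information of two words.
    p_xy: joint probability of co-occurrence; p_x, p_y: marginals."""
    if p_xy == 0:
        return -1.0  # limit of the formula as p_xy -> 0
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

# Words that always co-occur: joint equals both marginals -> NPMI = 1
always = npmi(0.1, 0.1, 0.1)

# Independent words: joint equals product of marginals -> NPMI = 0
indep = npmi(0.25, 0.5, 0.5)

# Words that never co-occur -> NPMI = -1
never = npmi(0.0, 0.5, 0.5)
```

Real topic-keyword pairs are rarely perfectly dependent or perfectly disjoint, which is why coherence scores near ±1 are uncommon in practice.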

References: https://www.aclweb.org/anthology/2021.eacl-demos.31/
