
Creating more relevant results from LDA topic modeling?

Original: 2022-02-27 22:56:23 | tags: nltk, gensim, lda, topic-modeling, word-cloud

I am doing a project for my degree, and I have an actual client from another college. He wants me to do a number of things with topic modeling on an SQL file of paper abstracts he's given me. I have zero experience with topic modeling, but I've been using Gensim and NLTK in a Jupyter notebook for this.

What he wants right now is for me to generate 10 or more topics and record the 10 most common words overall from the LDA results. If a word is very frequent in every topic, I should remove it from the resulting word cloud; if its frequency varies more between topics, I should remove it only from the topics where it is infrequent and keep it in the topics where it is more relevant.

He also wants me to compare the frequency of each topic across the SQL files from other years. And he wants each topic to have a name generated smartly by the computer.

I have topic models per year and overall, but of course the topics do not appear in exactly the same way in each year. My biggest concern is the first thing he wants, the removal process. Is any of this possible? I need help figuring out where to look, as Google is not giving me what I want, probably because I am searching for it wrong.

Thank you!

1 answer

Show some of the code you use so we can give you more useful tips. Also use the nlp tag; the tags you used are fairly specific and not followed by many people, so your question might be hard for the relevant users to find.

By the whole word-removal thing, do you mean stop words too? Or did you already remove those? Stop words are very common words ("the", "it", "me", etc.) that often rank high in lists of the most frequent words but carry no real meaning for finding topics.

First, remove the stop words to make the list of most common words more useful.
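A minimal sketch of that step, using only the standard library. The stop word set here is a tiny made-up list for illustration; in practice you would probably swap in a full list such as NLTK's `stopwords.words("english")`.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "it", "me", "a", "an", "of", "and", "in", "to", "is", "we", "this"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word set."""
    return [tok for tok in tokens if tok not in stop_words]


abstract = "In this paper we present the results of a topic model."
tokens = remove_stop_words(tokenize(abstract))
print(tokens)
```

The filtered token lists are what you would then feed into the dictionary and corpus for the LDA model.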

Then, as he requested, look at which of the more common words are common in ALL the topics and remove those. I can imagine that for abstracts this is stuff like "hypothesis", "research", "paper", "results", etc.: words that are specific to abstracts but not useful for telling topics apart. I can imagine that for this kind of analysis, as well as for the initial LDA, it makes sense to use all the data from all years, so the model has a large amount of data to recognize patterns in. But you should try out the variations and see whether the per-year or overall versions give you nicer results.
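One way the removal rule could be sketched, assuming you already have a word-to-weight mapping per topic (with Gensim, those pairs could be collected from `LdaModel.show_topic`). The `topics` data, the `filter_topic_words` helper, and the `min_weight` threshold are all made up for illustration; you would have to tune the threshold on your own model.

```python
# Hypothetical topic -> {word: weight} mapping, standing in for real LDA output.
topics = {
    0: {"research": 0.09, "results": 0.08, "protein": 0.07, "cell": 0.05, "network": 0.001},
    1: {"research": 0.10, "results": 0.07, "network": 0.06, "learning": 0.05, "cell": 0.002},
    2: {"research": 0.08, "results": 0.09, "climate": 0.07, "carbon": 0.04},
}


def filter_topic_words(topics, min_weight=0.01):
    """Split words into globally common ones and per-topic relevant ones.

    A word counts as frequent in a topic if its weight is >= min_weight
    (an assumed threshold). Words frequent in every topic are dropped
    everywhere; other words are kept only in topics where they are frequent.
    """
    frequent = {
        tid: {w for w, wt in words.items() if wt >= min_weight}
        for tid, words in topics.items()
    }
    common_everywhere = set.intersection(*frequent.values())
    cleaned = {
        tid: {w: topics[tid][w] for w in sorted(frequent[tid] - common_everywhere)}
        for tid in topics
    }
    return common_everywhere, cleaned


common, cleaned = filter_topic_words(topics)
print(sorted(common))
for tid, words in cleaned.items():
    print(tid, words)
```

Each cleaned mapping can then serve as the frequency input for that topic's word cloud, for example via `WordCloud.generate_from_frequencies` from the `wordcloud` package.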

After you have your global word lists per topic, go back to the original data (split up by year) and count how often the combined words from each topic occur per year. If you view this over the years, you can probably see trends, such as topics that have been popular in the last few years but were not relevant if you go back far enough.
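A rough sketch of the per-year counting, assuming the abstracts are already tokenized and grouped by year. The `topic_words` and `docs_by_year` data and the `topic_share_by_year` helper are hypothetical. Dividing by the total token count per year is my own choice, so that years with more abstracts do not automatically look more "popular".

```python
from collections import Counter

# Hypothetical cleaned word lists per topic.
topic_words = {
    "topic_0": {"protein", "cell"},
    "topic_1": {"network", "learning"},
}

# Hypothetical tokenized abstracts grouped by year.
docs_by_year = {
    2019: [["protein", "cell", "growth"], ["cell", "network"]],
    2020: [["network", "learning", "learning"], ["protein", "learning"]],
}


def topic_share_by_year(docs_by_year, topic_words):
    """Return {year: {topic: share of that year's tokens that belong to the topic}}."""
    shares = {}
    for year, docs in docs_by_year.items():
        counts = Counter(tok for doc in docs for tok in doc)
        total = sum(counts.values())
        shares[year] = {
            topic: (sum(counts[w] for w in words) / total) if total else 0.0
            for topic, words in topic_words.items()
        }
    return shares


shares = topic_share_by_year(docs_by_year, topic_words)
for year in sorted(shares):
    print(year, shares[year])
```

Plotting these shares per topic over the years should make rising and fading topics visible.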

The last thing you mentioned (automatically assigning labels to topics) is actually quite tricky, depending on how you go about it.

The "easy" way would be, e.g., to just use the most frequent word in each topic as its label, but the results will probably be underwhelming.
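For what it's worth, the "easy" variant could look like the sketch below. The `topics` mapping and the `label_topics` helper are made up for illustration; joining the top few words instead of only one is a small tweak of mine that sometimes reads a bit better.

```python
# Hypothetical topic -> {word: weight} mapping after removing shared words.
topics = {
    0: {"protein": 0.07, "cell": 0.05},
    1: {"network": 0.06, "learning": 0.05},
}


def label_topics(topics, n_words=1):
    """Label each topic with its n_words highest-weighted words joined by '_'."""
    labels = {}
    for tid, words in topics.items():
        top = sorted(words, key=words.get, reverse=True)[:n_words]
        labels[tid] = "_".join(top)
    return labels


print(label_topics(topics))
print(label_topics(topics, n_words=2))
```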

A more advanced approach is Topic Labeling. Or you can try an approach like modified text summarization using more powerful models.