
Topic Modeling Using Gensim in Python

Original post: 2014-12-05 03:10:43 | Tags: python, machine-learning, nlp, lda, gensim

I have a list of bags of words for two classes: say, n items in class A and m items in class B. I want to use topic modeling with the gensim package (for LDA) in Python to train a model for class A vs. class B. I am new to both topic modeling and Python. Does anyone know how I should do this? I mean, should I merge all the bags for each class and then use gensim, or should I use the bag for each item separately? Thanks!

1 Answer

If I understand you correctly, you want to compare documents from two sources.

One way to do this with Gensim would be:

1. Create a bag-of-words corpus from all documents (A and B); roughly, convert the texts into a documents-by-vocabulary matrix of word counts.
2. Train an LDA model on that corpus (roughly, find the topics).
3. Convert the corpus to LDA space (roughly, determine which topics are relevant for each document).

Now you can inspect the topic distribution of each document and determine how similar two documents are using Gensim's similarity methods.

For details, take a look at Gensim's tutorials. The only modification you'd need to make is to combine your documents from A and B into one larger corpus and record which indices belong to which class, so that you can compare them easily later.
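
One simple way to keep track of which documents came from which class is to concatenate the two lists and store the index ranges alongside. The names docs_a, docs_b, labels, idx_a and idx_b below are hypothetical, chosen only for this sketch.

```python
# Hypothetical tokenized documents for the two classes.
docs_a = [["cat", "dog"], ["dog", "pet"], ["vet", "pet"]]   # n = 3 items
docs_b = [["stock", "market"], ["bank", "loan"]]            # m = 2 items

# Combine into a single list; class A occupies the first n positions.
texts = docs_a + docs_b

# Remember the origin of every position in the combined list.
labels = ["A"] * len(docs_a) + ["B"] * len(docs_b)
idx_a = list(range(0, len(docs_a)))
idx_b = list(range(len(docs_a), len(texts)))

# Later, any per-document result (e.g. a topic distribution)
# can be split back by class using these indices.
print(labels)
print(idx_a, idx_b)
```

Because the bag-of-words corpus and the LDA output keep the same document order as texts, the same indices can be used to select the topic distributions of class A or class B.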

However, depending on your data and your goal, other variants of LDA (such as correlated topic models) may be more suitable.