Topic Modeling Using Gensim in Python

I have a list of bags of words for two classes: say, n items in class A and m items in class B. I want to use topic modeling (LDA) with the gensim package in Python to train a model for class A vs. class B. I am new to both topic modeling and Python. Does anyone know how I should do this? That is, should I merge all the bags within each class and then use gensim, or should I pass the bag for each item separately? Thanks!

If I understand you correctly, you want to compare documents from two sources.

One way to do this with Gensim would be:

  • create a bag-of-words corpus from all documents (A and B), i.e. convert each text to a sparse vector of (word id, count) pairs
  • train an LDA model on that corpus (~ find the topics)
  • convert the corpus to LDA space (~ determine which topics are relevant for each document)

Now you can inspect the topic distribution of each document and determine how similar two documents are using Gensim's similarity methods.

For details, take a look at Gensim's tutorials. The only modification you'd need to make is to combine your documents from A and B into one larger corpus and record which indices belong to which class, so that you can compare them easily later.
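
Keeping track of the indices can be as simple as a parallel list of labels. The following sketch uses plain Python and assumes the items are already bags of words in Gensim's `(token_id, count)` format. The variable names and the toy values are made up for illustration.

```python
# Hypothetical bag-of-words items: one list of (token_id, count) pairs per item.
bows_a = [[(0, 1), (1, 2)], [(1, 1), (2, 1)], [(0, 3)]]  # n = 3 items in class A
bows_b = [[(3, 1), (4, 1)], [(2, 1), (4, 2)]]            # m = 2 items in class B

# Combine both classes into one corpus, keeping a parallel list of class labels.
corpus = bows_a + bows_b
labels = ["A"] * len(bows_a) + ["B"] * len(bows_b)

# Positions in the combined corpus that belong to each class.
idx_a = [i for i, label in enumerate(labels) if label == "A"]
idx_b = [i for i, label in enumerate(labels) if label == "B"]

print(idx_a)
print(idx_b)
```

After training on `corpus`, the topic vector at position `i` belongs to the class given by `labels[i]`.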

However, depending on your data and your goal, other topic models related to LDA (such as correlated topic models) may be more suitable.
