
How to improve a gensim topic model

I want to extract topics from articles; the test article is https://julien.danjou.info/blog/2017/announcing-scaling-python.

It's an article about Python and scaling. I've tried LSI and LDA; most of the time, LDA seems to work better, but the output of both is unstable.

Of course, the first three to five keywords seem to hit the target: "python", "book", "project" (I don't think "project" is a useful topic, and I will add it to the stopword list). "scaling", "scalable", or "openstack" should be in the keyword list, but they don't appear consistently.

Tuning the topic list and stopword list might improve the results, but that doesn't scale: I would have to maintain a different list for every domain.

So the question is: is there a better way to improve the results?

num_topics = 1
num_words = 10
passes = 20

Demo code for the LDA model; the LSI code is the same.

lda_list = []
# print_topics returns (topic_id, 'weight*"word" + ...') pairs
for topic_id, terms in lda.print_topics(num_words=num_words):
    print(topic_id, ':', sep='')
    for term in terms.split('+'):
        weight, word = term.split('*')  # e.g. ' 0.041', '"python" '
        print('  ', word, '(', weight, ')', sep='')
        lda_list.append(word)
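For completeness, the model itself was presumably built along these lines (a minimal sketch with assumed variable names; texts holds the single tokenised, stopword-filtered article and is not shown in the question):

from gensim import corpora, models

# texts: one tokenised, stopword-filtered document, e.g.
# texts = [['python', 'book', 'scaling', ...]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(corpus, id2word=dictionary,
                      num_topics=num_topics, passes=passes)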

Test Result 1

Dictionary(81 unique tokens: ['dig', 'shoot', 'lot', 'world', 'possible']...)
# lsi result
0:
  "python" (0.457)
  "book" ( 0.391)
  "project" ( 0.261)
  "like" ( 0.196)
  "application" ( 0.130)
  "topic" ( 0.130)
  "new" ( 0.130)
  "openstack" ( 0.130)
  "way" ( 0.130)
  "decided"( 0.130)

# lda result
0:
  "python" (0.041)
  "book" ( 0.036)
  "project" ( 0.026)
  "like" ( 0.021)
  "scalable" ( 0.015)
  "turn" ( 0.015)
  "working" ( 0.015)
  "openstack" ( 0.015)
  "scaling" ( 0.015)
  "different"( 0.015)

Test Result 2

Dictionary(81 unique tokens: ['happy', 'idea', 'tool', 'new', 'shoot']...)
# lsi result
0:
  "python" (0.457)
  "book" ( 0.391)
  "project" ( 0.261)
  "like" ( 0.196)
  "scaling" ( 0.130)
  "application" ( 0.130)
  "turn" ( 0.130)
  "working" ( 0.130)
  "openstack" ( 0.130)
  "topic"( 0.130)
# lda result
0:
  "python" (0.041)
  "book" ( 0.036)
  "project" ( 0.026)
  "like" ( 0.021)
  "decided" ( 0.015)
  "different" ( 0.015)
  "turn" ( 0.015)
  "writing" ( 0.015)
  "working" ( 0.015)
  "application"( 0.015)

If I understand correctly, you have an article and want your model to explain to you what it is about.

But unless I misunderstood something, you are training your LDA model on that one single document, with one topic. So you are not really extracting topics, since you only have one. I don't think that's how LDA was intended to be used. Generally, you want to train your model on a large corpus (a collection of documents), like all English Wikipedia articles or all articles from a journal over the past 60 years, with a topic count in the two- or three-digit range. That's typically when LDA starts to gain power.

Often, when I try to "understand" a document through its topic distribution, I train the model on a large corpus that is not necessarily directly connected to the document I am querying. That is especially useful when your documents are few and/or short, as in your case.
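As a rough sketch of that workflow (the corpus and variable names here are illustrative, not taken from the question):

from gensim import corpora, models

# train_texts: many tokenised documents from a large background corpus
dictionary = corpora.Dictionary(train_texts)
corpus = [dictionary.doc2bow(text) for text in train_texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=100, passes=10)

# Infer the topic mixture of the single article without retraining
article_bow = dictionary.doc2bow(article_tokens)
for topic_id, prob in lda.get_document_topics(article_bow):
    print(topic_id, prob, lda.print_topic(topic_id, topn=5))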

If you expect your documents to be diverse in topics, you could train LDA on the English Wikipedia (that gives you topics ranging from ['apple', 'banana', ...] to ['regression', 'probit', ...]).
If you know that all the documents you want to query lie in a particular field, training LDA on a corpus from that field will likely give much better results, because the topics related to the field will be separated much more precisely. In your case, you could train an LDA model on several dozen or several hundred Python-related books and articles. But it all depends on your goals.
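If you go the Wikipedia route, gensim ships a WikiCorpus helper that streams articles straight from a dump file. Roughly (the dump filename is a placeholder, and preprocessing a full dump takes hours):

from gensim.corpora import WikiCorpus, MmCorpus
from gensim.models import LdaModel

# Stream bag-of-words vectors from a Wikipedia XML dump (placeholder filename)
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2')
MmCorpus.serialize('wiki_bow.mm', wiki)  # cache the vectors on disk

lda = LdaModel(MmCorpus('wiki_bow.mm'), id2word=wiki.dictionary,
               num_topics=200, passes=1)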

Then you can always play around with the number of topics. For very large corpora you can try 100, 200, or even 1000 topics; for smaller ones, maybe 5 or 10.
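Rather than guessing, you can also score candidate topic counts with topic coherence. A minimal sketch using gensim's CoherenceModel (dictionary, corpus, and train_texts as in the earlier sketch):

from gensim.models import CoherenceModel, LdaModel

best = None
for k in (5, 10, 20, 50, 100):
    model = LdaModel(corpus, id2word=dictionary, num_topics=k, passes=10)
    cm = CoherenceModel(model=model, texts=train_texts,
                        dictionary=dictionary, coherence='c_v')
    score = cm.get_coherence()
    print(k, score)
    if best is None or score > best[1]:
        best = (k, score)  # keep the best-scoring topic count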
