
How to improve topic model of gensim

I want to extract topics from articles; the test article is "https://julien.danjou.info/blog/2017/announcing-scaling-python".

It's an article about Python and scaling. I've tried LSI and LDA; most of the time, LDA seems to work better. But the output of both of them isn't stable.

Of course, the first three or five keywords seem to hit the target: "python", "book", "project" (I don't think "project" is a useful topic and will drop it via the stopword list). "scaling", "scalable", or "openstack" should also be in the keyword list, but they don't show up consistently at all.

A topic list and a stopword list might improve the results, but that approach doesn't scale: I would have to maintain a different list for each domain.

So the question is: is there any better solution to improve the results?

num_topics = 1
num_words = 10
passes = 20
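
The question does not show how the models themselves are built. A minimal sketch of a typical gensim setup using these parameters, assuming a single pre-tokenized, stopword-filtered article stored in a placeholder tokens list (not part of the original question), might look like this:

from gensim.corpora import Dictionary
from gensim.models import LdaModel, LsiModel

# tokens: the tokenized, stopword-filtered article (placeholder; not shown in the question)
tokens = ["python", "book", "scaling", "openstack", "project"]

dictionary = Dictionary([tokens])
corpus = [dictionary.doc2bow(tokens)]

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=num_topics, passes=passes)
lsi = LsiModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)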

LDA model demo code; the LSI code is the same.

lda_list = []  # collected topic terms
for topic in lda.print_topics(num_words=num_words):
    termNumber = topic[0]
    print(termNumber, ':', sep='')
    # topic[1] is a string like '0.041*"python" + 0.036*"book" + ...'
    listOfTerms = topic[1].split('+')
    for term in listOfTerms:
        listItems = term.split('*')
        print('  ', listItems[1], '(', listItems[0], ')', sep='')
        lda_list.append(listItems[1])
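
As a side note, the string parsing above is somewhat brittle. On a reasonably recent gensim version, show_topics(formatted=False) returns the (word, probability) pairs directly; a small sketch, reusing the same lda object and lists:

for topic_id, terms in lda.show_topics(num_topics=num_topics,
                                        num_words=num_words, formatted=False):
    print(topic_id, ':', sep='')
    for word, prob in terms:
        # prob is already a float, so no string splitting is needed
        print('  ', word, ' (', round(float(prob), 3), ')', sep='')
        lda_list.append(word)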

Test Result 1

Dictionary(81 unique tokens: ['dig', 'shoot', 'lot', 'world', 'possible']...)
# lsi result
0:
  "python" (0.457)
  "book" ( 0.391)
  "project" ( 0.261)
  "like" ( 0.196)
  "application" ( 0.130)
  "topic" ( 0.130)
  "new" ( 0.130)
  "openstack" ( 0.130)
  "way" ( 0.130)
  "decided"( 0.130)

# lda result
0:
  "python" (0.041)
  "book" ( 0.036)
  "project" ( 0.026)
  "like" ( 0.021)
  "scalable" ( 0.015)
  "turn" ( 0.015)
  "working" ( 0.015)
  "openstack" ( 0.015)
  "scaling" ( 0.015)
  "different"( 0.015)

Test Result 2

Dictionary(81 unique tokens: ['happy', 'idea', 'tool', 'new', 'shoot']...)
# lsi result
0:
  "python" (0.457)
  "book" ( 0.391)
  "project" ( 0.261)
  "like" ( 0.196)
  "scaling" ( 0.130)
  "application" ( 0.130)
  "turn" ( 0.130)
  "working" ( 0.130)
  "openstack" ( 0.130)
  "topic"( 0.130)
# lda result
0:
  "python" (0.041)
  "book" ( 0.036)
  "project" ( 0.026)
  "like" ( 0.021)
  "decided" ( 0.015)
  "different" ( 0.015)
  "turn" ( 0.015)
  "writing" ( 0.015)
  "working" ( 0.015)
  "application"( 0.015)

If I understand correctly, you have an article and want your model to explain to you what it is about.

But unless I have misunderstood something, you train your LDA model on that one single document with one topic. So you are not really extracting topics, since you only have one topic. I don't think that's how LDA was intended to be used. Generally you will want to train your model on a large corpus (a collection of documents), such as all English Wikipedia articles or all articles from a journal over the past 60 years, using a two- or three-digit number of topics. That's typically when LDA starts to gain power.

Often when I try to "understand" a document through its topic distribution, I will train the model on a large corpus that is not necessarily directly connected to the document I am trying to query. That is especially useful when your documents are few and/or short, as in your case.

If you expect your document to be diverse in topics, you could train LDA on the English Wikipedia (that gives you topics ranging from ['apple', 'banana', ...] to ['regression', 'probit', ...]).
If you know that all the documents you want to query lie in a particular field, training LDA on a corpus from that field will probably give much better results, because the topics related to the field will be separated much more precisely. In your case, you could train an LDA model on several dozen or hundreds of Python-related books and articles. But it all depends on your goals.
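
To make that concrete, here is a rough sketch of that workflow with gensim: train on a domain corpus with a larger topic count, then query the single article against the trained model. The corpus, the article tokens, and the topic count below are all placeholders for illustration.

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# placeholder domain corpus: in practice, dozens or hundreds of tokenized Python-related articles/books
train_texts = [
    ["python", "scaling", "openstack", "distributed", "queue", "worker"],
    ["python", "book", "writing", "chapter", "review", "publisher"],
    ["flask", "django", "web", "application", "deployment", "python"],
    # ... many more documents in practice
]

dictionary = Dictionary(train_texts)
corpus = [dictionary.doc2bow(text) for text in train_texts]

# a two-digit topic count, as suggested above (tune for your corpus size)
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=10)

# query the single blog post against the trained model
article_tokens = ["python", "book", "scaling", "openstack", "project"]  # placeholder tokenization
article_bow = dictionary.doc2bow(article_tokens)
for topic_id, prob in lda.get_document_topics(article_bow):
    print(topic_id, round(float(prob), 3), lda.print_topic(topic_id, topn=5))

With a real corpus behind the model, the per-topic word lists should also be far more stable across runs than the single-document results shown above.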

Then you can always play around with the number of topics. For very large corpora you can try 100, 200, or even 1000 topics; for smaller ones, maybe 5 or 10.
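
If you want something less ad hoc than eyeballing, gensim's CoherenceModel can be used to compare candidate topic counts. A sketch, reusing dictionary, corpus, and train_texts from the previous snippet:

from gensim.models import CoherenceModel, LdaModel

# fit one candidate model per topic count and compare their c_v coherence scores
for k in (5, 10, 20, 50):
    candidate = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=10)
    coherence = CoherenceModel(model=candidate, texts=train_texts,
                               dictionary=dictionary, coherence='c_v').get_coherence()
    print(k, coherence)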
