Number of Latent Semantic Indexing topics

I'm using the gensim package to run LSI on a corpus. My goal is to find the distinct topics that occur most frequently in the corpus.

If I don't know how many topics the corpus contains (I'd estimate anywhere from 5 to 20), what is the best way to set the number of topics that LSI should search for? Is it better to look for a large number of topics (20-30) or a small number (~5)?

From Radim himself:

That's a good question, but unfortunately one without a good answer.

It is not true that increasing the number of dimensions always improves retrieval accuracy. In fact, if you use all the dimensions (= full rank of the training matrix), LSI will give you back exactly the same documents that you put in, so LSI would become pointless.

If you're interested in the math side of it, have a look at this issue: https://github.com/piskvorky/gensim/issues/28 Otherwise, just set the number of dimensions to somewhere between a few hundred and a thousand, which is the accepted standard. Or try several different choices, measure the accuracy, and select the dimensionality that works best on your problem.

Best, Radim
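To make the full-rank point concrete, here is a minimal sketch using plain NumPy rather than gensim. The toy term-document matrix `X` and the helper names `reconstruct` and `relative_error` are made up for illustration. It computes a truncated SVD for every rank k and shows that the reconstruction error falls to (numerically) zero once k reaches the full rank, which is the case where LSI just hands the input back.

```python
import numpy as np

# Hypothetical toy term-document matrix (rows = terms, columns = documents).
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
    [1, 0, 0, 2],
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)


def reconstruct(k):
    """Rank-k approximation of X built from the top k singular triplets."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


def relative_error(k):
    """Frobenius-norm error of the rank-k approximation, relative to X."""
    return np.linalg.norm(X - reconstruct(k)) / np.linalg.norm(X)


errors = {k: relative_error(k) for k in range(1, len(s) + 1)}
for k, err in errors.items():
    print("k={0}: relative reconstruction error {1:.3f}".format(k, err))
```

The error can only shrink as k grows, so reconstruction error alone will always favor more dimensions. That is why the choice has to be judged against a downstream measure such as retrieval accuracy.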

This is what I sometimes do when I'm unsure. Since you've already narrowed the number of topics down to somewhere between 5 and 20, you can iterate over some of those values and see which one fits best.

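# Set N_TOPICS to a candidate value (e.g. anywhere from 5 to 20) and train the model with it first.
# `lda` is an already-trained gensim model; an LsiModel exposes the same show_topics() call.
for count, topic in enumerate(lda.show_topics(num_topics=N_TOPICS, num_words=20, log=False, formatted=True)):
    print("TOPIC {0}: {1}\n".format(count, topic))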
