scikit-learn - Should I fit model with TF or TF-IDF?

I am trying to find out the best way to fit different probabilistic models (like Latent Dirichlet Allocation, Non-negative Matrix Factorization, etc.) in scikit-learn (Python).

Looking at the example in the sklearn documentation, I was wondering why the LDA model is fit on a TF array, while the NMF model is fit on a TF-IDF array. Is there a precise reason for this choice?

Here is the example: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py
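
For context, a condensed sketch of the pattern that example follows (the 20 newsgroups corpus and the vectorizer settings here are illustrative stand-ins, not the example's exact code):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

docs = fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data[:2000]

# LDA is fit on raw term frequencies (integer counts)
tf = CountVectorizer(max_features=1000, stop_words='english').fit_transform(docs)
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(tf)

# NMF is fit on tf-idf weights (non-negative reals)
tfidf = TfidfVectorizer(max_features=1000, stop_words='english').fit_transform(docs)
nmf = NMF(n_components=10, random_state=0).fit(tfidf)
```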

Also, any tips on how to find the best parameters (number of iterations, number of topics, ...) for fitting my models would be much appreciated.

Thank you in advance.

To make the answer clear, one must first examine the definitions of the two models.

LDA is a probabilistic generative model that generates documents by sampling a topic for each word and then a word from the sampled topic. The generated document is represented as a bag of words.

NMF, in its general definition, is the search for two matrices W and H such that W * H = V, where V is an observed matrix. The only requirement on those matrices is that all their elements must be non-negative.
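
As a minimal illustration of that definition (a small random non-negative matrix stands in for real data):

```python
import numpy as np
from sklearn.decomposition import NMF

# V: any non-negative matrix, e.g. 6 documents x 5 terms of tf-idf weights
V = np.abs(np.random.RandomState(0).randn(6, 5))

nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(V)   # 6 x 2 document-topic weights
H = nmf.components_        # 2 x 5 topic-term weights

# W @ H approximates V; the reconstruction error measures the fit
print(np.linalg.norm(V - W @ H))
```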

From the above definitions it is clear that in LDA only bag-of-words frequency counts can be used, since a vector of reals makes no sense: did we create a word 1.2 times? On the other hand, we can use any non-negative representation for NMF, and in the example tf-idf is used.

As far as choosing the number of iterations: for NMF in scikit-learn I don't know the stopping criterion, although I believe it stops when the relative improvement of the loss function falls below a threshold, so you'll have to experiment. For LDA I suggest manually checking the improvement of the log-likelihood on a held-out validation set and stopping when it falls below a threshold.
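
A rough sketch of what I mean for LDA, assuming a count matrix like the one in the question's example; note that scikit-learn's score() returns an approximate (variational bound on the) log-likelihood, so treat the values as relative, not exact:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

docs = fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data[:2000]
tf = CountVectorizer(max_features=1000, stop_words='english').fit_transform(docs)
tf_train, tf_valid = train_test_split(tf, test_size=0.2, random_state=0)

for max_iter in (5, 10, 20, 40):
    lda = LatentDirichletAllocation(n_components=10, max_iter=max_iter,
                                    random_state=0).fit(tf_train)
    # Higher is better; stop raising max_iter once the held-out gain
    # becomes negligible.
    print(max_iter, lda.score(tf_valid))
```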

The rest of the parameters depend heavily on the data, so, as @rpd suggested, I recommend doing a parameter search.
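
For example, with GridSearchCV (the candidate values below are arbitrary; the search ranks them by LDA's built-in score(), i.e. the approximate log-likelihood, via cross-validation):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

docs = fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data[:2000]
tf = CountVectorizer(max_features=1000, stop_words='english').fit_transform(docs)

# learning_decay only has an effect with the online learning method
params = {'n_components': [5, 10, 20], 'learning_decay': [0.5, 0.7, 0.9]}
search = GridSearchCV(
    LatentDirichletAllocation(learning_method='online', random_state=0),
    params, cv=3)
search.fit(tf)  # unsupervised: no y needed
print(search.best_params_)
```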

So to sum up: LDA can only generate frequencies, and NMF can generate any non-negative matrix.
