简体   繁体   中英

scikit-learn - Should I fit model with TF or TF-IDF?

I am trying to find out the best way to fit different probabilistic models (like Latent Dirichlet Allocation, Non-negative Matrix Factorization, etc) on sklearn (Python).

Looking at the example in the sklearn documentation, I was wondering why the LDA model is fit on a TF array, while the NMF model is fit on a TF-IDF array. Is there a precise reason for this choice?

Here is the example: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py

Also, any tips about how to find the best parameters (number of iterations, number of topics...) for fitting my models is well accepted.

Thank you in advance.

To make the answer clear one must first examine the definitions of the two models.

LDA is a probabilistic generative model that generates documents by sampling a topic for each word and then a word from the sampled topic. The generated document is represented as a bag of words.

NMF is in its general definition the search for 2 matrices W and H such that W*H=V where V is an observed matrix. The only requirement for those matrices is that all their elements must be non negative.

From the above definitions it is clear that in LDA only bag of words frequency counts can be used since a vector of reals makes no sense. Did we create a word 1.2 times? On the other hand we can use any non negative representation for NMF and in the example tf-idf is used.

As far as choosing the number of iterations, for the NMF in scikit learn I don't know the stopping criterion although I believe it is the relative improvement of the loss function being smaller than a threshold so you 'll have to experiment. For LDA I suggest checking manually the improvement of the log likelihood in a held out validation set and stopping when it falls under a threshold.

The rest of the parameters depend heavily on the data so I suggest, as suggested by @rpd, that you do a parameter search.

So to sum up, LDA can only generate frequencies and NMF can generate any non negative matrix .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM