
scikit-learn - Should I fit model with TF or TF-IDF?

Original post 2016-10-21 07:55:59 8 1 python / scikit-learn / tf-idf / matrix-factorization / latent-semantic-indexing

I am trying to find the best way to fit different topic models (such as Latent Dirichlet Allocation, Non-negative Matrix Factorization, etc.) in sklearn (Python).

Looking at the example in the sklearn documentation, I was wondering why the LDA model is fit on a TF array, while the NMF model is fit on a TF-IDF array. Is there a precise reason for this choice?

Here is the example: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py

Also, any tips on how to find the best parameters (number of iterations, number of topics, ...) for fitting my models are welcome.

Thank you in advance.

1 answer

To answer this clearly, one must first look at the definitions of the two models.

LDA is a probabilistic generative model that generates documents by sampling a topic for each word and then a word from the sampled topic. The generated document is represented as a bag of words.
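The generative story above can be sketched in a few lines of plain Python. This is only an illustration: the vocabulary, the two topic-word distributions, the Dirichlet parameter `alpha`, and the helper names `sample_dirichlet` and `generate_document` are all made up for the example and are not part of scikit-learn.

```python
import random
from collections import Counter

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, n_words, alpha, rng):
    """Generate one bag-of-words document following the LDA generative story."""
    n_topics = len(topic_word)
    # Per-document mixture over topics.
    theta = sample_dirichlet([alpha] * n_topics, rng)
    words = []
    for _ in range(n_words):
        # Sample a topic for this word position, then a word from that topic.
        z = rng.choices(range(n_topics), weights=theta)[0]
        w = rng.choices(vocab, weights=topic_word[z])[0]
        words.append(w)
    # Word order is discarded: the document is just a table of integer counts.
    return Counter(words)

rng = random.Random(0)
vocab = ["goal", "match", "team", "vote", "party", "law"]  # hypothetical vocabulary
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # hypothetical "sports" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # hypothetical "politics" topic
]
doc_counts = generate_document(topic_word, vocab, n_words=20, alpha=0.5, rng=rng)
print(doc_counts)
```

Every value in `doc_counts` is an integer, which is the point: the model emits whole words, so the data it describes are counts.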

NMF, in its general definition, is the search for two matrices W and H such that W*H ≈ V, where V is an observed matrix (at low rank the product can usually only approximate V). The only requirement on these matrices is that all of their elements be non-negative.
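As a minimal sketch of this definition, the snippet below factorizes a small random non-negative matrix with scikit-learn's `NMF`. The matrix `V`, its shape, and the choice of two components are arbitrary assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
V = rng.rand(6, 5)  # arbitrary non-negative "observed" matrix

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)   # shape (6, 2)
H = model.components_        # shape (2, 5)

# Both factors are non-negative, and their product approximates V.
print(W.shape, H.shape)
print(bool((W >= 0).all() and (H >= 0).all()))
print(np.linalg.norm(V - W @ H))  # Frobenius norm of the residual
```

Nothing here requires `V` to hold integers; any non-negative real values are acceptable input.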

From these definitions it is clear that LDA should only be given bag-of-words frequency counts, since a vector of reals makes no sense for it: did we generate a word 1.2 times? NMF, on the other hand, can use any non-negative representation, and in the example tf-idf is used.
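The pairing described above (raw counts for LDA, tf-idf for NMF) might look like the following sketch. The four toy documents and the choice of two topics are assumptions for illustration; the linked scikit-learn example uses a real corpus and different settings.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for this example.
docs = [
    "the team won the match with a late goal",
    "the striker scored a goal for the team",
    "parliament passed the law after a long vote",
    "the party lost the vote on the new law",
]

# LDA is fit on raw term counts (integers), matching its generative model.
tf_vectorizer = CountVectorizer(stop_words="english")
tf = tf_vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic_lda = lda.fit_transform(tf)

# NMF only needs non-negative input, so tf-idf weights are fine.
tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vectorizer.fit_transform(docs)
nmf = NMF(n_components=2, init="nndsvda", random_state=0)
doc_topic_nmf = nmf.fit_transform(tfidf)

print(tf.dtype, tfidf.dtype)
print(doc_topic_lda.shape, doc_topic_nmf.shape)
```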

As for choosing the number of iterations: for NMF in scikit-learn I don't know the exact stopping criterion, although I believe it stops when the relative improvement of the loss function falls below a threshold, so you'll have to experiment. For LDA, I suggest manually checking the improvement of the log-likelihood on a held-out validation set and stopping when it falls below a threshold.
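One way to do the manual held-out check for LDA is sketched below. It assumes a synthetic Poisson count matrix in place of a real TF matrix, an arbitrary relative threshold `tol`, and one `partial_fit` pass over the training set per epoch. `score` returns scikit-learn's approximate log-likelihood, so treat the stopping rule as a heuristic rather than an exact criterion.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(60, 30))  # synthetic counts standing in for a TF matrix
X_train, X_val = X[:45], X[45:]

lda = LatentDirichletAllocation(
    n_components=3,
    total_samples=X_train.shape[0],  # corpus size used by the online update
    random_state=0,
)

tol = 1e-3        # assumed relative-improvement threshold
max_epochs = 50   # hard cap so the loop always terminates
history = []
prev_score = None
for epoch in range(max_epochs):
    lda.partial_fit(X_train)      # one more pass over the training data
    score = lda.score(X_val)      # approximate held-out log-likelihood
    history.append(score)
    # Stop once the held-out score improves by less than tol (relative).
    if prev_score is not None and (score - prev_score) < tol * abs(prev_score):
        break
    prev_score = score

print(len(history), history[-1])
```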

The rest of the parameters depend heavily on the data, so, as @rpd also suggested, I recommend doing a parameter search.
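A parameter search could be sketched with `GridSearchCV`, which falls back on the estimator's own `score` method (for LDA, the approximate log-likelihood on each held-out fold). The synthetic count matrix and the grid values below are assumptions for illustration only; a higher held-out likelihood does not guarantee more interpretable topics.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(60, 30))  # synthetic counts standing in for a TF matrix

# Hypothetical grid; the right ranges depend on the corpus.
param_grid = {
    "n_components": [2, 4, 6],
    "learning_decay": [0.5, 0.7],
}

search = GridSearchCV(
    LatentDirichletAllocation(max_iter=10, random_state=0),
    param_grid,
    cv=3,  # each fold is scored with LatentDirichletAllocation.score
)
search.fit(X)

print(search.best_params_)
```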

So, to sum up: LDA can only generate frequencies (word counts), whereas NMF can generate any non-negative matrix.




solution1 3 2016-10-24 17:24:39
