
Finding Tf-Idf Scores of only selected words from set of documents using scikit-learn

I have a set of documents (stored as .txt files). I also have a Python dictionary of some selected words. I want to assign tf-idf scores only to these words, not to every word in the documents. How can this be done using scikit-learn or any other library?

I have referred to this blog post, but it gives scores for the full vocabulary.

You can do it with CountVectorizer, which tokenizes the documents and converts them into a document-term matrix of counts, followed by TfidfTransformer applied to that matrix.
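The answer does not say how to limit the scores to the selected words. One way, sketched here, is the `vocabulary` parameter of CountVectorizer, which fixes the columns of the count matrix. The documents and the `selected_words` dictionary below are hypothetical stand-ins for the .txt files and the asker's dictionary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical stand-ins for the contents of the .txt files.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Hypothetical dictionary of selected words; only its keys are used.
selected_words = {"cat": "animal", "dog": "animal", "mat": "object"}

# Step 1: count only the selected words. A fixed vocabulary means every
# other token is ignored; columns follow the order of the list passed in.
counter = CountVectorizer(vocabulary=list(selected_words))
counts = counter.fit_transform(docs)

# Step 2: turn the raw counts into tf-idf weights.
tfidf = TfidfTransformer().fit_transform(counts)

print(counter.vocabulary_)  # term -> column index
print(tfidf.toarray())      # one row per document, one column per selected word
```

Matching is exact after tokenization, so "cats" and "dogs" in the third document do not count towards "cat" and "dog". Also note that with the default L2 normalization each row is normalized over the selected words only, so the values differ from the same columns of a full-vocabulary tf-idf matrix.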

These two steps can also be combined into one with TfidfVectorizer.
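A minimal sketch of the combined form, again assuming the selected words are passed through the `vocabulary` parameter (an addition to the answer, not something it states). The documents and word list are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the contents of the .txt files.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
selected = ["cat", "dog", "mat"]  # hypothetical selected words

# Counting and tf-idf weighting in a single step, limited to `selected`.
vectorizer = TfidfVectorizer(vocabulary=selected)
scores = vectorizer.fit_transform(docs)

# Column names in column order, recovered from the term -> index mapping.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Collect the non-zero scores of each document into a plain dict.
per_document = []
for row in scores.toarray():
    per_document.append({t: float(s) for t, s in zip(terms, row) if s > 0})

for i, doc_scores in enumerate(per_document):
    print(i, doc_scores)
```

To read the .txt files directly, TfidfVectorizer also accepts `input="filename"`, in which case `fit_transform` is given a list of file paths instead of strings.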

All of these classes are in the sklearn.feature_extraction.text module.

Both approaches return the same sparse matrix, which I presume you will then reduce with TruncatedSVD to get a smaller dense matrix.
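The following sketch checks that equivalence on a tiny hypothetical corpus with default settings, then applies TruncatedSVD. The number of components (2) and the `random_state` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical stand-ins for the contents of the .txt files.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
    "the mat was chewed by the dog",
]

# Route 1: counts first, then tf-idf weighting.
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# Route 2: both steps in one class.
one_step = TfidfVectorizer().fit_transform(docs)

# With default parameters the two sparse matrices hold the same values.
same = np.allclose(two_step.toarray(), one_step.toarray())
print("identical:", same)

# Reduce the sparse matrix to a small dense one.
# n_components is kept below the number of features.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(one_step)
print(reduced.shape)
```

If the matrix has already been restricted to a handful of selected words, the SVD step may be unnecessary, since the matrix is small to begin with.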

You can of course also do it yourself, which requires keeping two maps of term counts: one per document, and one for the whole collection. That is essentially how these classes operate under the hood.
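A rough sketch of that manual approach, restricted to a hypothetical set of selected words. It assumes whitespace tokenization and the textbook weighting tf * ln(N / df). scikit-learn's defaults differ (smoothed idf plus L2 normalization), so the numbers will not match its output.

```python
import math
from collections import Counter

# Hypothetical stand-ins for the .txt files: file name -> contents.
docs = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the dog chased the cat",
    "c.txt": "dogs and cats make good pets",
}
selected = {"cat", "dog", "mat"}  # hypothetical selected words

term_counts = {}      # map 1: per-document counts of the selected words
doc_freq = Counter()  # map 2: number of documents containing each word

for name, text in docs.items():
    tokens = [t for t in text.lower().split() if t in selected]
    term_counts[name] = Counter(tokens)
    doc_freq.update(set(tokens))  # each document counts once per word

n_docs = len(docs)

# tf-idf = term count in the document * ln(N / document frequency)
scores = {
    name: {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }
    for name, counts in term_counts.items()
}

for name, doc_scores in scores.items():
    print(name, doc_scores)
```

Words that never occur in a document simply have no entry in that document's dict, so no division by a zero document frequency can occur.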

This page has some nice examples.
