
Finding Tf-Idf Scores of only selected words from set of documents using scikit-learn

Original post 2016-03-16 16:38:51 7 1 python/ scipy/ nlp/ scikit-learn/ tf-idf

I have a set of documents (stored as .txt files). I also have a Python dictionary of some selected words. I want to assign tf-idf scores only to these words, and not to all words, from the set of documents. How can this be done using scikit-learn or any other library?

I have referred to this blog post, but it gives scores for the full vocabulary.

1 answer

You can do it with CountVectorizer, which scans the documents as text and converts them into a term-document matrix, and then applying TfidfTransformer to that matrix.

These two steps can also be combined and done together with the TfidfVectorizer.

These are in the sklearn.feature_extraction.text module [link].
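A minimal sketch of both routes follows. The answer does not say how to restrict scoring to the selected words; this sketch assumes the `vocabulary` parameter that both CountVectorizer and TfidfVectorizer accept. The sample documents and the contents of `selected_words` are made up for illustration, and the dictionary keys are assumed to be the words of interest.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical stand-ins for the contents of the .txt files.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Hypothetical dictionary of selected words; only the keys are used here.
selected_words = {"cat": None, "dog": None, "mat": None}
vocab = sorted(selected_words)  # fixed column order: cat, dog, mat

# Route 1: count the selected terms, then reweight the counts with tf-idf.
counts = CountVectorizer(vocabulary=vocab).fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# Route 2: the same two steps combined in a single estimator.
vectorizer = TfidfVectorizer(vocabulary=vocab)
tfidf_one_step = vectorizer.fit_transform(docs)

# Both routes should produce the same matrix.
dense = tfidf_one_step.toarray()
same = np.allclose(tfidf_two_step.toarray(), dense)

# Column index -> word, recovered from the fitted vocabulary mapping.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
```

Each row of `dense` is a document and each column one of the selected words. Note that with the default `norm="l2"` each row is normalized over the selected words only, so the values differ from the matching columns of a full-vocabulary matrix; passing `norm=None` keeps the unnormalized tf × idf products.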

Both approaches return the same sparse matrix representation, on which I presume you will then apply an SVD transform with TruncatedSVD to get a smaller dense matrix.
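The reduction step mentioned above might look like the following sketch. The documents, `n_components=2`, and `random_state=0` are arbitrary choices for illustration, not values from the answer.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, used only to produce a sparse tf-idf matrix.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
    "the mat is on the floor",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse, one row per document

# TruncatedSVD accepts sparse input directly and returns a dense array.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)
```

`reduced` has one row per document and `n_components` columns. If only a handful of selected words are scored, the matrix may already be small enough that this step is unnecessary.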

You can also, of course, do it yourself, which requires keeping two maps, one for each document and one overall, in which you count the terms. That is how they operate under the hood.
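A rough sketch of the do-it-yourself route, under several assumptions: the "overall" map is read here as a document-frequency count, tokenization is a naive lowercase whitespace split, and the weighting uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 with no row normalization. The documents and `selected_words` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stand-ins for the contents of the .txt files.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
selected_words = {"cat", "dog", "mat"}  # hypothetical selection

# Naive tokenizer (an assumption): lowercase, split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Map 1: one counter per document, restricted to the selected words.
term_counts = [
    Counter(token for token in tokens if token in selected_words)
    for tokens in tokenized
]

# Map 2: overall document frequency of each selected word.
doc_freq = Counter()
for counts in term_counts:
    doc_freq.update(counts.keys())

n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency (assumed formula, see lead-in)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


# tf-idf per document: raw term count times idf, for selected words only.
scores = [
    {term: count * idf(term) for term, count in counts.items()}
    for counts in term_counts
]
```

Each entry of `scores` maps the selected words present in that document to their weight; a document containing none of them gets an empty dictionary.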

This page has some nice examples.

Get the document name in scikit-learn tf-idf matrix

Python Scikit-learn: Empty Vocabulary in TF-IDF

Group features of TF-IDF vector in scikit-learn

Difference in values of tf-idf matrix using scikit-learn and hand calculation

Interpreting the sum of TF-IDF scores of words across documents

Scikit Learn - Calculating TF-IDF from a corpus of arrays of features instead of from a corpus of raw documents

scikit-learn - Should I fit model with TF or TF-IDF?

Getting TF-IDF Scores Of Words Using Gensim

Find the words with specified tf-idf scores

How to get TF-IDF scores for the words?


The technical posts on this site follow the CC BY-SA 4.0 license. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1 1 ACCEPTED 2016-03-16 16:46:33
