Calculate tf-idf weight for only given word list with sklearn

Question

I want to get tf-idf weights for a given word list from a set of documents. For example, I have the documents and the words I am interested in, like below.

document_list = ['''document 1 blabla''', '''document 2 blabla''']
words = ['project', 'management', 'uml theory', 'wireframe']

Of course I can get all terms and their weights from the documents using sklearn, but I want only the weights of the words above, computed over the document group with scikit-learn. Any idea will help me a lot.

Answer 1

This is as easy as fitting TfidfVectorizer to your fixed list of desired words and then using the fitted model to transform your documents.

Proof:

from sklearn.feature_extraction.text import TfidfVectorizer
words = ['project', 'management', 'uml theory', 'wireframe']
mod_tfidf = TfidfVectorizer()
mod_tfidf.fit_transform(words)
<4x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>

The four entries give five features because 'uml theory' is tokenized into 'uml' and 'theory'. Add one more word and see that the second dimension is still 5:

mod_tfidf.transform(words + ["dummy"])
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>

Edit :

Given your updated question and comment, fit on the word list and then transform your documents:

mod_tfidf.fit(words)
mod_tfidf.transform(document_list)
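
To see what this returns, here is a minimal sketch with two made-up documents standing in for document_list (their text is an assumption, not from the question). Note that because the vectorizer is fitted on the word list, the idf part is learned from that list and not from the documents, so every term gets the same idf here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents, standing in for document_list from the question.
document_list = [
    "project management with uml theory",
    "a wireframe for the project",
]
words = ['project', 'management', 'uml theory', 'wireframe']

mod_tfidf = TfidfVectorizer()
mod_tfidf.fit(words)                          # vocabulary and idf come from the word list
weights = mod_tfidf.transform(document_list)  # one row per document, one column per term

# vocabulary_ maps each learned term to its column index.
term_to_col = mod_tfidf.vocabulary_
dense = weights.toarray()
for i, row in enumerate(dense):
    print(i, {t: round(float(row[c]), 3) for t, c in sorted(term_to_col.items())})
```

Terms that are not in the fitted vocabulary (such as "with" or "for") are simply ignored, and a listed word that does not occur in a document gets weight 0 in that row.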

Edit2 :

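For the sake of completeness, initializing TfidfVectorizer with the vocabulary parameter also restricts the output to your terms. Pay attention that in this case words should be a list of separate single words: with the default settings a multi-word entry such as 'uml theory' is never matched, because the documents are split into single tokens.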

mod_tfidf = TfidfVectorizer(vocabulary=words)

In this case the order of the resulting features is fixed by the order of your words. You can check it with (the method is called get_feature_names() in scikit-learn versions before 1.0):

mod_tfidf.get_feature_names_out()
