Calculate tf-idf weight for only given word list with sklearn

I want to get tf-idf weights for a given word list from a set of documents. For example, these are the words I am interested in:

document_list = ['''document 1 blabla''', '''document 2 blabla''']
words = ['project', 'management', 'uml theory', 'wireframe']

Of course, I can get all terms and their weights from the documents using sklearn, but I want only the weights of the words above across the document group, using scikit-learn. Any ideas would help me a lot.

This is as easy as fitting a TfidfVectorizer to your fixed list of desired words and then using the fitted model to transform your documents.

Proof:

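from sklearn.feature_extraction.text import TfidfVectorizer
words = ['project', 'management', 'uml theory', 'wireframe']
mod_tfidf = TfidfVectorizer()
# Learn the vocabulary from the word list; the default tokenizer splits
# 'uml theory' into 'uml' and 'theory', hence 5 features for 4 entries
mod_tfidf.fit_transform(words)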
<4x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>

Add one more word and note that the second dimension (the number of features) is still 5:

mod_tfidf.transform(words + ["dummy"])
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>
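The five columns can be inspected directly. The following is a minimal sketch, assuming the default TfidfVectorizer settings used above; it shows which terms were learned and that the unseen word "dummy" produces an all-zero row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

words = ['project', 'management', 'uml theory', 'wireframe']

mod_tfidf = TfidfVectorizer()
mod_tfidf.fit(words)

# vocabulary_ maps each learned term to its column index;
# 'uml theory' was split into two separate unigrams.
print(sorted(mod_tfidf.vocabulary_))

# "dummy" is not in the learned vocabulary, so its row has no stored values.
extended = mod_tfidf.transform(words + ["dummy"])
print(extended.shape)
print(extended[4].nnz)
```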

Edit:

Given your updated question and comment:

mod_tfidf.fit(words)
mod_tfidf.transform(document_list)
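To read the result, each column can be mapped back to its term. This is a rough sketch only: the two documents below are hypothetical stand-ins for the question's document_list, and the column names are recovered from vocabulary_ so that it does not depend on the scikit-learn version. Note that with this approach the idf values are learned from words, not from the documents themselves.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents standing in for the question's document_list.
document_list = [
    "project management with uml theory",
    "wireframe for the new project",
]
words = ['project', 'management', 'uml theory', 'wireframe']

mod_tfidf = TfidfVectorizer()
mod_tfidf.fit(words)                          # vocabulary and idf come from `words`
weights = mod_tfidf.transform(document_list)  # one row per document

# Order the learned terms by their column index.
feature_names = sorted(mod_tfidf.vocabulary_, key=mod_tfidf.vocabulary_.get)

# Print a {term: weight} mapping for every document.
for doc_idx, row in enumerate(weights.toarray()):
    print(doc_idx, dict(zip(feature_names, row.round(3))))
```

Terms that are not in the fitted vocabulary (here "with", "for", "the", "new") are simply ignored by transform.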

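Edit 2:

For the sake of completeness, initializing TfidfVectorizer with the vocabulary parameter achieves the same goal. Note that in this case words should be a list of separate single words: with the default unigram tokenization, a multi-word entry such as 'uml theory' will never match anything in the documents.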

mod_tfidf = TfidfVectorizer(vocabulary=words)

In this case the ordering of the resulting features is fixed by the order of your words. You can check it with:

mod_tfidf.get_feature_names_out()  # get_feature_names() in scikit-learn versions before 1.0
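If the two-word term 'uml theory' should be kept as a single feature, one option is to combine the fixed vocabulary with ngram_range=(1, 2) so that bigrams are generated as well. This is a sketch under that assumption, again with hypothetical documents; here the vectorizer is fitted on the documents, so the idf values come from the document group rather than from the word list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents standing in for the question's document_list.
document_list = [
    "project management with uml theory",
    "wireframe for the new project",
]
words = ['project', 'management', 'uml theory', 'wireframe']

# Fixed vocabulary: only these terms become columns, in this order.
# ngram_range=(1, 2) lets the bigram 'uml theory' be matched.
mod_tfidf = TfidfVectorizer(vocabulary=words, ngram_range=(1, 2))
weights = mod_tfidf.fit_transform(document_list)  # idf learned from the documents

# Column order recovered from vocabulary_ (works across scikit-learn versions).
columns = sorted(mod_tfidf.vocabulary_, key=mod_tfidf.vocabulary_.get)

for doc_idx, row in enumerate(weights.toarray()):
    print(doc_idx, dict(zip(columns, row.round(3))))
```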
