I want to get the tf-idf weights of a given list of words from a set of documents. For example, I have documents and a list of words I am interested in, like below:
document_list = ['''document 1 blabla''', '''document 2 blabla''']
words = ['project', 'management', 'uml theory', 'wireframe']
Of course, I can get all terms and their weights from the documents using sklearn, but I want only the weights of the words above across the document group, using scikit-learn. Any ideas would help a lot.
This is as easy as fitting TfidfVectorizer to your fixed list of desired words and then using the fitted model to transform your documents.
Proof:
from sklearn.feature_extraction.text import TfidfVectorizer
words = ['project', 'management', 'uml theory', 'wireframe']
mod_tfidf = TfidfVectorizer()
mod_tfidf.fit_transform(words)
<4x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
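There are 5 features rather than 4 because the default tokenizer splits 'uml theory' into 'uml' and 'theory'. Add one more word and see that the size of the second dimension is still 5: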
mod_tfidf.transform(words + ["dummy"])
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
Edit:
Given your updated question and comment, fit on the words and then transform the documents:
mod_tfidf.fit(words)
mod_tfidf.transform(document_list)
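To read off the weight of each word per document, the transformed matrix can be paired with the fitted vocabulary. The following is a minimal sketch, not a definitive implementation: the contents of document_list are hypothetical stand-ins (the question's documents are elided), and the per-document dictionaries are only one possible way to present the result.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in documents; the real document_list is not shown in the question.
document_list = [
    "project management with uml theory",
    "a wireframe for the project",
]
words = ['project', 'management', 'uml theory', 'wireframe']

mod_tfidf = TfidfVectorizer()
mod_tfidf.fit(words)  # vocabulary and IDF are learned from `words` only
weights = mod_tfidf.transform(document_list).toarray()

# vocabulary_ maps each learned token to its column index in the matrix
for doc_idx, row in enumerate(weights):
    doc_weights = {term: row[col] for term, col in mod_tfidf.vocabulary_.items()}
    print(doc_idx, doc_weights)
```

Keep in mind that fitting on words means the IDF statistics come from the word list, not from the documents, and that 'uml theory' is scored as two separate tokens here.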
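Edit 2:
For the sake of completeness, initializing TfidfVectorizer with the vocabulary param also delivers the same results. Note that in this case words should be a list of separate single words, because the default analyzer only produces single-word tokens and a multi-word entry such as 'uml theory' would never match: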
mod_tfidf = TfidfVectorizer(vocabulary=words)
In this case the ordering of the resulting features is fixed by the order of your words list. You can check it with:
mod_tfidf.get_feature_names_out()  # get_feature_names() on scikit-learn versions before 1.0
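If the two-word entry 'uml theory' should be kept as a single feature, one option is to combine vocabulary with ngram_range=(1, 2) and fit on the documents themselves, so the IDF part reflects the document group. This is a sketch under assumptions rather than the answer's original method: the documents are hypothetical stand-ins, and widening ngram_range is an addition not present above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in documents; the real document_list is not shown in the question.
document_list = [
    "project management with uml theory",
    "a wireframe for the project",
]
words = ['project', 'management', 'uml theory', 'wireframe']

# ngram_range=(1, 2) makes the analyzer emit bigrams, so 'uml theory' can match
vec = TfidfVectorizer(vocabulary=words, ngram_range=(1, 2))
matrix = vec.fit_transform(document_list)  # IDF is computed from document_list

# Feature order follows the order of `words`
features = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
for doc_idx, row in enumerate(matrix.toarray()):
    print(doc_idx, dict(zip(features, row)))
```

Each row then holds exactly one weight per entry of words, in the same order, with zeros for words that do not occur in that document.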