I have a large corpus (around 400k unique sentences). I just want to get the TF-IDF score for each word. I tried computing the score myself by scanning every word and counting its frequency, but that takes too long.
I used:
X = TfidfVectorizer(corpus)
from sklearn, but it directly gives back the vector representation of each sentence. Is there any way I can get the TF-IDF scores for each word in the corpus?
Use sklearn.feature_extraction.text.TfidfVectorizer (example taken from the docs):
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() on scikit-learn < 1.0
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.shape)
(4, 9)
Now, if I print X.toarray():
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
Each row in this 2D array corresponds to a document, and each element in the row is the TF-IDF score of the corresponding word in that document. To find out which word each element represents, call .get_feature_names_out() (named .get_feature_names() before scikit-learn 1.0), which returns the words in column order. For example, in this case, look at the row for the first document:
[0., 0.46979139, 0.58028582, 0.38408524, 0., 0., 0.38408524, 0., 0.38408524]
In the example, the feature names are:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
Therefore, you map the scores to the words like this:
dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0]))  # get_feature_names() on scikit-learn < 1.0
{'and': 0.0, 'document': 0.46979139, 'first': 0.58028582, 'is': 0.38408524, 'one': 0.0, 'second': 0.0, 'the': 0.38408524, 'third': 0.0, 'this': 0.38408524}
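For a corpus of roughly 400k sentences, X.toarray() may not fit in memory, since it materializes every zero. The following is a minimal sketch of the same word-to-score mapping read straight from the sparse matrix, under a few assumptions: the helper name doc_scores and the variables terms and idf_by_term are made up for illustration, the matrix is converted to CSR so its indptr/indices/data arrays can be sliced per row, and the column-to-word list is built from the fitted vocabulary_ attribute so it does not depend on which feature-name method your scikit-learn version provides. It also collects vectorizer.idf_, which gives one corpus-wide IDF weight per word, in case a single number per word is what you are after.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
# Keep the result sparse; .tocsr() makes sure row slicing via indptr is valid.
X = vectorizer.fit_transform(corpus).tocsr()

# vocabulary_ maps term -> column index; invert it into a column-ordered list.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def doc_scores(doc_index):
    """Return {term: tf-idf} for the terms that occur in one document.

    Hypothetical helper: it reads only the stored (non-zero) entries of the
    row, so the full dense matrix is never built.
    """
    start, end = X.indptr[doc_index], X.indptr[doc_index + 1]
    cols = X.indices[start:end]
    vals = X.data[start:end]
    return {terms[j]: float(v) for j, v in zip(cols, vals)}


# One corpus-wide IDF weight per term (independent of any single document).
idf_by_term = {term: float(w) for term, w in zip(terms, vectorizer.idf_)}

print(doc_scores(0))
print(idf_by_term)
```

Here doc_scores(0) contains only the five words that appear in the first document; words with a score of zero are simply absent instead of being stored as 0.0. A TF-IDF score is defined per word per document, so if you need a single value per word across the whole corpus you have to pick an aggregate yourself (the IDF weight above, or for example a column-wise mean of X).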