How to get TF-IDF scores for the words?

I have a large corpus (around 400k unique sentences). I just want to get the TF-IDF score for each word. I tried to calculate the score by scanning each word and counting its frequency myself, but that takes too long.

I used:

  X = TfidfVectorizer().fit_transform(corpus)

from sklearn, but it directly returns the vector representation of each sentence. Is there any way I can get the TF-IDF scores for each word in the corpus?

Here is how to use sklearn.feature_extraction.text.TfidfVectorizer (example taken from the docs):

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
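>>> # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
>>> print(vectorizer.get_feature_names_out().tolist())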
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.shape)
(4, 9)

Now, if I print X.toarray():

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]

Each row in this 2D array corresponds to a document, and each element in the row is the TF-IDF score of the corresponding word in that document. To find out which word each element represents, call .get_feature_names_out() (named .get_feature_names() before scikit-learn 1.0), which returns the vocabulary in column order. For example, look at the row for the first document:

[0., 0.46979139, 0.58028582, 0.38408524, 0., 0., 0.38408524, 0., 0.38408524]

In the example, the feature names (from .get_feature_names_out(), or .get_feature_names() before scikit-learn 1.0) are:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Therefore, you can map the scores to the words like this:

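# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dict(zip(vectorizer.get_feature_names_out().tolist(), X.toarray()[0].round(8).tolist()))
{'and': 0.0, 'document': 0.46979139, 'first': 0.58028582, 'is': 0.38408524, 'one': 0.0, 'second': 0.0, 'the': 0.38408524, 'third': 0.0, 'this': 0.38408524}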
