I have a problem understanding sklearn's TfidfVectorizer results
Given a corpus of 3 documents, for example:
sentences = ["This car is fast",
"This car is pretty",
"Very fast truck"]
I am executing the tf-idf calculation by hand.
For document 1 and the word "car", I find that:
TF = 1/4
IDF = log(3/2)
TF-IDF = 1/4 * log(3/2)
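As a sanity check, the hand calculation above can be reproduced in a few lines (using the natural log here; the choice of base only scales the result by a constant):

```python
import math

tf = 1 / 4             # "car" appears once among the 4 tokens of document 1
idf = math.log(3 / 2)  # 3 documents, 2 of them contain "car"
tfidf = tf * idf
print(tfidf)           # ≈ 0.1014
```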
The same result should apply to document 2, since it also has 4 words, one of which is "car".
I have tried to reproduce this in sklearn with the code below:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
data = {'text': sentences}
df = pd.DataFrame(data)
tv = TfidfVectorizer()
tfvector = tv.fit_transform(df.text)
# on scikit-learn < 1.0, use tv.get_feature_names() instead
print(pd.DataFrame(tfvector.toarray(), columns=tv.get_feature_names_out()))
And the result I get is:
        car     fast        is    pretty      this     truck      very
0  0.500000  0.50000  0.500000  0.000000  0.500000  0.000000  0.000000
1  0.459854  0.00000  0.459854  0.604652  0.459854  0.000000  0.000000
2  0.000000  0.47363  0.000000  0.000000  0.000000  0.622766  0.622766
I understand that sklearn uses L2 normalization, but still, shouldn't the tf-idf score of "car" be the same in the first two documents? Can anyone help me understand these results?
It is because of the normalization. If you pass the parameter norm=None, i.e. TfidfVectorizer(norm=None), you will get the following result, which has the same value for "car" in both documents:
        car      fast        is    pretty      this     truck      very
0  1.287682  1.287682  1.287682  0.000000  1.287682  0.000000  0.000000
1  1.287682  0.000000  1.287682  1.693147  1.287682  0.000000  0.000000
2  0.000000  1.287682  0.000000  0.000000  0.000000  1.693147  1.693147
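The unnormalized numbers themselves come from sklearn's smoothed idf formula (the default, smooth_idf=True): idf(t) = ln((1 + n) / (1 + df(t))) + 1, applied to raw term counts rather than the 1/4 relative frequency used in the hand calculation. Re-applying the L2 norm row by row then recovers the first table. A quick check, without sklearn:

```python
import math
import numpy as np

n = 3  # number of documents in the corpus

def idf(df):
    # sklearn's smoothed idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

idf_car = idf(2)       # "car" occurs in 2 documents
idf_pretty = idf(1)    # "pretty" occurs in 1 document
print(idf_car)         # ≈ 1.287682
print(idf_pretty)      # ≈ 1.693147

# Unnormalized row for document 2 ("This car is pretty"):
# "car", "is", "pretty", "this" each occur once, so tf = 1 for each
row = np.array([idf_car, idf_car, idf_pretty, idf_car])
print(row / np.linalg.norm(row))  # L2 norm → [0.459854 0.459854 0.604652 0.459854]
```

This shows why "car" scores differently across the two documents: document 2 contains the rarer word "pretty", which inflates that row's L2 norm, so every other entry in the row shrinks after normalization.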