I have a problem understanding sklearn's TfidfVectorizer results
Given a corpus of 3 documents, for example:
sentences = ["This car is fast",
"This car is pretty",
"Very fast truck"]
I am executing the tf-idf calculation by hand.
For document 1 and the word "car", I find that:
TF = 1/4
IDF = log(3/2)
TF-IDF = 1/4 * log(3/2)
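As a sanity check, the hand calculation above can be reproduced in a few lines (using the natural log here; the choice of base only scales the result by a constant):

```python
import math

tf = 1 / 4             # "car" appears once among the 4 tokens of document 1
idf = math.log(3 / 2)  # 3 documents, 2 of them contain "car"
tfidf = tf * idf
print(tfidf)           # ≈ 0.1014
```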
The same result should apply to document 2, since it also has 4 words, one of which is "car".
I have tried to reproduce this in sklearn with the code below:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
data = {'text': sentences}
df = pd.DataFrame(data)
tv = TfidfVectorizer()
tfvector = tv.fit_transform(df.text)
# on scikit-learn < 1.0, use tv.get_feature_names() instead
print(pd.DataFrame(tfvector.toarray(), columns=tv.get_feature_names_out()))
And the result I get is:
        car     fast        is    pretty      this     truck      very
0  0.500000  0.50000  0.500000  0.000000  0.500000  0.000000  0.000000
1  0.459854  0.00000  0.459854  0.604652  0.459854  0.000000  0.000000
2  0.000000  0.47363  0.000000  0.000000  0.000000  0.622766  0.622766
I understand that sklearn uses L2 normalization, but still, shouldn't the tf-idf score of "car" be the same in the first two documents? Can anyone help me understand these results?
It is because of the normalization. If you pass the parameter norm=None, i.e. TfidfVectorizer(norm=None), you will get the following result, which has the same value for "car" in both documents:
        car      fast        is    pretty      this     truck      very
0  1.287682  1.287682  1.287682  0.000000  1.287682  0.000000  0.000000
1  1.287682  0.000000  1.287682  1.693147  1.287682  0.000000  0.000000
2  0.000000  1.287682  0.000000  0.000000  0.000000  1.693147  1.693147
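The unnormalized numbers themselves come from sklearn's smoothed idf formula (the default, smooth_idf=True): idf(t) = ln((1 + n) / (1 + df(t))) + 1, applied to raw term counts rather than the 1/4 relative frequency used in the hand calculation. Re-applying the L2 norm row by row then recovers the first table. A quick check, without sklearn:

```python
import math
import numpy as np

n = 3  # number of documents in the corpus

def idf(df):
    # sklearn's smoothed idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

idf_car = idf(2)       # "car" occurs in 2 documents
idf_pretty = idf(1)    # "pretty" occurs in 1 document
print(idf_car)         # ≈ 1.287682
print(idf_pretty)      # ≈ 1.693147

# Unnormalized row for document 2 ("This car is pretty"):
# "car", "is", "pretty", "this" each occur once, so tf = 1 for each
row = np.array([idf_car, idf_car, idf_pretty, idf_car])
print(row / np.linalg.norm(row))  # L2 norm → [0.459854 0.459854 0.604652 0.459854]
```

This shows why "car" scores differently across the two documents: document 2 contains the rarer word "pretty", which inflates that row's L2 norm, so every other entry in the row shrinks after normalization.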