Difference in values of tf-idf matrix using scikit-learn and hand calculation
I am playing with scikit-learn to find the tf-idf values.
I have a set of documents like:
D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."
I want to create a matrix like this:
Docs  blue       bright     sky        sun
D1    tf-idf     0.0000000  tf-idf     0.0000000
D2    0.0000000  tf-idf     0.0000000  tf-idf
D3    0.0000000  tf-idf     tf-idf     tf-idf
So, my code in Python is:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords   # requires nltk.download('stopwords')

train_set = ["sky is blue", "sun is bright", "sun in the sky is bright"]
stop_words = stopwords.words('english')

transformer = TfidfVectorizer(stop_words=stop_words)
t1 = transformer.fit_transform(train_set).todense()
print(t1)
The result matrix I get is:
[[ 0.79596054  0.          0.60534851  0.        ]
 [ 0.          0.4472136   0.          0.89442719]
 [ 0.          0.57735027  0.57735027  0.57735027]]
If I do a hand calculation, then the matrix should be:
Docs  blue       bright     sky        sun
D1    0.2385     0.0000000  0.0880     0.0000000
D2    0.0000000  0.0880     0.0000000  0.0880
D3    0.0000000  0.058      0.058      0.058
I am calculating like, say for blue in D1, tf = 1/2 = 0.5 and idf = log(3/1) = 0.477121255. Therefore tf-idf = tf * idf = 0.5 * 0.477 = 0.2385.
In this way, I am calculating the other tf-idf values. Now, I am wondering: why am I getting different results from the hand calculation and from Python? Which one gives the correct results? Am I doing something wrong in the hand calculation, or is there something wrong in my Python code?
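For reference, the hand calculation described above can be written as a short Python sketch (this assumes tf = term count / number of terms in the document and idf = log10(N / df), which is what the numbers above imply; the docs list below is the corpus after stopword removal):

import math

# corpus after stopword removal
docs = [["sky", "blue"], ["sun", "bright"], ["sun", "sky", "bright"]]
N = len(docs)

def hand_tfidf(term, doc):
    tf = doc.count(term) / len(doc)    # e.g. "blue" in D1: 1/2 = 0.5
    df = sum(term in d for d in docs)  # number of documents containing the term
    idf = math.log10(N / df)           # e.g. "blue": log10(3/1) = 0.477
    return tf * idf

print(hand_tfidf("blue", docs[0]))     # 0.2385..., matching the hand-calculated table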
There are two reasons:
According to the source code, sklearn does not use the formulas you assumed.
First, it smooths the document count (so there is never a 0):
df += int(self.smooth_idf)
n_samples += int(self.smooth_idf)
Second, it uses the natural logarithm (np.log(np.e) == 1):
idf = np.log(float(n_samples) / df) + 1.0
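Plugging this corpus into those two lines (3 documents, so n_samples smooths to 4; "blue" appears in 1 document, "sky" in 2), a quick numeric check:

import numpy as np

n_samples = 3 + 1   # document count, smoothed
df_blue = 1 + 1     # document frequency of "blue", smoothed
df_sky = 2 + 1      # document frequency of "sky", smoothed

idf_blue = np.log(float(n_samples) / df_blue) + 1.0   # log(4/2) + 1 ≈ 1.6931
idf_sky = np.log(float(n_samples) / df_sky) + 1.0     # log(4/3) + 1 ≈ 1.2877
print(idf_blue, idf_sky)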
There is also a default l2 normalization applied. In short, scikit-learn does a few more "nice little things" while computing tf-idf. Neither of these approaches (theirs or yours) is bad; theirs is simply more advanced.
smooth_idf : boolean, default=True
A smoothed version of idf is used. There are many versions; in Python (scikit-learn), the following one is used: $1 + \log\left(\frac{N+1}{n+1}\right)$, where $N$ is the total number of documents and $n$ is the number of documents containing the term.
For D1 ("sky is blue", terms blue and sky):
tf: 1/2, 1/2
idf with smoothing: log(4/2) + 1, log(4/3) + 1
tf-idf: 1/2 * (log(4/2) + 1), 1/2 * (log(4/3) + 1)
after l2 normalization: 0.79596054, 0.60534851
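These steps can be checked with a short numpy sketch (note that the constant 1/2 tf factor cancels under l2 normalization, so raw term counts give the same row):

import numpy as np

tf = np.array([1.0, 1.0])                      # counts of "blue" and "sky" in D1
idf = np.log(np.array([4 / 2, 4 / 3])) + 1.0   # smoothed idf, natural log
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf)                 # l2 normalization
print(tfidf)                                   # [0.79596054  0.60534851]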
By the way, the second row of the output in the original question may be wrong: "sun" and "bright" each appear once in D2 and occur in the same number of documents, so after l2 normalization the two nonzero values should be equal (1/√2 ≈ 0.70710678). My output from Python confirms this.