Difference in values of tf-idf matrix using scikit-learn and hand calculation
I am playing with scikit-learn to find the tf-idf values.
I have a set of documents like:
D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."
I want to create a matrix like this:
Docs  blue       bright     sky        sun
D1    tf-idf     0.0000000  tf-idf     0.0000000
D2    0.0000000  tf-idf     0.0000000  tf-idf
D3    0.0000000  tf-idf     tf-idf     tf-idf
So, my code in Python is:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords   # requires nltk.download('stopwords')

train_set = ["sky is blue", "sun is bright", "sun in the sky is bright"]
stop_words = stopwords.words('english')

transformer = TfidfVectorizer(stop_words=stop_words)
t1 = transformer.fit_transform(train_set).todense()
print(t1)
The result matrix I get is:
[[ 0.79596054  0.          0.60534851  0.        ]
 [ 0.          0.4472136   0.          0.89442719]
 [ 0.          0.57735027  0.57735027  0.57735027]]
If I do a hand calculation, then the matrix should be:
Docs  blue       bright     sky        sun
D1    0.2385     0.0000000  0.0880     0.0000000
D2    0.0000000  0.0880     0.0000000  0.0880
D3    0.0000000  0.058      0.058      0.058
I am calculating like, say for blue in D1, tf = 1/2 = 0.5 and idf = log(3/1) = 0.477121255. Therefore tf-idf = tf * idf = 0.5 * 0.477 = 0.2385.
In this way, I am calculating the other tf-idf values. Now, I am wondering: why am I getting different results from the hand calculation and from Python? Which one gives the correct results? Am I doing something wrong in the hand calculation, or is there something wrong in my Python code?
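For reference, the hand calculation described above can be written as a short Python sketch (this assumes tf = term count / number of terms in the document and idf = log10(N / df), which is what the numbers above imply; the docs list below is the corpus after stopword removal):

import math

# corpus after stopword removal
docs = [["sky", "blue"], ["sun", "bright"], ["sun", "sky", "bright"]]
N = len(docs)

def hand_tfidf(term, doc):
    tf = doc.count(term) / len(doc)    # e.g. "blue" in D1: 1/2 = 0.5
    df = sum(term in d for d in docs)  # number of documents containing the term
    idf = math.log10(N / df)           # e.g. "blue": log10(3/1) = 0.477
    return tf * idf

print(hand_tfidf("blue", docs[0]))     # 0.2385..., matching the hand-calculated table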
There are two reasons:
According to the source code, sklearn does not use the formulas you assumed.
First, it smooths the document count (so there is never a 0):
df += int(self.smooth_idf)
n_samples += int(self.smooth_idf)
Second, it uses the natural logarithm (np.log(np.e) == 1):
idf = np.log(float(n_samples) / df) + 1.0
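Plugging this corpus into those two lines (3 documents, so n_samples smooths to 4; "blue" appears in 1 document, "sky" in 2), a quick numeric check:

import numpy as np

n_samples = 3 + 1   # document count, smoothed
df_blue = 1 + 1     # document frequency of "blue", smoothed
df_sky = 2 + 1      # document frequency of "sky", smoothed

idf_blue = np.log(float(n_samples) / df_blue) + 1.0   # log(4/2) + 1 ≈ 1.6931
idf_sky = np.log(float(n_samples) / df_sky) + 1.0     # log(4/3) + 1 ≈ 1.2877
print(idf_blue, idf_sky)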
There is also a default l2 normalization applied. In short, scikit-learn does a few more "nice little things" while computing tf-idf. Neither of these approaches (theirs or yours) is bad; theirs is simply more advanced.
smooth_idf : boolean, default=True
A smoothed version of idf is used. There are many versions; in Python (scikit-learn), the following one is used: $1 + \log\left(\frac{N+1}{n+1}\right)$, where $N$ is the total number of documents and $n$ is the number of documents containing the term.
For D1 ("sky is blue", terms blue and sky):
tf: 1/2, 1/2
idf with smoothing: log(4/2) + 1, log(4/3) + 1
tf-idf: 1/2 * (log(4/2) + 1), 1/2 * (log(4/3) + 1)
after l2 normalization: 0.79596054, 0.60534851
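These steps can be checked with a short numpy sketch (note that the constant 1/2 tf factor cancels under l2 normalization, so raw term counts give the same row):

import numpy as np

tf = np.array([1.0, 1.0])                      # counts of "blue" and "sky" in D1
idf = np.log(np.array([4 / 2, 4 / 3])) + 1.0   # smoothed idf, natural log
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf)                 # l2 normalization
print(tfidf)                                   # [0.79596054  0.60534851]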
By the way, the second row of the output in the original question may be wrong: "sun" and "bright" each appear once in D2 and occur in the same number of documents, so after l2 normalization the two nonzero values should be equal (1/√2 ≈ 0.70710678). My output from Python confirms this.