Difference between tf-idf matrix values computed with scikit-learn and by hand

Question

I am experimenting with scikit-learn to compute tf-idf values.

I have a set of documents such as:

D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."

I want to create a matrix like this:

   Docs      blue      bright    sky       sun
   D1        tf-idf    0.0000000 tf-idf    0.0000000
   D2        0.0000000 tf-idf    0.0000000 tf-idf
   D3        0.0000000 tf-idf    tf-idf    tf-idf

So my Python code is:

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

train_set = ["sky is blue", "sun is bright", "sun in the sky is bright"]
# English stop words from NLTK (the "stopwords" corpus must be downloaded first)
stop_words = stopwords.words('english')

transformer = TfidfVectorizer(stop_words=stop_words)

# Fit on the documents and convert the sparse result to a dense matrix
t1 = transformer.fit_transform(train_set).todense()
print(t1)

The resulting matrix I get is:

[[ 0.79596054  0.          0.60534851  0.        ]
 [ 0.          0.4472136   0.          0.89442719]
 [ 0.          0.57735027  0.57735027  0.57735027]]

If I do the calculation by hand, the matrix should be:

            Docs  blue      bright       sky       sun
            D1    0.2385    0.0000000  0.0880    0.0000000
            D2    0.0000000 0.0880     0.0000000 0.0880
            D3    0.0000000 0.058      0.058     0.058

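My calculation works like this, e.g. for blue: tf = 1/2 = 0.5 and idf = log(3/1) = 0.477121255, so tf-idf = tf * idf = 0.5 * 0.477 = 0.2385. I compute the other tf-idf values the same way. Now I want to know why the hand-computed matrix and the Python matrix give different results. Which one is correct? Am I doing something wrong in my hand calculation, or is there something wrong with my Python code?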
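For reference, the hand calculation described above can be sketched in a few lines of plain Python. This is only an illustration: it assumes that "the", "is" and "in" are removed as stop words, that tf is the term count divided by the number of remaining tokens, and that idf is an unsmoothed base-10 logarithm. The names `docs`, `vocab` and `hand_tfidf` are made up for this sketch.

```python
import math

# Documents after stop-word removal (assumption: "the", "is", "in" are dropped).
docs = {
    "D1": ["sky", "blue"],
    "D2": ["sun", "bright"],
    "D3": ["sun", "sky", "bright"],
}
vocab = ["blue", "bright", "sky", "sun"]
n_docs = len(docs)


def hand_tfidf(term, tokens):
    """tf-idf as computed by hand in the question (hypothetical helper)."""
    # tf: term count divided by the number of remaining tokens in the document
    tf = tokens.count(term) / len(tokens)
    # df: number of documents that contain the term
    df = sum(1 for other in docs.values() if term in other)
    # idf: base-10 logarithm, no smoothing
    idf = math.log10(n_docs / df)
    return tf * idf


hand_matrix = {
    name: [round(hand_tfidf(term, tokens), 4) for term in vocab]
    for name, tokens in docs.items()
}

for name, row in hand_matrix.items():
    print(name, row)
```

Under these assumptions the values agree with the hand-computed table up to rounding in the last digit.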

Answer 1

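There are two reasons:

You are ignoring smoothing, which is commonly applied in this setting
You are assuming a base-10 logarithm

According to the source code, sklearn makes neither assumption.

First, it smooths the document counts (so there is never a 0):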

df += int(self.smooth_idf)
n_samples += int(self.smooth_idf)

It also uses the natural logarithm ( np.log(np.e) == 1 ):

idf = np.log(float(n_samples) / df) + 1.0

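L2 normalization is also applied by default. In short, scikit-learn does a few more "nice little things" when computing tf-idf. Neither approach (theirs or yours) is wrong; theirs is simply more elaborate.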
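To see how these three steps combine, here is a rough NumPy sketch that applies them to the question's documents. It is not scikit-learn's actual implementation, only the quoted source lines plus a row-wise L2 normalization. The count matrix is written out by hand on the assumption that "the", "is" and "in" are removed as stop words.

```python
import numpy as np

# Raw term counts after stop-word removal; columns: blue, bright, sky, sun.
counts = np.array(
    [
        [1, 0, 1, 0],  # D1: sky, blue
        [0, 1, 0, 1],  # D2: sun, bright
        [0, 1, 1, 1],  # D3: sun, sky, bright
    ],
    dtype=float,
)

n_samples = counts.shape[0]
df = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothing: add one to both counts, as in the source lines quoted above.
smooth_idf = True
df = df + int(smooth_idf)
n_samples = n_samples + int(smooth_idf)

# Natural logarithm, plus one.
idf = np.log(float(n_samples) / df) + 1.0

# Weight the counts by idf, then scale every row to unit L2 norm.
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

print(np.round(tfidf, 8))
```

With these inputs the D1 and D3 rows come out the same as in the matrix printed in the question.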

Answer 2

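smooth_idf: boolean, default=True

Use the smoothed version of idf. There are many variants. In Python (scikit-learn), the following one is used: $1 + \log\left(\frac{N + 1}{n + 1}\right)$ (natural logarithm), where $N$ is the total number of documents and $n$ is the number of documents containing the term.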

tf : 1/2, 1/2
idf with smoothing: (log(4/2)+1) ,(log(4/3)+1)
tf-idf : 1/2* (log(4/2)+1) ,1/2 * (log(4/3)+1)
L-2 normalization: 0.79596054 0.60534851

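By the way, the second row in the original question may be wrong: its two nonzero values should be the same. That is what I get from Python.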
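The arithmetic above is easy to check with the standard library alone. The sketch below follows this answer's steps for D1 and repeats them for D2. The helper names `smoothed_idf` and `l2_normalize` are made up for illustration, and tf = 1/2 assumes two remaining terms per document after stop-word removal.

```python
import math

N = 3  # total number of documents


def smoothed_idf(n):
    """1 + ln((N + 1) / (n + 1)), where n = documents containing the term."""
    return 1.0 + math.log((N + 1) / (n + 1))


def l2_normalize(values):
    """Scale a list of numbers to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


# D1 = {blue, sky}: blue occurs in 1 document, sky in 2; tf = 1/2 for each.
d1 = l2_normalize([0.5 * smoothed_idf(1), 0.5 * smoothed_idf(2)])

# D2 = {bright, sun}: both occur in 2 documents; tf = 1/2 for each.
d2 = l2_normalize([0.5 * smoothed_idf(2), 0.5 * smoothed_idf(2)])

print([round(v, 8) for v in d1])
print([round(v, 8) for v in d2])
```

Because bright and sun have the same tf and the same document frequency, their weights in D2 have to be equal after normalization, whatever the exact idf formula is.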

Answer 1: 11, 2014-06-04 16:28:18

Answer 2: 0, 2015-06-24 23:16:39
