Implementing a TF-IDF vectorizer from scratch

Question

I am trying to implement a tf-idf vectorizer from scratch in Python. I computed my IDF values, but they do not match the IDF values computed by sklearn's TfidfVectorizer().

What exactly am I doing wrong?

corpus = [
 'this is the first document',
 'this document is the second document',
 'and this is the third one',
 'is this the first document',
]

from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy

# Tokenize each document on whitespace
sentence = []
for i in range(len(corpus)):
    sentence.append(corpus[i].split())

word_freq = {}   # maps each word to the set of document indices that contain it
for i in range(len(sentence)):
    tokens = sentence[i]
    for w in tokens:
        try:
            word_freq[w].add(i)   # word already seen: record this document index
        except KeyError:
            word_freq[w] = {i}    # first occurrence: start a new set for this word

for i in word_freq:
    word_freq[i] = len(word_freq[i])  # document frequency: number of documents containing the word

def idf():
    idfDict = {}
    for word in word_freq:
        idfDict[word] = math.log(len(sentence) / word_freq[word])
    return idfDict
idfDict = idf()

Expected output (obtained from vectorizer.idf_):

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073 1.         1.91629073 1.        ]

Actual output (each value is the idf of the corresponding key):

{'and': 1.3862943611198906,
'document': 0.28768207245178085,
'first': 0.6931471805599453,
'is': 0.0,
'one': 1.3862943611198906,
'second': 1.3862943611198906,
'the': 0.0,
'third': 1.3862943611198906,
'this': 0.0
 }

Answer 1

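There are a few default parameters that might affect what sklearn is calculating, but the one that seems to matter here is:

smooth_idf : boolean (default=True) Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

If you subtract 1 from each element and raise e to that power, you get values very close to 5 / n for small values of n: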

1.91629073 => 5/2
1.22314355 => 5/4
1.51082562 => 5/3
1 => 5/5
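The arithmetic above can be checked with a few lines of Python. This is only a sketch: the list name `expected` is illustrative, and the numbers are the four distinct idf_ values quoted in the question.

```python
import math

# The four distinct idf_ values reported by sklearn in the question
expected = [1.91629073, 1.22314355, 1.51082562, 1.0]

# Undo the "+ 1" and the natural log to recover the ratio inside the log
ratios = [math.exp(v - 1) for v in expected]

print([round(r, 4) for r in ratios])  # [2.5, 1.25, 1.6667, 1.0]
```

The recovered ratios are 5/2, 5/4, 5/3 and 5/5, which is consistent with a numerator of 4 + 1 (four documents plus one) and a denominator of the document frequency plus one.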

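In any case, there is no single tf-idf implementation; the metric you define is just a heuristic that tries to capture certain properties (such as "a higher idf should correlate with rarity in the corpus"), so I wouldn't worry too much about reproducing an identical implementation.

sklearn appears to use: log((number of documents + 1) / (document frequency of word + 1)) + 1, with the natural logarithm, which is as if there were one extra document containing every word in the corpus.

Edit: this last paragraph is corroborated by the docstring of TfidfTransformer.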
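As a minimal sketch of the smoothed formula described above, the question's idf computation could be adapted as follows. The function name `smooth_idf` is hypothetical, and tokenization is a plain whitespace split as in the question, which is an assumption that happens to agree with sklearn's default tokenizer on this particular corpus.

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def smooth_idf(docs):
    """Return {term: idf} using log((1 + N) / (1 + df)) + 1 with the natural log."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        # set() ensures each term is counted at most once per document
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1
    # Sort terms alphabetically so the order matches a sorted vocabulary
    return {
        term: math.log((1 + n_docs) / (1 + count)) + 1
        for term, count in sorted(df.items())
    }

idf = smooth_idf(corpus)
for term, value in idf.items():
    print(f"{term}: {value:.8f}")
```

With this corpus, 'and' (one document) gets 1.91629073, 'first' (two documents) gets 1.51082562, 'document' (three documents) gets 1.22314355, and terms found in all four documents, such as 'the', get exactly 1.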
