
Implementing a TF-IDF Vectorizer from Scratch

I am trying to implement a tf-idf vectorizer from scratch in Python. I computed my IDF values, but they do not match the IDF values computed by sklearn's TfidfVectorizer().

What am I doing wrong?

corpus = [
 'this is the first document',
 'this document is the second document',
 'and this is the third one',
 'is this the first document',
]

from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy

sentence = []
for i in range(len(corpus)):
    sentence.append(corpus[i].split())  # tokenize each document into a list of words

word_freq = {}   # document frequency: map each word to the set of documents it appears in
for i in range(len(sentence)):
    tokens = sentence[i]
    for w in tokens:
        try:
            word_freq[w].add(i)  # word already seen: record this document's index (sets ignore duplicates)
        except KeyError:
            word_freq[w] = {i}   # first occurrence of the word: start a new set

for i in word_freq:
    word_freq[i] = len(word_freq[i])  # number of documents containing the word, i.e. its document frequency

def idf():
    idfDict = {}
    for word in word_freq:
        idfDict[word] = math.log(len(sentence) / word_freq[word])
    return idfDict
idfDict = idf()

expected output (obtained using vectorizer.idf_):

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073 1.22314355 1.91629073 1.        ]

actual output (the values are the idf values of the corresponding keys):

{'and': 1.3862943611198906,
'document': 0.28768207245178085,
'first': 0.6931471805599453,
'is': 0.0,
'one': 1.3862943611198906,
'second': 1.3862943611198906,
'the': 0.0,
'third': 1.3862943611198906,
'this': 0.0
 }
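
For reference, assuming the default settings, the sklearn numbers above can be reproduced like this; note that vectorizer.idf_ follows the alphabetically sorted vocabulary, which happens to match the key order of the dict above:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
print(vectorizer.get_feature_names_out())  # sorted vocabulary (get_feature_names() on older sklearn)
# ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
print(vectorizer.idf_)
# [1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073 1.22314355 1.91629073 1.        ]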

There are a few default parameters that might affect what sklearn is calculating, but the particular one here that seems to matter is:

smooth_idf : boolean (default=True) Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
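
As a minimal sketch of what that smoothing does to the formula in the question (leaving aside the extra +1 that sklearn also adds, covered below):

import math

N = 4                                      # number of documents in this corpus
for df in (1, 2, 3, 4):                    # the document frequencies that occur here
    plain  = math.log(N / df)              # the question's idf: log(N / df)
    smooth = math.log((1 + N) / (1 + df))  # with smooth_idf's extra "document"
    print(df, plain, smooth)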

If you subtract one from each element and raise e to that power, you get values that are very close to 5 / n, for low values of n:

1.91629073 => 5/2
1.22314355 => 5/4
1.51082562 => 5/3
1 => 5/5
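
You can verify this numerically, e.g.:

import math

for v in (1.91629073, 1.51082562, 1.22314355, 1.0):
    print(round(math.exp(v - 1), 6))  # 2.5, 1.666667, 1.25, 1.0 -> i.e. 5/2, 5/3, 5/4, 5/5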

At any rate, there is not a single tf-idf implementation; any given metric is simply a heuristic that tries to satisfy certain properties (like "a higher idf should correlate with rarity in the corpus"), so I wouldn't worry too much about achieving an identical implementation.

sklearn appears to have used: log((number of documents + 1) / (document frequency of the word + 1)) + 1, which is rather as if there were an extra document containing every single word in the corpus.
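
Putting it together, a corrected idf() along these lines (a sketch reusing the sentence and word_freq variables from the question) reproduces sklearn's values:

import math

def idf():
    # sklearn's default (smooth_idf=True): idf(t) = ln((1 + n) / (1 + df(t))) + 1
    n = len(sentence)
    return {word: math.log((1 + n) / (1 + word_freq[word])) + 1
            for word in word_freq}

idfDict = idf()
for word in sorted(idfDict):  # vectorizer.idf_ follows the sorted vocabulary
    print(word, round(idfDict[word], 8))
# and 1.91629073, document 1.22314355, first 1.51082562, is 1.0, ...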

Edit: this last paragraph is corroborated by the docstring for TfidfTransformer.
