
Tf-idf calculation using gensim

I have a tf-idf example from an ISI paper, and I'm trying to validate my code against it. However, my code gives a different result, and I don't know why.

Term-document matrix from paper:

acceptance     [ 0 1 0 1 1 0
information      0 1 0 1 0 0
media            1 0 1 0 0 2
model            0 0 1 1 0 0
selection        1 0 1 0 0 0 
technology       0 1 0 1 1 0]

Tf-idf matrix from paper:

acceptance     [ 0   0.4   0   0.3   0.7  0
information      0   0.7   0   0.5   0    0
media            0.3  0   0.2   0    0    1
model            0    0   0.6   0.5  0    0
selection        0.9  0   0.6   0    0    0 
technology       0   0.4   0   0.3   0.7  0]

My tf-idf matrix:

acceptance     [ 0   0.4   0   0.3   0.7  0
information      0   0.7   0   0.5   0    0
media            0.5  0   0.4   0    0    1
model            0    0   0.6   0.5  0    0
selection        0.8  0   0.6   0    0    0 
technology       0   0.4   0   0.3   0.7  0]

My code:

from gensim import models

tfidf = models.TfidfModel(corpus)  # corpus: bag-of-words documents, lists of (term_id, count) pairs
corpus_tfidf = tfidf[corpus]

I also tried another approach:

from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts).toarray()  # counts is the term-document matrix

But that did not give the expected result either.

The results differ because, as you noticed, papers use many different methods to calculate TF-IDF. As the Wikipedia TF-IDF page notes, TF-IDF is calculated as

tfidf(t,d,D) = tf(t,d) . idf(t,D)

and both tf(t,d) and idf(t,D) can be computed with different functions, which changes the final TF-IDF value. Which functions are used depends on the application.
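To get a feel for how much the choice of idf function matters, here is a small sketch that evaluates three common variants for the six terms in the question. The document frequencies are read off the posted term-document matrix (6 documents); the dictionary keys are just labels chosen for this example. The third, smoothed form is the one scikit-learn's TfidfTransformer applies by default, which is one reason its output differs as well.

```python
import math

# Document frequency of each term, counted from the term-document matrix in the question.
doc_freq = {"acceptance": 3, "information": 2, "media": 3,
            "model": 2, "selection": 2, "technology": 3}
num_docs = 6

# Three idf definitions that all appear in practice.
idf_variants = {
    "log2(D/df)": lambda df: math.log2(num_docs / df),
    "ln(D/df)": lambda df: math.log(num_docs / df),
    "ln((1+D)/(1+df)) + 1": lambda df: math.log((1 + num_docs) / (1 + df)) + 1,
}

for name, idf in idf_variants.items():
    values = {term: round(idf(df), 3) for term, df in doc_freq.items()}
    print(name, values)
```

The first two variants differ only by a constant factor, so they give the same matrix once each document is normalized to unit length. The smoothed variant changes the relative weights of the terms, so the difference survives normalization.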

Gensim's TF-IDF model lets you supply your own functions for tf(t,d) and idf(t,D), as described in its documentation:

Compute tf-idf by multiplying a local component (term frequency) with a global component (inverse document frequency), and normalizing the resulting documents to unit length. Formula for unnormalized weight of term i in document j in a corpus of D documents:

weight_{i,j} = frequency_{i,j} * log_2(D / document_freq_{i})

or, more generally:

weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D)

so you can plug in your own custom wlocal and wglobal functions.

Default for wlocal is identity (other options: math.sqrt, math.log1p, ...) and default for wglobal is log_2(total_docs / doc_freq), giving the formula above.
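The quoted formula can be traced by hand with a small stand-in for the model. The sketch below is not gensim's implementation: `tfidf_weights` is a hypothetical helper written for this answer that follows the documented formula, takes documents in the bag-of-words form gensim uses (lists of `(term_id, count)` pairs), and accepts `wlocal` and `wglobal` callables with the same argument order as in the quote.

```python
import math

def tfidf_weights(corpus, wlocal=lambda tf: tf,
                  wglobal=lambda df, total_docs: math.log2(total_docs / df)):
    """Hypothetical helper: weight = wlocal(tf) * wglobal(df, D), scaled to unit length."""
    total_docs = len(corpus)
    # Document frequency: number of documents that contain each term id.
    doc_freq = {}
    for doc in corpus:
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    result = []
    for doc in corpus:
        weights = [(t, wlocal(tf) * wglobal(doc_freq[t], total_docs)) for t, tf in doc]
        # Fall back to 1.0 so an all-zero document does not cause a division by zero.
        norm = math.sqrt(sum(w * w for _, w in weights)) or 1.0
        result.append([(t, w / norm) for t, w in weights])
    return result

# Columns of the question's term-document matrix; term ids follow the row order
# (0 = acceptance, 1 = information, 2 = media, 3 = model, 4 = selection, 5 = technology).
corpus = [
    [(2, 1), (4, 1)],
    [(0, 1), (1, 1), (5, 1)],
    [(2, 1), (3, 1), (4, 1)],
    [(0, 1), (1, 1), (3, 1), (5, 1)],
    [(0, 1), (5, 1)],
    [(2, 2)],
]

default = tfidf_weights(corpus)                   # identity tf, log2(D/df) idf
damped = tfidf_weights(corpus, wlocal=math.sqrt)  # same idf, square-root term frequency
print(default[0])
print(damped[5])
```

With the defaults, the first document comes out at roughly 0.53 for media and 0.85 for selection, which lines up with the 0.5 and 0.8 in the question's own matrix if the values there were truncated to one decimal. In gensim itself, the same kind of callables are passed as the `wlocal` and `wglobal` keyword arguments of `models.TfidfModel`.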

To reproduce the paper's result exactly, you need to know which functions it used to calculate the TF-IDF matrix.
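As one illustration of how such a search might go, the sketch below tests a guess: that the paper's global weight uses a term's total number of occurrences across the corpus in place of the number of documents containing it. This is an inference from the posted numbers, not something stated in the paper or in gensim's documentation, and it also assumes both matrices were truncated (not rounded) to one decimal. The names `tfidf_matrix`, `document_frequency` and `collection_frequency` are made up for this example.

```python
import math

terms = ["acceptance", "information", "media", "model", "selection", "technology"]
# Term-document matrix from the question: rows are terms, columns are documents.
counts = [
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 2],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 1, 0],
]
num_docs = len(counts[0])

def tfidf_matrix(global_count):
    """Weight = tf * log2(D / global_count(row)), with columns scaled to unit length."""
    idf = [math.log2(num_docs / global_count(row)) for row in counts]
    weighted = [[tf * idf[i] for tf in row] for i, row in enumerate(counts)]
    norms = [math.sqrt(sum(weighted[i][j] ** 2 for i in range(len(counts)))) or 1.0
             for j in range(num_docs)]
    # Truncate to one decimal; the small epsilon guards against 0.999... artifacts.
    return [[math.floor(w / norms[j] * 10 + 1e-9) / 10 for j, w in enumerate(row)]
            for row in weighted]

def document_frequency(row):
    return sum(1 for tf in row if tf > 0)  # number of documents containing the term

def collection_frequency(row):
    return sum(row)                        # total occurrences (the assumed variant)

by_df = tfidf_matrix(document_frequency)
by_cf = tfidf_matrix(collection_frequency)
for term, a, b in zip(terms, by_df, by_cf):
    print(f"{term:12s} df: {a}  cf: {b}")
```

Only media changes between the two variants, because it is the only term that occurs more than once in a single document. On the posted data, the `df` rows match the matrix labelled "My tf-idf matrix" and the `cf` rows match the paper's matrix, so this variant is a plausible candidate. Confirming it still requires checking the definition given in the paper itself.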

There is also a good example in the Gensim Google group that shows how to use custom functions for calculating TF-IDF.
