Which formula of tf-idf does the LSA model of gensim use?

There are many different ways in which tf and idf can be calculated. I want to know which formula gensim uses in its LSA model. I have been reading through its source file, lsimodel.py, but it is not obvious to me where the document-term matrix is created (probably because of memory optimizations).

In one LSA paper, I read that each cell of the document-term matrix is the log-frequency of that word in that document, divided by the entropy of that word:

tf(w, d) = log(1 + frequency(w, d))
idf(w, D) = 1 / (-Σ_D p(w) log p(w))

However, this seems to be a very unusual formulation of tf-idf. A more familiar form of tf-idf is:

tf(w, d) = frequency(w, d)
idf(w, D) = log(|D| / |{d ∈ D: w ∈ d}|)
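To make the familiar form concrete, here is a minimal sketch that evaluates it on a toy corpus. The corpus and the helper names (`tf`, `idf`, `tf_idf`) are made up for illustration, and the natural logarithm is an assumption, since the formula above does not fix a base.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "dog"],
    ["dog", "barked"],
]

def tf(word, doc):
    # Raw count of the word in one document.
    return Counter(doc)[word]

def idf(word, docs):
    # log(|D| / number of documents containing the word).
    # Assumes the word occurs in at least one document.
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "cat" occurs twice in docs[1] and appears in 2 of the 3 documents.
weight = tf_idf("cat", docs[1], docs)
print(weight)
```

Here the weight for "cat" in the second document is 2 * log(3 / 2).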

I also noticed that there is a question on how TfidfModel itself is implemented in gensim. However, I didn't see lsimodel.py importing TfidfModel, so I can only assume that lsimodel.py has its own implementation of tf-idf.

As I understand it, lsimodel.py does not perform the tf-idf encoding step at all. Gensim's API documentation describes a dedicated tf-idf model, which can be used to encode a corpus that is then fed into the LSA model. From the tfidfmodel.py source code, it appears that the latter of the two tf-idf definitions you listed is the one followed.
