
Which formula of tf-idf does the LSA model of gensim use?

There are many different ways to calculate tf and idf. I want to know which formula gensim uses in its LSA model. I have been going through its source code, lsimodel.py, but it is not obvious to me where the document-term matrix is created (probably because of memory optimizations).

In one LSA paper, I read that each cell of the document-term matrix is the log-frequency of that word in that document, divided by the entropy of that word:

tf(w, d) = log(1 + frequency(w, d))
idf(w, D) = 1 / (-Σ_D p(w) log p(w))

However, this seems to be a very unusual formulation of tf-idf. A more familiar form of tf-idf is:

tf(w, d) = frequency(w, d)
idf(w, D) = log(|D| / |{d ∈ D: w ∈ d}|)
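To make the difference between the two formulations concrete, here is a minimal pure-Python sketch on a toy corpus. It is not taken from gensim or from the paper. The corpus and the helper names `log_entropy_weight` and `classic_tfidf` are made up for illustration, and I am assuming that p(w) in the entropy formula means the share of the word's total occurrences that falls in each document.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["cat", "sat", "mat", "cat"],
    ["dog", "sat", "log"],
    ["cat", "dog", "cat", "dog"],
]
counts = [Counter(doc) for doc in docs]


def log_entropy_weight(word, doc_index):
    """First formulation: log(1 + frequency) divided by the word's entropy.

    Assumption: p(w) is frequency(w, d) / total frequency of w over all
    documents, and the sum runs over the documents that contain w.
    """
    tf = math.log(1 + counts[doc_index][word])
    if tf == 0.0:
        return 0.0  # the word does not occur in this document
    freqs = [c[word] for c in counts if c[word] > 0]
    total = sum(freqs)
    entropy = -sum((f / total) * math.log(f / total) for f in freqs)
    if entropy == 0.0:
        # The word occurs in a single document, so 1 / entropy is undefined.
        return math.inf
    return tf / entropy


def classic_tfidf(word, doc_index):
    """Second formulation: raw frequency times log(|D| / document frequency)."""
    doc_freq = sum(1 for c in counts if c[word] > 0)
    if doc_freq == 0:
        return 0.0
    return counts[doc_index][word] * math.log(len(counts) / doc_freq)


for word in ("cat", "sat", "mat"):
    print(word, log_entropy_weight(word, 0), classic_tfidf(word, 0))
```

One thing the sketch makes visible: under my reading of the entropy formula, a word that occurs in only one document has zero entropy, so its weight is undefined (returned as infinity here), whereas the familiar form gives such a word the largest finite idf.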

I also notice that there is a question on how the TfidfModel itself is implemented in gensim. However, I didn't see lsimodel.py importing TfidfModel, and therefore can only assume that lsimodel.py has its own implementation of tf-idf.

As I understand it, lsimodel.py does not perform the tf-idf encoding step. You may find some details in gensim's API documentation: there is a dedicated tf-idf model, which can be used to encode a text that can later be fed into the LSA model. From the tfidfmodel.py source code, it appears that the latter of the two definitions of tf-idf you listed is followed.
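For illustration, here is a rough pure-Python sketch of what that second definition looks like when applied to a whole corpus. It is not gensim's code. I am assuming the defaults I believe tfidfmodel.py uses (raw term frequency, a base-2 logarithm for idf, and scaling each document vector to unit length), so please check these against the version you have installed. The function name `tfidf_vectors` is made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs, log_base=2.0):
    """Return one {word: weight} dict per document.

    weight = frequency(w, d) * log(|D| / document frequency of w),
    followed by L2 normalization of each document vector.
    The base-2 log and the normalization are assumptions about
    gensim's defaults, not something verified here.
    """
    counts = [Counter(doc) for doc in docs]
    num_docs = len(counts)
    # Document frequency: number of documents in which each word occurs.
    doc_freq = Counter(word for c in counts for word in c)

    vectors = []
    for c in counts:
        vec = {
            word: freq * math.log(num_docs / doc_freq[word], log_base)
            for word, freq in c.items()
        }
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {word: v / norm for word, v in vec.items()}
        vectors.append(vec)
    return vectors


# Hypothetical two-document corpus.
vectors = tfidf_vectors([["a", "b"], ["a", "c", "c"]])
print(vectors)
```

Vectors of this kind, rather than raw counts, are what you would pass on to the LSA step if you want tf-idf weighting; if you feed lsimodel.py a plain bag-of-words corpus, no such weighting is applied as far as I can tell.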
