Scikit-learn TfidfTransformer yielding wrong results?
I'm getting "weird" results using scikit-learn's TfidfTransformer. Normally, I would expect a word that occurs in all documents in a corpus to have an idf equal to 0 (using no smoothing or normalization), since the formula I would use is the logarithm of the number of documents in the corpus divided by the number of documents containing the term. Apparently (as illustrated below) scikit-learn's implementation adds one to each idf value compared to my manual implementation. Does anybody know why? Again, notice that I have set smoothing and normalization to False/None.
In [101]: from sklearn.feature_extraction.text import TfidfTransformer
In [102]: counts
Out[102]:
array([[3, 0, 1],
[2, 0, 0],
[3, 0, 0],
[4, 0, 0],
[3, 2, 0],
[3, 0, 2]])
In [103]: transformer = TfidfTransformer(norm=None, smooth_idf=False)
In [104]: transformer
Out[104]:
TfidfTransformer(norm=None, smooth_idf=False, sublinear_tf=False,
use_idf=True)
In [105]: tfidf = transformer.fit_transform(counts)
In [106]: tfidf.toarray()
Out[106]:
array([[ 3. , 0. , 2.09861229],
[ 2. , 0. , 0. ],
[ 3. , 0. , 0. ],
[ 4. , 0. , 0. ],
[ 3. , 5.58351894, 0. ],
[ 3. , 0. , 4.19722458]])
In [107]: transformer.idf_
Out[107]: array([ 1. , 2.79175947, 2.09861229])
In [108]: idf1 = np.log(6/6)
In [109]: idf1
Out[109]: 0.0
In [110]: idf2 = np.log(6/1)
In [111]: idf2
Out[111]: 1.791759469228055
In [112]: idf3 = np.log(6/2)
In [113]: idf3
Out[113]: 1.0986122886681098
I have been unable to find any source that justifies adding one to the idf values. I'm using scikit-learn version 0.14.1.
Btw, a solution other than scikit-learn is not really useful to me, as I need to build a scikit-learn pipeline for grid search.
This is not a bug, it's a feature. The scikit-learn source contains:
# log1p instead of log makes sure terms with zero idf don't get
# suppressed entirely
idf = np.log(float(n_samples) / df) + 1.0
This +1 (as mentioned in the comment) is used to make the idf weighting weaker; otherwise, terms which occur in all the documents would be removed entirely (they would have idf = 0, so their whole tfidf would be 0).
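The effect of the +1 can be reproduced with plain NumPy from the counts matrix in the question (a minimal sketch, not scikit-learn's actual implementation, which also handles sparse input):

```python
import numpy as np

# Term-document counts from the question: 6 documents, 3 terms.
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]], dtype=float)

n_samples = counts.shape[0]           # 6 documents
df = (counts > 0).sum(axis=0)         # document frequency per term: [6, 1, 2]

plain_idf = np.log(n_samples / df)           # textbook idf: [0., 1.7918, 1.0986]
sklearn_idf = np.log(n_samples / df) + 1.0   # sklearn's variant: [1., 2.7918, 2.0986]

# With the +1, the first term (present in every document) keeps weight 1
# instead of being zeroed out of every tfidf vector.
tfidf = counts * sklearn_idf
print(sklearn_idf)   # matches transformer.idf_ above
print(tfidf)         # matches transformer.fit_transform(counts).toarray()
```

With `plain_idf` instead, the first column of `tfidf` would be all zeros, which is exactly what the comment in the source is guarding against.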