Why is vectorizer.fit_transform(x).astype('bool') different from vectorizer.set_params(binary=True).fit_transform(x)?

Here is a minimal example of what I'm talking about:

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

data = fetch_20newsgroups()
x = data.data

vec = TfidfVectorizer(min_df=0.01, max_df=0.5)
mat = vec.fit_transform(x).astype('bool')

vec.set_params(binary=True)
print(np.array_equal(mat, vec.fit_transform(x)))

This prints False. What is the underlying difference between setting binary=True and setting all nonzero values to True?

EDIT: As answered by @juanpa.arrivillaga, TfidfVectorizer(binary=True) still does the inverse document frequency calculation. However, I also noticed that CountVectorizer(binary=True) doesn't produce the same output as .astype('bool') either. Below is an example:

In [1]: import numpy as np
   ...: from sklearn.datasets import fetch_20newsgroups
   ...: from sklearn.feature_extraction.text import CountVectorizer
   ...:
   ...: data = fetch_20newsgroups()
   ...: x = data.data
   ...:
   ...: vec = CountVectorizer(min_df=0.01, max_df=0.5)
   ...: a = vec.fit_transform(x).astype('bool')
   ...:
   ...: vec.set_params(binary=True)
   ...: b = vec.fit_transform(x).astype('bool')
   ...: print(np.array_equal(a, b))
   ...:
False

In [2]: a
Out[2]:
<11314x2141 sparse matrix of type '<class 'numpy.bool_'>'
        with 950068 stored elements in Compressed Sparse Row format>

In [3]: b
Out[3]:
<11314x2141 sparse matrix of type '<class 'numpy.bool_'>'
        with 950068 stored elements in Compressed Sparse Row format>

The dimensions, dtype, and number of stored elements are the same, which leads me to believe the contents of those matrices are different. Yet just from eyeballing the output of print(a) and print(b), they look the same.
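A side note on the comparison itself, offered as an assumption rather than something the thread confirms: np.array_equal is written for array-like inputs, and np.asarray does not turn a SciPy sparse matrix into a dense array, so its result on two sparse matrices may not reflect an element-wise comparison. The sketch below uses a tiny two-document corpus (not the 20 newsgroups data) and compares the matrices in two sparse-aware ways.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Small illustrative corpus; the original post uses fetch_20newsgroups.
docs = [
    'The quick brown fox jumped over the lazy dog',
    'how much wood could a woodchuck chuck if a woodchuck could chuck wood',
]

a = CountVectorizer().fit_transform(docs).astype('bool')
b = CountVectorizer(binary=True).fit_transform(docs).astype('bool')

# Option 1: stay sparse and count the positions where the matrices differ.
n_diff = (a != b).nnz

# Option 2: densify first (fine for small matrices), then use np.array_equal.
dense_equal = np.array_equal(a.toarray(), b.toarray())

print(n_diff, dense_equal)
```

On this toy corpus both checks report that the two matrices hold the same values, which is what one would expect: a nonzero count and a count clipped to 1 both map to True.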

You are fundamentally confusing two things.

One is conversion to the NumPy boolean data type, which is the equivalent of the Python bool type: it accepts two values, True and False, except that it is stored as a single byte in the underlying primitive array.
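As a small illustration of that conversion (the array values here are made up for the example), casting to bool maps every nonzero entry to True and stores one byte per element:

```python
import numpy as np

# Illustrative values only; any nonzero float becomes True.
weights = np.array([0.0, 0.30151134, 0.60302269])
flags = weights.astype('bool')

print(flags)           # [False  True  True]
print(flags.dtype)     # bool
print(flags.itemsize)  # 1 (one byte per element)
```

The cast is a pure change of representation applied after the fact; it knows nothing about how the values were computed.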

The other is passing the binary argument to TfidfVectorizer, which changes the way the data is modeled. In short, if you use binary=True, the term counts will be binary, i.e. either seen or not seen. Then you do the usual tf-idf transformation. From the docs:

If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs.)

So you don't even get a boolean output.
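The parenthetical in the quoted docs can be tried directly. This is a minimal sketch, assuming that "idf and normalization" refers to the use_idf and norm parameters of TfidfVectorizer, and using a toy two-document corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'The quick brown fox jumped over the lazy dog',
    'how much wood could a woodchuck chuck if a woodchuck could chuck wood',
]

# Nonzero pattern of the default tf-idf output, as a dense boolean array.
mask = TfidfVectorizer().fit_transform(docs).astype('bool').toarray()

# binary=True with idf weighting and normalization switched off.
plain = TfidfVectorizer(binary=True, use_idf=False, norm=None).fit_transform(docs).toarray()

print(plain.dtype)                  # still a floating-point dtype
print(np.unique(plain))             # only 0.0 and 1.0 remain
print(np.array_equal(plain, mask))  # same pattern as the boolean cast
```

Even in this configuration the output is a float matrix of zeros and ones, not a boolean one; it only matches the .astype('bool') result in value.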

So consider:

In [10]: import numpy as np
    ...: from sklearn.feature_extraction.text import TfidfVectorizer
    ...:

In [11]: data = [
    ...:     'The quick brown fox jumped over the lazy dog',
    ...:     'how much wood could a woodchuck chuck if a woodchuck could chuck wood'
    ...: ]

In [12]: TfidfVectorizer().fit_transform(data).todense()
Out[12]:
matrix([[ 0.30151134,  0.        ,  0.        ,  0.30151134,  0.30151134,
          0.        ,  0.        ,  0.30151134,  0.30151134,  0.        ,
          0.30151134,  0.30151134,  0.60302269,  0.        ,  0.        ],
        [ 0.        ,  0.45883147,  0.45883147,  0.        ,  0.        ,
          0.22941573,  0.22941573,  0.        ,  0.        ,  0.22941573,
          0.        ,  0.        ,  0.        ,  0.45883147,  0.45883147]])

In [13]: TfidfVectorizer().fit_transform(data).todense().astype('bool')
Out[13]:
matrix([[ True, False, False,  True,  True, False, False,  True,  True,
         False,  True,  True,  True, False, False],
        [False,  True,  True, False, False,  True,  True, False, False,
          True, False, False, False,  True,  True]], dtype=bool)

And now notice that using binary=True still returns a floating-point type:

In [14]: TfidfVectorizer(binary=True).fit_transform(data).todense()
Out[14]:
matrix([[ 0.35355339,  0.        ,  0.        ,  0.35355339,  0.35355339,
          0.        ,  0.        ,  0.35355339,  0.35355339,  0.        ,
          0.35355339,  0.35355339,  0.35355339,  0.        ,  0.        ],
        [ 0.        ,  0.37796447,  0.37796447,  0.        ,  0.        ,
          0.37796447,  0.37796447,  0.        ,  0.        ,  0.37796447,
          0.        ,  0.        ,  0.        ,  0.37796447,  0.37796447]])

It just changes the results.
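To see where the binary=True numbers come from, here is a hedged sketch that splits the vectorizer into its two stages, CountVectorizer(binary=True) followed by TfidfTransformer with default settings, and checks one value by hand. The hand calculation relies on a property of this particular toy corpus: every term appears in only one document, so all idf weights are equal and cancel out under L2 normalization.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    'The quick brown fox jumped over the lazy dog',
    'how much wood could a woodchuck chuck if a woodchuck could chuck wood',
]

# Stage 1: presence/absence flags instead of raw counts.
flags = CountVectorizer(binary=True).fit_transform(docs)

# Stage 2: the usual idf weighting and L2 row normalization.
two_step = TfidfTransformer().fit_transform(flags).toarray()

one_step = TfidfVectorizer(binary=True).fit_transform(docs).toarray()
print(np.allclose(one_step, two_step))

# The first document has 8 distinct tokens, the second has 7 (the
# single-letter token 'a' is dropped by the default token pattern),
# so each nonzero entry is 1/sqrt(8) or 1/sqrt(7) respectively.
print(np.allclose(one_step[0].max(), 1 / np.sqrt(8)))
print(np.allclose(one_step[1].max(), 1 / np.sqrt(7)))
```

The values 1/sqrt(8) ≈ 0.35355339 and 1/sqrt(7) ≈ 0.37796447 match Out[14] above, which is why the binary flag changes the weights without ever producing a boolean matrix.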
