
dtype comparison: == and isin produce different results for “object”

Minimal example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': [1, 2, 3], 'z': ['d', 'e', 'f']})
df

   x  y  z
0  a  1  d
1  b  2  e
2  c  3  f

df.dtypes

x    object
y     int64
z    object
dtype: object

The idea is to filter out columns which are of object type. I know this can be done using select_dtypes; the motivation behind this question is to examine the odd behaviour shown below.

== (and, as a consequence, .eq) works for comparing against a specific type.

df.dtypes == object

x     True
y    False
z     True
dtype: bool

However, isin does not:

df.dtypes.isin([object])
df.dtypes.isin(['object'])

x    False
y    False
z    False
dtype: bool

On the other hand, creating an np.dtype object and passing that does work.

df.dtypes.isin([np.dtype('O')])

x     True
y    False
z     True
dtype: bool

np.isin works here, so I see no reason for Series.isin to behave any differently.

np.isin(df.dtypes, object)
array([ True, False,  True])

np.isin(df.dtypes, 'object')
array([ True, False,  True])

isin seems to be causing trouble when checking for object types only. df.dtypes.isin(['int']) gives the expected result.

As a side note, I'm running these tests on 0.24.

pd.__version__
'0.24.2'

Is this a bug, or expected behaviour?

This comes down to pandas.Series.isin relying on hash tables in this case, whereas in 0.20.3 it could have gone down a different code path and used np.in1d, depending on your version of Python/NumPy.

Note that the hashes of np.dtype('O') and object are different, which explains the current failure:

In [2]: hash(np.dtype('O'))
Out[2]: 7065344498483383396

In [3]: hash(object)
Out[3]: 108607961

It looks like np.in1d is doing direct equality comparisons for objects, and the equality with object / 'object' is built into the definition of np.dtype('O') independent of hashes.
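As a rough illustration of that point, here is a minimal sketch that checks equality and hashing separately for np.dtype('O'). The variable names are mine, and the built-in list and set only stand in for an equality scan and a hash-table lookup; this is not how pandas implements isin internally.

```python
import numpy as np

obj_dtype = np.dtype('O')

# np.dtype defines ==, and it coerces the other operand to a dtype first,
# so both the type and its string name compare equal.
eq_type = obj_dtype == object
eq_name = obj_dtype == 'object'

# Hashing knows nothing about that coercion.
same_hash = hash(obj_dtype) == hash(object)

# A list membership test scans with ==; a set membership test checks the hash first.
in_list = object in [obj_dtype]
in_set = object in {obj_dtype}

print(eq_type, eq_name, in_list, same_hash, in_set)
```

The list lookup succeeds because it only needs ==, while the set lookup can only succeed if the two hashes happen to match.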

It also illustrates a larger issue with isin for pandas: objects that compare equal but have different hashes will fail isin for the small input case. Consider the following class:

class Foo(object):
    def __init__(self, hash_val):
        self.hash_val = hash_val

    def __hash__(self):
        return self.hash_val

    def __eq__(self, other):
        return isinstance(other, Foo)

Then we get:

In [5]: s = pd.Series([Foo(0), Foo(1), Foo(2)])

In [6]: s == Foo(3)
Out[6]:
0    True
1    True
2    True
dtype: bool

In [7]: s.isin([Foo(3)])
Out[7]:
0    False
1    False
2    False
dtype: bool

In [8]: np.in1d(s.values, [Foo(3)])
Out[8]: array([ True,  True,  True])
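The same mismatch can be reproduced without pandas or NumPy. The sketch below uses a hypothetical AlwaysEqual class (equivalent in spirit to Foo above) and contrasts a list lookup, which scans with ==, against a set lookup, which compares hashes before it ever calls ==. It is only an analogy for the two code paths, not the pandas implementation.

```python
class AlwaysEqual:
    """Hypothetical class: equal to every other instance, hash chosen by the caller."""

    def __init__(self, hash_val):
        self.hash_val = hash_val

    def __hash__(self):
        return self.hash_val

    def __eq__(self, other):
        return isinstance(other, AlwaysEqual)


items = [AlwaysEqual(0), AlwaysEqual(1), AlwaysEqual(2)]
probe = AlwaysEqual(3)

# Equality scan: every item matches the probe.
list_hits = [probe in [item] for item in items]

# Hash lookup: the probe's hash (3) matches none of 0, 1, 2, so == is never consulted.
set_hits = [probe in {item} for item in items]

# With a matching hash, the set lookup succeeds again.
matching_hash_hit = AlwaysEqual(1) in {AlwaysEqual(1)}

print(list_hits, set_hits, matching_hash_hit)
```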

Is this a bug? Probably, but I'm guessing it'd be a low-priority item to fix, given that this is a bit of a corner case and likely non-trivial to fix in a performant manner. For example, the current implementation has a comment indicating that object dtypes shouldn't be passed to np.in1d because it could raise, so simply delegating to np.in1d won't work.
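In the meantime, one possible workaround is to avoid hashing altogether and compare dtypes with ==. The sketch below is only an assumption about what such a helper could look like: dtypes_match_any is a hypothetical name, not a pandas function, and the columns are given an explicit object dtype so the example does not depend on how a particular pandas version infers string columns.

```python
import numpy as np
import pandas as pd


def dtypes_match_any(dtypes, candidates):
    """Hypothetical helper: True where a dtype equals any of the candidates.

    Candidates are normalized with np.dtype(), so object, 'object' and 'O'
    are all treated the same, and the comparison uses == rather than hashes.
    """
    wanted = [np.dtype(c) for c in candidates]
    return dtypes.map(lambda dt: any(dt == w for w in wanted))


df = pd.DataFrame({
    'x': pd.Series(['a', 'b', 'c'], dtype=object),
    'y': [1, 2, 3],
    'z': pd.Series(['d', 'e', 'f'], dtype=object),
})

mask = dtypes_match_any(df.dtypes, [object])
print(mask)
```

This is a linear scan over the candidates for every column, which is fine for a handful of dtypes but is not meant as a general replacement for isin.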
