dtype comparison: == and isin produce different results for “object”

Question

Minimal example:

df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': [1, 2, 3], 'z': ['d', 'e', 'f']})
df

   x  y  z
0  a  1  d
1  b  2  e
2  c  3  f

df.dtypes

x    object
y     int64
z    object
dtype: object

The idea is to pick out the columns that are of object type. I know this can be done using select_dtypes; the motivation behind this question is to understand the odd behaviour shown below.

== (and, as a consequence, .eq) works for comparing against a specific type.

df.dtypes == object

x     True
y    False
z     True
dtype: bool

However, isin does not. Both of the following calls give the same result:

df.dtypes.isin([object])
df.dtypes.isin(['object'])

x    False
y    False
z    False
dtype: bool

OTOH, creating a np.dtype object and passing that does work.

df.dtypes.isin([np.dtype('O')])

x     True
y    False
z     True
dtype: bool

np.isin works here, so I see no reason for Series.isin to behave any differently.

np.isin(df.dtypes, object)
array([ True, False,  True])

np.isin(df.dtypes, 'object')
array([ True, False,  True])

isin seems to be causing trouble when checking for object types only. df.dtypes.isin(['int']) gives the expected result.

As a side note, I'm running these tests on 0.24.

pd.__version__
'0.24.2'

Is this a bug, or expected behaviour?

Answer 1

This comes down to pandas.Series.isin relying on hash tables in this case, whereas in 0.20.3 it could have gone down a different code path and used np.in1d, depending on your version of Python/NumPy.

Note that the hashes of np.dtype('O') and object are different, which explains the current failure:

In [2]: hash(np.dtype('O'))
Out[2]: 7065344498483383396

In [3]: hash(object)
Out[3]: 108607961

It looks like np.in1d is doing direct equality comparisons for objects, and the equality with object / 'object' is built into the definition of np.dtype('O') independent of hashes.
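
The mismatch between equality and hashing can be checked directly. This is a minimal sketch, assuming only NumPy is installed. The hash values themselves vary between runs and machines, so only the comparisons are meaningful:

```python
import numpy as np

obj_dtype = np.dtype('O')

# Equality: NumPy tries to interpret the right-hand side as a dtype,
# so both the Python type and its string name compare equal.
eq_type = obj_dtype == object
eq_name = obj_dtype == 'object'

# Hashing: the two objects are hashed independently of that coercion.
same_hash = hash(obj_dtype) == hash(object)

# A list membership test uses ==, while a set looks up by hash first.
in_list = object in [obj_dtype]
in_set = object in {obj_dtype}

print(eq_type, eq_name, same_hash, in_list, in_set)
```

When the hashes differ, as in the output above, the set lookup misses even though the list lookup succeeds.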

It also illustrates a larger issue with isin for pandas: objects that compare equal but have different hashes will fail isin for the small input case. Consider the following class:

class Foo(object):
    def __init__(self, hash_val):
        self.hash_val = hash_val

    def __hash__(self):
        return self.hash_val

    def __eq__(self, other):
        return isinstance(other, Foo)

Then we get:

In [5]: s = pd.Series([Foo(0), Foo(1), Foo(2)])

In [6]: s == Foo(3)
Out[6]:
0    True
1    True
2    True
dtype: bool

In [7]: s.isin([Foo(3)])
Out[7]:
0    False
1    False
2    False
dtype: bool

In [8]: np.in1d(s.values, [Foo(3)])
Out[8]: array([ True,  True,  True])
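
The same effect shows up with plain Python containers, which makes the mechanism easier to see. This is only an analogy, assuming the hash-table lookup behaves like a Python set in this respect (match on hash first, then check equality). It is not the pandas implementation:

```python
class Foo:
    def __init__(self, hash_val):
        self.hash_val = hash_val

    def __hash__(self):
        return self.hash_val

    def __eq__(self, other):
        # Every Foo compares equal to every other Foo, whatever its hash.
        return isinstance(other, Foo)


items = [Foo(0), Foo(1), Foo(2)]
probe = Foo(3)

# List membership scans the elements and compares with ==.
found_by_eq = probe in items

# Set membership looks up hash(probe) first; no stored element has hash 3,
# so __eq__ is never consulted.
found_by_hash = probe in set(items)

# A new instance whose hash matches a stored element is found.
found_same_hash = Foo(1) in set(items)

print(found_by_eq, found_by_hash, found_same_hash)
```

Here found_by_eq and found_same_hash are True while found_by_hash is False, mirroring the difference between == / np.in1d and isin above.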

Is this a bug? Probably, but I'm guessing it'd be a low-priority item to fix, given that this is a bit of a corner case and likely non-trivial to fix in a performant manner (i.e. the current implementation has a comment indicating that object dtypes shouldn't be passed to np.in1d as it could raise, so simply delegating to np.in1d won't work).
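
In the meantime, if the goal is just to make the membership test work on df.dtypes, two workarounds are possible. This is a minimal sketch: dtype_isin is a hypothetical helper (not part of pandas), and the string columns are created with dtype=object explicitly so the result does not depend on how the installed pandas version infers string columns:

```python
import numpy as np
import pandas as pd


def dtype_isin(dtypes, candidates):
    """Equality-based membership test (hypothetical helper, not a pandas API)."""
    return pd.Series(
        [any(dt == c for c in candidates) for dt in dtypes],
        index=dtypes.index,
    )


df = pd.DataFrame({
    'x': pd.Series(['a', 'b', 'c'], dtype=object),
    'y': [1, 2, 3],
    'z': pd.Series(['d', 'e', 'f'], dtype=object),
})

# Option 1: compare with == instead of relying on hashes.
mask_eq = dtype_isin(df.dtypes, [object])

# Option 2: normalise the candidates to np.dtype objects first, so that
# they hash the same way as the entries of df.dtypes.
mask_norm = df.dtypes.isin([np.dtype(c) for c in [object]])

print(mask_eq)
print(mask_norm)
```

Option 2 is the np.dtype('O') approach from the question, applied to every candidate. Option 1 is slower on long inputs but does not depend on hashing at all.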

1 answer

Answer 1: 6 ACCEPTED 2019-06-06 23:54:01
