Minimal example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': [1, 2, 3], 'z': ['d', 'e', 'f']})
df
   x  y  z
0  a  1  d
1  b  2  e
2  c  3  f
df.dtypes
x object
y int64
z object
dtype: object
The idea is to filter out columns which are of object type. I know this can be done using select_dtypes; the motivation behind this question is to examine the weird behaviour I'm about to show you.
== (and, as a consequence, .eq) works for comparing against a specific type:
df.dtypes == object
x True
y False
z True
dtype: bool
However, isin does not:
df.dtypes.isin([object])
df.dtypes.isin(['object'])
x False
y False
z False
dtype: bool
OTOH, creating a np.dtype object and passing that does work:
df.dtypes.isin([np.dtype('O')])
x True
y False
z True
dtype: bool
np.isin works here, so I see no reason for Series.isin to behave any differently.
np.isin(df.dtypes, object)
array([ True, False, True])
np.isin(df.dtypes, 'object')
array([ True, False, True])
isin seems to cause trouble only when checking for object types; df.dtypes.isin(['int']) gives the expected result.
As a side note, I'm running these tests on 0.24.
pd.__version__
'0.24.2'
Is this a bug, or expected behaviour?
This comes down to pandas.Series.isin relying on hash tables in this case, whereas in 0.20.3 this could have gone down a different code path and used np.in1d, depending on your version of Python/NumPy.
Note that the hashes of np.dtype('O') and object are different, which explains the current failure:
In [2]: hash(np.dtype('O'))
Out[2]: 7065344498483383396
In [3]: hash(object)
Out[3]: 108607961
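The mismatch between equality and hashing can be checked directly. This is only a small sketch using NumPy alone; the variable names are illustrative, and the exact hash values will differ between machines and interpreter sessions.

```python
import numpy as np

obj_dtype = np.dtype('O')

# Equality is defined by NumPy: the right-hand side is interpreted as a dtype.
equal_to_type = (obj_dtype == object)
equal_to_name = (obj_dtype == 'object')

# Hashing is independent of that equality, so the hashes do not line up.
same_hash = (hash(obj_dtype) == hash(object))

print(equal_to_type, equal_to_name, same_hash)
```

Any lookup that goes through a hash table compares hashes first, so the equality above never gets a chance to run.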
It looks like np.in1d is doing direct equality comparisons for objects, and the equality with object / 'object' is built into the definition of np.dtype('O'), independent of hashes.
It also illustrates a larger issue with isin in pandas: objects that compare equal but have different hashes will fail isin for the small input case. Consider the following class:
class Foo(object):
    def __init__(self, hash_val):
        self.hash_val = hash_val

    def __hash__(self):
        return self.hash_val

    def __eq__(self, other):
        return isinstance(other, Foo)
Then we get:
In [5]: s = pd.Series([Foo(0), Foo(1), Foo(2)])
In [6]: s == Foo(3)
Out[6]:
0 True
1 True
2 True
dtype: bool
In [7]: s.isin([Foo(3)])
Out[7]:
0 False
1 False
2 False
dtype: bool
In [8]: np.in1d(s.values, [Foo(3)])
Out[8]: array([ True, True, True])
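The same split between hash-based and equality-based membership shows up in plain Python, without pandas or NumPy. The sketch below is an analogy rather than a description of pandas internals: a set plays the role of the hash table, and a list plays the role of element-by-element comparison.

```python
class Foo(object):
    def __init__(self, hash_val):
        self.hash_val = hash_val

    def __hash__(self):
        return self.hash_val

    def __eq__(self, other):
        # Every Foo compares equal to every other Foo.
        return isinstance(other, Foo)


haystack_list = [Foo(0), Foo(1), Foo(2)]
haystack_set = set(haystack_list)  # all three are kept: their hashes differ

needle = Foo(3)

# List membership falls back on ==, so the needle is "found".
in_list = needle in haystack_list

# Set membership checks the hash first; no stored hash equals 3, so == is never called.
in_set = needle in haystack_set

print(in_list, in_set)
```

Objects like this break the usual contract that equal objects must have equal hashes, which is why the two lookup strategies disagree.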
Is this a bug? Probably, but I'm guessing it'd be a low-priority item to fix, given that this is a bit of a corner case and likely non-trivial to fix in a performant manner (i.e. the current implementation has a comment indicating that object dtypes shouldn't be passed to np.in1d as it could raise, so simply delegating to np.in1d won't work).
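In the meantime, one way to sidestep the hash mismatch is to convert every candidate to a np.dtype before calling isin, since passing np.dtype('O') is the form shown to work in the question. The helper name dtypes_isin below is hypothetical, and the string columns are built with an explicit dtype=object on the assumption that newer pandas versions may otherwise infer a dedicated string dtype.

```python
import numpy as np
import pandas as pd

# Explicit dtype=object: newer pandas may infer a string dtype for these columns.
df = pd.DataFrame({
    'x': pd.Series(['a', 'b', 'c'], dtype=object),
    'y': [1, 2, 3],
    'z': pd.Series(['d', 'e', 'f'], dtype=object),
})


def dtypes_isin(frame, candidates):
    """Hypothetical helper: normalize candidates to np.dtype, then use isin."""
    # np.dtype(object) and np.dtype('object') both yield dtype('O'),
    # so the values hash the same way as the entries of frame.dtypes.
    normalized = [np.dtype(c) for c in candidates]
    return frame.dtypes.isin(normalized)


mask = dtypes_isin(df, [object, 'object'])
print(mask)

# Keep only the columns that are not of object dtype.
non_object = df.loc[:, ~mask]
print(list(non_object.columns))
```

For the plain task of dropping object columns, select_dtypes remains the simpler tool; this sketch is only meant for cases where an isin-style check on dtypes is wanted.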