Pandas boolean operations are inconsistent with one comparison vs. many comparisons
I am trying to filter out rows in my dataframe (with > 400000 rows) where the values in one column are None. The goal is to leave my dataframe with only the rows that have float values in the 'Column' column. I plan on doing this by passing in an array of booleans, except that I can't construct my array of booleans properly (they all come back True).
When I run the following operation, given a value of i within the df range, the comparison works:
df.loc[i, 'Column'] != None
The rows that have a value of None in 'Column' give the result False.
But when I run this operation:
df.loc[0:len(df), 'Column'] != None
The boolean array comes back as all True.
Why is this? Is this a pandas bug? An edge case? Intended behaviour for reasons I don't understand?
I can think of other ways to construct my boolean array, though this seems the most efficient. But it bothers me that this is the result I am getting.
Here's a reproducible example of what you're seeing:
import pandas as pd
x = pd.Series([1, None, 3, None, None])
print(x != None)
0 True
1 True
2 True
3 True
4 True
dtype: bool
What's not obvious is that, behind the scenes, Pandas converts your series to numeric and converts those None values to np.nan:
print(x)
0 1.0
1 NaN
2 3.0
3 NaN
4 NaN
dtype: float64
The NumPy array underlying the series can then be held in a contiguous memory block and support vectorised operations. Since np.nan != np.nan by design, your Boolean series will contain only True values, even if you were to test against np.nan instead of None.
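A quick check of that behaviour, as a minimal sketch (it assumes NumPy and pandas are installed and rebuilds the same series x used above; the variable names are only illustrative):

```python
import numpy as np
import pandas as pd

x = pd.Series([1, None, 3, None, None])

# NaN never compares equal to anything, including itself (IEEE 754 semantics).
nan_is_unequal_to_itself = np.nan != np.nan

# Element-wise comparison against np.nan is therefore True everywhere,
# for real values and missing values alike.
mask_vs_nan = x != np.nan

# notnull() is the check that actually distinguishes missing entries.
mask_notnull = x.notnull()

print(nan_is_unequal_to_itself)
print(mask_vs_nan.tolist())
print(mask_notnull.tolist())
```

The inequality test cannot separate missing values from real ones, which is why a dedicated null check is needed.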
For efficiency and correctness, you should use pd.to_numeric with isnull / notnull for checking null values:
print(pd.to_numeric(x, errors='coerce').notnull())
0 True
1 False
2 True
3 False
4 False
dtype: bool
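Applied to the original problem, the same mask can be passed straight to the dataframe to keep only the rows with a usable value. The sketch below is illustrative only: the dataframe contents and the 'Other' column are made up, and 'Column' stands in for the real column name.

```python
import pandas as pd

# Hypothetical data standing in for the real 400000-row dataframe.
df = pd.DataFrame({
    "Column": [1.5, None, 3.0, None, None],
    "Other": ["a", "b", "c", "d", "e"],
})

# Coerce to numeric (anything non-numeric becomes NaN), then flag non-null rows.
mask = pd.to_numeric(df["Column"], errors="coerce").notnull()

# Boolean indexing keeps only the rows where the mask is True.
filtered = df[mask]

print(filtered)
```

If the column is already numeric, df.dropna(subset=['Column']) should select the same rows in a single call.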