Pandas boolean operations are inconsistent with one comparison vs. many comparisons

I am trying to filter out rows in my dataframe (more than 400,000 rows) where the value in one column is None. The goal is to keep only the rows whose value in the 'Column' column is a float. I plan to do this by indexing with an array of booleans, but I can't construct that array properly: every element comes back True.

When I run the following operation for a single value of i within the df range, the comparison works:

df.loc[i, 'Column'] != None 

Rows that have a value of None in 'Column' give the result False.

But when I run this operation:

df.loc[0:len(df), 'Column'] != None 

The boolean array comes back as all True.

Why is this? Is this a pandas bug? An edge case? Intended behaviour for reasons I don't understand?

I can think of other ways to construct my boolean array, though this seems the most efficient. Still, it bothers me that this is the result I am getting.

Here's a reproducible example of what you're seeing:

import pandas as pd

x = pd.Series([1, None, 3, None, None])

print(x != None)

0    True
1    True
2    True
3    True
4    True
dtype: bool

What's not obvious is that, behind the scenes, pandas converts your series to a numeric dtype and converts those None values to np.nan:

print(x)

0    1.0
1    NaN
2    3.0
3    NaN
4    NaN
dtype: float64
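As a rough illustration of that conversion, the sketch below builds the same kind of data twice: once letting pandas infer the dtype, and once forcing dtype=object so the None values are kept as Python objects. The variable names inferred and preserved are mine, not from the question.

```python
import numpy as np
import pandas as pd

# Let pandas infer the dtype: the mix of ints and None becomes float64.
inferred = pd.Series([1, None, 3])

# Force object dtype: None is stored as the Python object None.
preserved = pd.Series([1, None, 3], dtype=object)

print(inferred.dtype)              # float64
print(preserved.dtype)             # object
print(np.isnan(inferred.iloc[1]))  # True: None was converted to NaN
print(preserved.iloc[1] is None)   # True: None was kept as-is
```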

The NumPy array underlying the series can then be held in a contiguous memory block and support vectorised operations. Since np.nan != np.nan by design, your Boolean series will contain only True values, even if you were to test against np.nan instead of None.
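A quick way to convince yourself of the NaN behaviour is to run the comparisons directly. This is only a small sketch reusing the series x from the example above:

```python
import numpy as np
import pandas as pd

x = pd.Series([1, None, 3, None, None])

# NaN is never equal to anything, including itself.
print(np.nan != np.nan)      # True

# So an elementwise != against NaN is True everywhere ...
print((x != np.nan).all())   # True

# ... and an elementwise == against NaN never matches.
print((x == np.nan).any())   # False

# isna() is the check that actually finds the missing entries.
print(x.isna().tolist())     # [False, True, False, True, True]
```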

For efficiency and correctness, you should use pd.to_numeric with isnull / notnull for checking null values:

print(pd.to_numeric(x, errors='coerce').notnull())

0     True
1    False
2     True
3    False
4    False
dtype: bool
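To get from that mask to the filtered dataframe the question asks for, something like the following should work. It is a minimal sketch on made-up data: the small df here, the 'Other' column, and the object dtype of 'Column' are my assumptions, since the original dataframe isn't shown.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe; 'Column' holds floats and None.
df = pd.DataFrame({
    "Column": pd.Series([1.5, None, 3.0, None], dtype=object),
    "Other": ["a", "b", "c", "d"],
})

# Build the boolean mask: True where 'Column' holds a usable number.
mask = pd.to_numeric(df["Column"], errors="coerce").notnull()

# Index with the mask to keep only the rows with a value in 'Column'.
filtered = df[mask]
print(filtered)
```

If all you need is to drop the missing rows, df.dropna(subset=['Column']) is another option worth trying.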
