Most efficient method for creating a new binary Series based on a conditional in pandas when the old Series has missing data?
What is the most efficient* way to create a new pandas Series based on a binary condition when the underlying data is numeric or text and contains missing elements?
(*efficient means minimizing RAM utilization and run time on a big Series)
Examples below. Is there a single code pattern that is optimal for both numeric and text data (and other dtypes)? I have seen other questions on SO suggest np.where(), but this gives the wrong answer in the presence of missing data.
import pandas as pd
import numpy as np
# create values
s1 = pd.Series(range(10,30))
# create missing
s1[s1 < 12] = np.nan  # np.NaN alias was removed in NumPy 2.0
# return new series based on binary condition that respects missing data?
# this does not respect missing data
np.where(s1>18, 'adult','not-adult') # NaN values evaluate to false
# using series.gt does not help
s1.gt(18)
# pd.cut works for numeric data, but what if the underlying data/conditionals were strings?
pd.cut(s1, bins=[0,18,100],labels=['Young','Old']) # works for numeric
# string example
s2 = pd.Series(['Saturday','Sunday','Monday',np.nan])
# np.where
np.where(s2.isin(['Saturday','Sunday']), 'weekend','not weekend') # NaN values evaluate to false
## What code pattern is efficient/elegant that gives desired behavior?
## Output Series should be NaN wherever input Series is NaN
No, there's not a single pattern, because each selection is logically different.
Any ==, <, <=, >, or >= comparison with at least one NaN evaluates to False. pandas is correct in returning False for NaN < 12 because that's the standard (IEEE 754 floating-point semantics). Deviating from this requires your own logic.
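As a minimal sketch of what "your own logic" might look like for the numeric example, one option is to compute the labels with np.where() and then blank out the rows where the input was missing using Series.where(). The names is_adult and labels are illustrative only, and this is not claimed to be the fastest approach on a large Series.

```python
import numpy as np
import pandas as pd

# Same data as the question, built as float so NaN can be stored directly.
s1 = pd.Series(range(10, 30), dtype="float64")
s1[s1 < 12] = np.nan  # the first two values become missing

# Comparisons involving NaN are False, so missing rows look like "not-adult".
is_adult = s1 > 18

# np.where returns a plain ndarray, so wrap it back into a Series on the
# original index, then keep labels only where the input was not missing.
labels = pd.Series(np.where(is_adult, "adult", "not-adult"), index=s1.index)
labels = labels.where(s1.notna())

print(labels.head(4))
```

Series.where() keeps values where the condition is True and fills the rest with NaN, so the output is missing exactly where s1 is missing.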
With pd.cut it's the same logic as above, but a different consequence. A value is grouped only if it falls within one of the bins. Since NaN is not within any of those bins, it doesn't get binned, and the output is NaN.
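The pd.cut behavior described here can be checked on a small, made-up Series (the values below are chosen for illustration and are not the question's data). Bins are right-inclusive by default, so 18 lands in the first bin.

```python
import numpy as np
import pandas as pd

# Illustrative values, including one missing entry.
s1 = pd.Series([np.nan, 11.0, 15.0, 18.0, 19.0, 29.0])

# Bins are (0, 18] and (18, 100]; NaN falls in neither, so it stays NaN.
groups = pd.cut(s1, bins=[0, 18, 100], labels=["Young", "Old"])

print(groups)
```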
In the final case, NaN is not in ['Saturday', 'Sunday'], so the result is False.
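For the string example, a hedged sketch of one way to get NaN in the output wherever the input is NaN: map the boolean result of isin() to labels, then mask the rows that were missing. The names is_weekend and day_type are illustrative, and map() is used here for readability rather than speed.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(["Saturday", "Sunday", "Monday", np.nan])

# isin() treats NaN as "not in the list", so the last row is False.
is_weekend = s2.isin(["Saturday", "Sunday"])

# Turn True/False into labels, then set rows to NaN where the input was missing.
day_type = is_weekend.map({True: "weekend", False: "not weekend"}).mask(s2.isna())

print(day_type)
```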