Most efficient way to create a new binary Series based on a condition in pandas when the old Series has missing data?

Question

What is the most efficient* way to create a new pandas Series based on a binary condition when the underlying data is numeric or text and contains missing elements?

(*efficient means minimizing RAM utilization and run time on a big Series)

Examples are below. Is there a single code pattern that is optimal for numeric, text, and other dtypes? Other questions on SO suggest np.where(), but this gives the wrong answer in the presence of missing data.

import pandas as pd
import numpy as np
# create values
s1 = pd.Series(range(10, 30), dtype='float64')  # float dtype so NaN can be stored
# create missing
s1[s1 < 12] = np.nan  # np.NaN was removed in NumPy 2.0; use np.nan

# return new series based on binary condition that respects missing data?
# this does not respect missing data
np.where(s1>18, 'adult','not-adult')  # NaN values evaluate to false
# using series.gt does not help
s1.gt(18)

# pd.cut works for numeric data, but what if the underlying data/conditionals were strings? 
pd.cut(s1, bins=[0,18,100],labels=['Young','Old']) # works for numeric

# string example
s2 = pd.Series(['Saturday','Sunday','Monday',np.nan])
# np.where
np.where(s2.isin(['Saturday','Sunday']), 'weekend','not weekend')  # NaN values evaluate to false

## What code pattern is efficient/elegant that gives desired behavior?
## Output Series should be NaN wherever input Series is NaN

Answer 1

No, there is no single pattern, because each selection is logically different.

Any ==, <, <=, >, or >= comparison with at least one NaN evaluates to False. pandas is correct in returning False for NaN < 12 because that is the standard (IEEE 754) behavior. Deviating from this requires your own logic.
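As a rough illustration of what "your own logic" could look like, here is a minimal sketch. It assumes one acceptable approach is to build the labels with np.where and then blank them out wherever the input is missing, using Series.where with s1.notna(). This is one possible pattern, not something the answer itself prescribes.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(range(10, 30), dtype="float64")
s1[s1 < 12] = np.nan

# Ordered and equality comparisons against NaN are False; only != is True.
assert not (np.nan > 18)
assert not (np.nan == np.nan)
assert np.nan != np.nan

# np.where sends the missing rows to the "else" branch.
labels = pd.Series(np.where(s1 > 18, "adult", "not-adult"), index=s1.index)

# Keep a label only where the input is present; the other rows become NaN.
result = labels.where(s1.notna())
```

The mask is a single vectorized pass over the Series, so the extra cost over plain np.where should be small, though that has not been benchmarked here.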

With pd.cut it is the same logic as above, but with a different consequence. A value is grouped only if it falls within one of the bins. Since NaN is not within any of those bins, it does not get binned, and the output is NaN.
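The pd.cut behavior described above can be checked with a short sketch that reuses the question's bins and labels. With the default right=True, the intervals are (0, 18] and (18, 100].

```python
import numpy as np
import pandas as pd

s1 = pd.Series(range(10, 30), dtype="float64")
s1[s1 < 12] = np.nan

# NaN falls in neither interval, so those rows stay NaN in the result.
binned = pd.cut(s1, bins=[0, 18, 100], labels=["Young", "Old"])

# The two missing inputs are the only missing outputs.
n_missing = int(binned.isna().sum())
```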

In the final case, NaN is not in ['Saturday', 'Sunday'], so isin returns False for it.
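For the string case, the same masking idea can be wrapped in a small helper. The function name label_keep_missing and its parameters are hypothetical, introduced only for this sketch; the pandas calls used (Series.isin, Series.notna, Series.where) are standard.

```python
import numpy as np
import pandas as pd

def label_keep_missing(s, mask, if_true, if_false):
    """Hypothetical helper: label rows by `mask`, leaving NaN where `s` is missing."""
    labels = pd.Series(np.where(mask, if_true, if_false), index=s.index)
    return labels.where(s.notna())

s2 = pd.Series(["Saturday", "Sunday", "Monday", np.nan])

# isin is False for the NaN row, so it would otherwise be labeled "not weekend".
is_weekend = s2.isin(["Saturday", "Sunday"])
weekend = label_keep_missing(s2, is_weekend, "weekend", "not weekend")
```

Because the helper only needs a boolean mask and the original Series, it can be called with a numeric condition such as s1 > 18 as well, which is about as close to a single pattern across dtypes as this approach gets.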

Solution 1
0, accepted, 2019-08-23 15:49:58