What is the most efficient way to create a new binary Series based on a condition in Pandas when the old Series has missing data?

Question

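What is the most efficient* way to create a new pandas Series based on a binary condition when the underlying data is numeric or text and contains missing elements?

(*Efficient means minimizing RAM usage and run time on large Series.)

Examples below: is there a single code pattern that works for both numeric and text data (and other dtypes)? I have seen other questions on SO suggesting np.where(), but that gives the wrong answer when data is missing.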

import pandas as pd
import numpy as np
# create values
s1 = pd.Series(range(10,30))
# create missing
s1[s1 < 12] = np.nan  # np.NaN alias was removed in NumPy 2.0

# return new series based on binary condition that respects missing data?
# this does not respect missing data
np.where(s1>18, 'adult','not-adult')  # NaN values evaluate to false
# using series.gt does not help
s1.gt(18)

# pd.cut works for numeric data, but what if the underlying data/conditionals were strings? 
pd.cut(s1, bins=[0,18,100],labels=['Young','Old']) # works for numeric

# string example
s2 = pd.Series(['Saturday','Sunday','Monday',np.nan])
# np.where
np.where(s2.isin(['Saturday','Sunday']), 'weekend','not weekend')  # NaN values evaluate to false

## What code pattern is efficient/elegant that gives desired behavior?
## Output Series should be NaN wherever input Series is NaN

Answer 1

No, there is no single pattern, because each of these choices is logically different.

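Any ==, <, <=, >, or >= comparison involving at least one NaN evaluates to False. For NaN < 12, pandas is correct to return False, because that is the standard. Deviating from it requires your own logic.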
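A minimal sketch of what "your own logic" could look like for the numeric case, assuming the goal is a True/False/missing result. The use of the nullable "boolean" dtype and the variable names `plain` and `flags` are illustrative choices, not something the original answer prescribes.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(range(10, 30), dtype="float64")
s1[s1 < 12] = np.nan  # the first two values become missing

# Standard behaviour: comparisons against NaN are simply False
plain = s1.gt(18)

# Custom logic: convert to the nullable boolean dtype, then
# mark the result as missing wherever the input was missing
flags = s1.gt(18).astype("boolean").mask(s1.isna())

print(plain.head(3).tolist())
print(flags.head(3).tolist())
```

The extra cost is one boolean mask of the same length as the input, which is small next to the Series itself.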

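pd.cut follows the same logic as above, but with a different consequence. A value of s1 is grouped only if it falls into one of the bins. Because NaN does not fall into any of these bins, it is not binned and the output is NaN.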
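The binning behaviour can be checked with a short sketch that reuses the bins and labels from the question. The variable name `binned` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(range(10, 30), dtype="float64")
s1[s1 < 12] = np.nan  # the first two values become missing

# Bins are right-closed by default: (0, 18] and (18, 100]
binned = pd.cut(s1, bins=[0, 18, 100], labels=["Young", "Old"])

# NaN belongs to neither interval, so those rows stay missing
print(binned.isna().sum())
print(binned.value_counts())
```

The result is a categorical Series, which stores each label once plus small integer codes, so it tends to be compact for long inputs.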

In the last case, NaN is not in ['Saturday', 'Sunday'], so the result is False.
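Since no built-in pattern covers every case, one possible workaround is a small wrapper that applies np.where and then blanks out the rows where the input was missing. This is a sketch only: `binary_label` is a hypothetical helper name, not a pandas function, and it assumes the condition is a boolean Series aligned with the input.

```python
import numpy as np
import pandas as pd

def binary_label(s, condition, if_true, if_false):
    """Label rows by a boolean condition, keeping NaN where s is NaN.

    Hypothetical helper for illustration; `condition` must be a
    boolean Series (or array) with the same length as `s`.
    """
    out = pd.Series(np.where(condition, if_true, if_false), index=s.index)
    # Series.where keeps values where the mask is True, NaN elsewhere
    return out.where(s.notna())

# Text example from the question
s2 = pd.Series(["Saturday", "Sunday", "Monday", np.nan])
days = binary_label(s2, s2.isin(["Saturday", "Sunday"]), "weekend", "not weekend")

# Numeric example from the question
s1 = pd.Series(range(10, 30), dtype="float64")
s1[s1 < 12] = np.nan
ages = binary_label(s1, s1 > 18, "adult", "not-adult")

print(days.tolist())
print(ages.head(3).tolist())
```

The same call shape works for both dtypes because the condition is computed by the caller and only the missing-value handling is shared.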

Solution 1
0 · Accepted · 2019-08-23 15:49:58
