
Most efficient method for creating new binary Series based on conditional in Pandas when the old series has missing data?

What is the most efficient* way to create a new pandas series based on a binary condition when the underlying data is numeric or text, and contains missing elements?

(*efficient means minimizing RAM utilization and time to run on a big Series)

Examples are below. Is there a single code pattern that is optimal for both numeric and text (and other dtypes)? I have seen other questions on SO suggest np.where(), but this gives the wrong answer in the presence of missing data.

import pandas as pd
import numpy as np
# create values
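s1 = pd.Series(range(10, 30), dtype='float64')  # float dtype so the Series can hold NaN
# create missing
s1[s1 < 12] = np.nan  # np.nan, since the np.NaN alias was removed in NumPy 2.0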

# return new series based on binary condition that respects missing data?
# this does not respect missing data
np.where(s1>18, 'adult','not-adult')  # NaN values evaluate to false
# using series.gt does not help
s1.gt(18)

# pd.cut works for numeric data, but what if the underlying data/conditionals were strings? 
pd.cut(s1, bins=[0,18,100],labels=['Young','Old']) # works for numeric

# string example
s2 = pd.Series(['Saturday', 'Sunday', 'Monday', np.nan])
# np.where
np.where(s2.isin(['Saturday','Sunday']), 'weekend','not weekend')  # NaN values evaluate to false

## What code pattern is efficient/elegant that gives desired behavior?
## Output Series should be NaN wherever input Series is NaN

No, there's not a single pattern because each selection is logically different.

Any ==, <, <=, >, or >= comparison with at least one NaN operand evaluates to False (only != evaluates to True). pandas is correct in returning False for a comparison such as NaN > 18 because that is what the IEEE 754 floating-point standard specifies. Deviating from this requires your own logic.
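The comparison rule can be checked directly. This is a minimal sketch; the three-element Series is made up for illustration and is not the data from the question:

```python
import numpy as np
import pandas as pd

nan = np.nan

# Equality and ordering comparisons that involve NaN are all False.
comparisons = [nan == nan, nan < 12, nan <= 12, nan > 18, nan >= 18]
print(comparisons)

# The exception is !=, which is True.
print(nan != nan)

# pandas applies the same rule element-wise, so the NaN row becomes False.
s = pd.Series([np.nan, 15.0, 25.0])
print((s > 18).tolist())
```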

With pd.cut the logic is the same, but the consequence differs. A value is assigned a label only if it falls within one of the bins. Since NaN is not within any of those bins, it doesn't get binned, and the output is NaN.
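A small illustration of the binning behaviour, reusing the bins and labels from the question on a made-up three-element Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 15.0, 25.0])

# Default bins are right-closed: (0, 18] -> 'Young', (18, 100] -> 'Old'.
# NaN falls in neither interval, so it stays missing in the result.
binned = pd.cut(s, bins=[0, 18, 100], labels=['Young', 'Old'])
print(binned)
```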

In the final case, NaN is not in ['Saturday', 'Sunday'], so it's False.
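One way to supply that custom logic is to compute the labels with np.where and then blank out the rows where the input was missing, using Series.where with Series.notna. The sketch below assumes that approach; label_where is a hypothetical helper name, not a pandas function, and it has not been benchmarked for speed or memory against alternatives:

```python
import numpy as np
import pandas as pd

def label_where(source, condition, if_true, if_false):
    """Hypothetical helper: label rows by a boolean condition,
    leaving NaN wherever `source` itself is missing."""
    labels = np.where(condition, if_true, if_false)
    out = pd.Series(labels, index=source.index, dtype=object)
    # Series.where keeps values where the mask is True and sets NaN elsewhere.
    return out.where(source.notna())

# Numeric example (small made-up data).
s1 = pd.Series([np.nan, 15.0, 25.0])
ages = label_where(s1, s1 > 18, 'adult', 'not-adult')
print(ages)

# String example from the question.
s2 = pd.Series(['Saturday', 'Sunday', 'Monday', np.nan])
days = label_where(s2, s2.isin(['Saturday', 'Sunday']), 'weekend', 'not weekend')
print(days)
```

The same two steps work for any dtype because the missing-value mask is taken from the input Series rather than from the condition, at the cost of one extra pass over the data.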
