What is the most efficient* way to create a new pandas series based on a binary condition when the underlying data is numeric or text, and contains missing elements?
(*efficient means minimizing RAM utilization and time to run on a big Series)
Examples below - is there a single code pattern that is optimal for both numeric and text (and other dtypes)? I have seen other questions on SO suggest np.where(), but this gives the wrong answer in the presence of missing data
import pandas as pd
import numpy as np
# create values
s1 = pd.Series(range(10, 30), dtype=float)  # float dtype so NaN can be assigned below
# create missing
s1[s1 < 12] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
# return new series based on binary condition that respects missing data?
# this does not respect missing data
np.where(s1>18, 'adult','not-adult') # NaN values evaluate to false
# using series.gt does not help
s1.gt(18)
# pd.cut works for numeric data, but what if the underlying data/conditionals were strings?
pd.cut(s1, bins=[0,18,100],labels=['Young','Old']) # works for numeric
# string example
s2 = pd.Series(['Saturday','Sunday','Monday',np.nan])
# np.where
np.where(s2.isin(['Saturday','Sunday']), 'weekend','not weekend') # NaN values evaluate to false
## What code pattern is efficient/elegant that gives desired behavior?
## Output Series should be NaN wherever input Series is NaN
No, there's not a single pattern because each selection is logically different.
Any ==, <, <=, >, or >= comparison with at least one NaN evaluates to False. pandas is correct in returning False for NaN < 12 because that's the standard (IEEE 754 floating-point semantics). Deviating from this requires your own logic.
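To make the comparison rule concrete, here is a minimal sketch on a small made-up Series (the values and variable names are illustrative, not from the question). It also shows the one exception worth knowing: "not equal" is True for NaN.

```python
import numpy as np
import pandas as pd

# Small illustrative Series with one missing value
s = pd.Series([np.nan, 12.0, 25.0])

# Ordered and equality comparisons against NaN are False
gt = s.gt(18)   # NaN > 18 is False
eq = s.eq(s)    # NaN == NaN is False

# "Not equal" is the exception: NaN != NaN is True
ne = s.ne(s)

print(gt.tolist(), eq.tolist(), ne.tolist())
```

This is why np.where(s1>18, ...) silently sends missing rows to the "false" branch: the boolean mask never contains NaN, only True or False.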
With pd.cut it's the same logic as above, but a different consequence. A value of s1 is grouped only if it falls within one of the bins. Since NaN is not within any of those bins, NaN doesn't get binned, and the output is NaN.
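A short sketch of the pd.cut behavior, reusing the bins and labels from the question on a small made-up Series (the input values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative input: one missing value, one per bin
ages = pd.Series([np.nan, 15.0, 20.0])

# Same bins and labels as in the question: (0, 18] -> Young, (18, 100] -> Old
binned = pd.cut(ages, bins=[0, 18, 100], labels=['Young', 'Old'])

# The NaN row falls in no bin, so it stays missing in the categorical result
print(binned)
```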
In the final case, NaN is not in ['Saturday', 'Sunday'], so it's False.
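If the goal is for NaN to propagate, one possible form of "your own logic" is to build the labels with np.where and then blank out the rows where the input was missing using Series.where. This is only a sketch: label_binary is a hypothetical helper name, not a pandas function, and it has not been benchmarked for memory or speed on large Series.

```python
import numpy as np
import pandas as pd

def label_binary(s, cond, if_true, if_false):
    """Hypothetical helper: label rows by a boolean condition, keeping NaN where s is NaN."""
    # np.where treats NaN-derived False like any other False
    labels = pd.Series(np.where(cond, if_true, if_false), index=s.index, dtype=object)
    # Series.where keeps values where the mask is True and puts NaN elsewhere
    return labels.where(s.notna())

# Numeric example (values are illustrative)
s1 = pd.Series([np.nan, 15.0, 20.0])
out1 = label_binary(s1, s1 > 18, 'adult', 'not-adult')

# String example (values are illustrative)
s2 = pd.Series(['Saturday', 'Monday', np.nan], dtype=object)
out2 = label_binary(s2, s2.isin(['Saturday', 'Sunday']), 'weekend', 'not weekend')

print(out1.tolist())
print(out2.tolist())
```

The same two steps apply to both dtypes because the missing-value mask comes from s.notna() rather than from the condition itself.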