I have a DataFrame and want to mask (replace with NaN) values that do not have enough non-NaN values within a rolling window. An example DataFrame can be created as follows:
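import numpy as np
import pandas as pd
# cast to float so that NaN can be assigned without a dtype change
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD')).astype(float)
for col in df.columns:
    df.loc[df.sample(frac=0.25).index, col] = np.nan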
      A     B     C     D
0  38.0  39.0   NaN  82.0
1  44.0  47.0   NaN   NaN
2   NaN  24.0  67.0   NaN
3  96.0   NaN   NaN  68.0
4  53.0   NaN  27.0  93.0
I want to create a rolling window of width 4 and, for each window, keep the value only if the window contains at least min_periods non-NaN values.
I thought this would be trivial simply by using:
df.rolling(4, min_periods=2).apply(lambda x: x)
However, apply does not seem to accept such a lambda function, and it raises pandas.core.base.DataError: No numeric types to aggregate.
You can iterate over the windows and keep only those that contain a certain number of NaN values (or the opposite).
windowed_ds = df.rolling(4, min_periods=2)
windows_2_keep = []
for w in windowed_ds:
    # total NaN values in the window, across all columns
    total_is_na_in_window = w.isna().sum().sum()
    # keep only windows with more than 2 NaN values
    if total_is_na_in_window > 2:
        windows_2_keep.append(w)
    # we can also do operations like mean or sum on each window
    # window_mean = w.mean().mean()
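The per-window NaN total that the loop computes can also be obtained without iterating. The following is a minimal sketch, not part of the original answer: it counts the NaNs in each row, then sums those counts over a trailing window of 4 rows. The helper name `nan_total_per_window` and the small `demo` frame are made up for illustration, and `min_periods=1` is an assumption chosen so that the shorter windows at the start are counted too.

```python
import numpy as np
import pandas as pd

def nan_total_per_window(frame, window=4):
    """Total number of NaNs (all columns) in the trailing window ending at each row."""
    nan_per_row = frame.isna().sum(axis=1)
    # min_periods=1 (an assumption) keeps the partial windows at the start
    return nan_per_row.rolling(window, min_periods=1).sum().astype(int)

# hypothetical demo data, small enough to check by hand
demo = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, 4.0, 5.0, np.nan],
    "B": [np.nan, 2.0, np.nan, np.nan, 5.0, 6.0],
})
nan_totals = nan_total_per_window(demo)

# labels of the rows whose window holds more than 2 NaNs, mirroring the loop's condition
rows_to_keep = nan_totals[nan_totals > 2].index
```

Here `rows_to_keep` holds the index labels at which a qualifying window ends; the window itself can be recovered by slicing the 4 rows up to and including that label.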
Another solution is to apply a custom function to the window that counts the NaN values of the entire window and performs an aggregation based on a condition. This is much faster than the for loop.
windowed_ds = df.rolling(4, min_periods=2)

def agg_function(ser):
    # ser is one window of a single column; its index selects the same rows of the full df
    nan_counts = df.loc[ser.index].isna().sum().sum()
    print('window', df.loc[ser.index])
    print(nan_counts)
    # compute the mean only if the window has more than 2 NaN values
    if nan_counts > 2:
        print('window mean', df.loc[ser.index].mean().mean())
        print('--------------')
        return df.loc[ser.index].mean().mean()
    else:
        print('window mean', 0)
        print('--------------')
        return 0

# returns a series (or a df, depending on the agg function of the window) with the
# aggregation result of each window. The selected column "A" is arbitrary; it
# only determines how many times agg_function is called by the apply method
result = windowed_ds.A.apply(agg_function, raw=False)
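For the goal stated in the question (keeping a value only when its window has enough non-NaN entries), a vectorized sketch can avoid `apply` altogether, since `Rolling.apply` expects the function to return one scalar per window rather than the window itself. This is an illustration under assumptions, not the answer's method: the window is taken to be the trailing 4 rows ending at each value, each column is treated independently, and the helper name `mask_sparse_windows` and the `example` frame are hypothetical.

```python
import numpy as np
import pandas as pd

def mask_sparse_windows(frame, window=4, min_valid=2):
    """Replace values with NaN where the trailing window has fewer than min_valid non-NaN entries."""
    # per-column count of non-NaN entries in the trailing window ending at each row
    valid_counts = frame.notna().astype(int).rolling(window, min_periods=1).sum()
    # DataFrame.where keeps values where the condition holds and writes NaN elsewhere
    return frame.where(valid_counts >= min_valid)

# hypothetical example: only the last value has 2 non-NaN entries in its window
example = pd.DataFrame({"A": [1.0, np.nan, np.nan, np.nan, 5.0, 6.0]})
masked = mask_sparse_windows(example)
```

With this definition, values at the start of a column are masked until enough non-NaN entries have accumulated in the window. If the window should be centered instead, `rolling` accepts `center=True`.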