
How to return a pandas dataframe with a rolling window but no additional function applied to it?

I have a dataframe and I want to ignore (replace with NaN) values that do not have enough non-NaN values within a rolling window. An example dataframe can be created as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
for col in df.columns:
    df.loc[df.sample(frac=0.25).index, col] = np.nan

       A     B     C     D
0   38.0  39.0   NaN  82.0
1   44.0  47.0   NaN   NaN
2    NaN  24.0  67.0   NaN
3   96.0   NaN   NaN  68.0
4   53.0   NaN  27.0  93.0

I want to create a rolling window with width of 4 and for each window, I want to only keep the value if there are at least min_periods non-NaN values there.

I thought this would be trivial simply by using:

df.rolling(4, min_periods=2).apply(lambda x: x)

However, it seems apply doesn't allow such lambda functions, and a pandas.core.base.DataError: No numeric types to aggregate error is returned.

You can iterate over the windows and keep only the ones that have a certain number of NaN values (or the opposite).

windowed_ds = df.rolling(4,min_periods=2)
windows_2_keep = []
for w in windowed_ds:
    # total nan values in window
    total_is_na_in_window = w.isna().sum().sum()
    # keep only windows with more than 2 nan values
    if total_is_na_in_window >2:
        windows_2_keep.append(w)
    # we can also do operations like mean or sum on each window
    # window_mean = w.mean().mean()
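
To see what the loop yields, here is a small sketch on a hand-made frame. It assumes pandas 1.1 or later, where rolling objects can be iterated, and the names small, window_sizes, nan_counts and kept are made up for this illustration. Note that the first windows are shorter than 4 rows.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame (hypothetical data) so the counts can be checked by eye
small = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, np.nan],
    "B": [np.nan, np.nan, 6.0, 7.0, np.nan],
})

# One window per row; the leading windows hold fewer than 4 rows
windows = list(small.rolling(4, min_periods=2))
window_sizes = [len(w) for w in windows]

# Total NaN count over all columns of each window
nan_counts = [int(w.isna().sum().sum()) for w in windows]

# Same filter as in the loop above: keep windows with more than 2 NaN values
kept = [w for w, n in zip(windows, nan_counts) if n > 2]

print(window_sizes)
print(nan_counts)
print(len(kept))
```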

Another solution is to apply a custom function to the window that counts the NaN values of the entire window and performs an aggregation based on a condition. This is much faster than the for loop.

windowed_ds = df.rolling(4,min_periods=2)

def agg_function(ser):
    nan_counts = df.loc[ser.index].isna().sum().sum()
    print('window',df.loc[ser.index])
    
    print(nan_counts)
    # compute the mean only if the window has more than 2 NaN values
    if nan_counts>2:
        print('window mean',df.loc[ser.index].mean().mean())
        print('--------------')
        return df.loc[ser.index].mean().mean()
    else:
        print('window mean',0)
        print('--------------')
        return 0
# Returns a Series with one aggregated value per window. Column "A" is an
# arbitrary choice: it only determines how often agg_function is called, while
# the function itself looks up the full window in df via ser.index.

result = windowed_ds.A.apply(agg_function, raw=False)
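
If the goal is only the masking described in the question (keep a value when the window of 4 rows ending at it holds at least 2 non-NaN values, otherwise replace it with NaN), a vectorized sketch without apply could look like the one below. It assumes the count is taken per column rather than across all columns, and the names window, min_valid, valid_counts and masked are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small fixed example (hypothetical data) so the result is easy to verify
df = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, np.nan, 5.0, 6.0],
    "B": [np.nan, 2.0, 3.0, np.nan, np.nan, np.nan],
})

window = 4     # rolling window width
min_valid = 2  # required number of non-NaN values per window

# Count the non-NaN values in each trailing window, column by column.
# notna() yields 0/1 flags, so a rolling sum of them is a rolling count.
valid_counts = df.notna().astype(int).rolling(window, min_periods=1).sum()

# Keep the original value where the count is high enough, NaN elsewhere
masked = df.where(valid_counts >= min_valid)
print(masked)
```

Because the mask is built from a rolling sum of flags, no Python function runs once per window, which should scale better than either approach above on large frames.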
