How to return a pandas DataFrame with a rolling window but no additional function applied to it?
I have a dataframe and I want to ignore (replace with NaN) any value whose rolling window does not contain enough non-NaN values. A sample dataframe can be recreated with:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
# blank out a random 25% of the rows in each column
for col in df.columns:
    df.loc[df.sample(frac=0.25).index, col] = np.nan
      A     B     C     D
0  38.0  39.0   NaN  82.0
1  44.0  47.0   NaN   NaN
2   NaN  24.0  67.0   NaN
3  96.0   NaN   NaN  68.0
4  53.0   NaN  27.0  93.0
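I want to create a rolling window of width 4 and, for each window, keep the value only if the window contains at least min_periods non-NaN values.
I thought this would be trivial, simply using: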
df.rolling(4, min_periods=2).apply(lambda x: x)
However, it seems that apply does not accept a lambda function like this, and the call fails with pandas.core.base.DataError: No numeric types to aggregate.
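A note on the call above: Rolling.apply expects the function to return one value per window, so an identity lambda does not fit that contract. A minimal sketch of one possible workaround, assuming it is acceptable to return the last element of each window (the current row's value): pandas only calls the function when the window holds at least min_periods non-NaN values and writes NaN otherwise. The helper name last_value_if_enough and the small demo frame are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

def last_value_if_enough(frame, window=4, min_periods=2):
    # With raw=True each window arrives as a NumPy array; x[-1] is the
    # value of the current row. Windows with fewer than `min_periods`
    # non-NaN values are never passed to the lambda and yield NaN.
    rolled = frame.rolling(window, min_periods=min_periods)
    return rolled.apply(lambda x: x[-1], raw=True)

# Hypothetical demo data, chosen so the effect is easy to follow
demo = pd.DataFrame({"A": [1.0, np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, 8.0]})
masked = last_value_if_enough(demo)
print(masked)
```

In this demo the 4.0 in row 3 survives because its window (rows 0 to 3) holds two non-NaN values, while the 8.0 in row 7 is replaced by NaN because its window holds only one.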
You can iterate over the windows and keep only those with a certain number of NaN values (or the opposite).
# iterating over a Rolling object needs pandas 1.1 or later
windowed_ds = df.rolling(4, min_periods=2)
windows_2_keep = []
for w in windowed_ds:
    # total NaN values in the window, across all columns
    total_is_na_in_window = w.isna().sum().sum()
    # keep only windows with more than 2 NaN values
    if total_is_na_in_window > 2:
        windows_2_keep.append(w)
        # we can also do operations like mean or sum on each window
        # window_mean = w.mean().mean()
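Another solution is to apply a custom function to the window that counts the NaN values across the whole window and performs any aggregation you like, depending on a condition. This is much faster than the for loop.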
windowed_ds = df.rolling(4, min_periods=2)

def agg_function(ser):
    # ser only carries column "A"; use its index to look up the full window in df
    window = df.loc[ser.index]
    nan_counts = window.isna().sum().sum()
    print('window', window)
    print(nan_counts)
    # take the mean only if the window has more than 2 NaN values
    if nan_counts > 2:
        window_mean = window.mean().mean()
        print('window mean', window_mean)
        print('--------------')
        return window_mean
    else:
        print('window mean', 0)
        print('--------------')
        return 0

# Returns a Series with the aggregation result of each window. The choice of
# column "A" is arbitrary: it only determines how many times agg_function is
# run by apply. Windows where "A" has fewer than min_periods non-NaN values
# give NaN without agg_function being called.
result = windowed_ds.A.apply(agg_function, raw=False)
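For the masking described in the question (keep a value only when its window has enough non-NaN entries), a vectorized alternative may be simpler. The following is a minimal sketch and not part of the answer above: it counts the non-NaN values in each trailing window and uses DataFrame.where to blank out the rest. The names mask_sparse_windows, min_valid and sample are hypothetical.

```python
import numpy as np
import pandas as pd

def mask_sparse_windows(frame, window=4, min_valid=2):
    # 1.0 where a value is present, 0.0 where it is NaN
    present = frame.notna().astype(float)
    # number of non-NaN values in each trailing window, per column;
    # min_periods=1 so that short windows at the start are counted too
    valid_counts = present.rolling(window, min_periods=1).sum()
    # keep a value only where its window has enough non-NaN entries
    return frame.where(valid_counts >= min_valid)

# Hypothetical demo data
sample = pd.DataFrame({"A": [1.0, np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, 8.0]})
sparse_masked = mask_sparse_windows(sample)
print(sparse_masked)
```

Because the count is computed per column, each column is masked independently, which differs from the answers above where NaN values are totalled across all four columns of a window.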