How can I apply pandas rolling + apply only to selected rows?
import pandas as pd

df = pd.DataFrame({'A': range(10)})
# We want the rolling mean values at rows [4, 8]
rows_to_select = [4, 8]
# We can calculate the rolling values of all rows first, then do the selection
roll_mean = df.A.rolling(3).mean()
result = roll_mean[rows_to_select]
But this is not an option when dealing with a very large dataset where only a subset of the rolling values is needed. Is it possible to do some kind of rolling + selection + apply?
We could create sliding windows as views into the input series to give ourselves a 2D array, then simply index it with the selected rows and compute the average along the second axis of that 2D array. Since window i covers elements i through i+W-1, the window that ends at row r sits at index r-W+1. That gives the desired output, and it's all done in a vectorized manner.
To get those sliding windows, there's an easy built-in in skimage: view_as_windows. We will make use of it.
The implementation would be -
import numpy as np
from skimage.util.shape import view_as_windows

W = 3  # window length
# Get sliding windows as a 2D view; row i covers elements i..i+W-1
w = view_as_windows(df['A'].to_numpy(copy=False), W)
# Pick the window ending at each selected row and take its mean
out_ar = w[np.asarray(rows_to_select) - W + 1].mean(1)
# Output as a Series, if we need it in that format
out_s = pd.Series(out_ar, index=df.index[rows_to_select])
An alternative to view_as_windows, with the intention of keeping it within NumPy, would be strided_app -
w = strided_app(df['A'].to_numpy(copy=False),L=W,S=1)
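strided_app is not defined in this answer, so here is a minimal sketch of what such a helper might look like, assuming a 1D input array, a window length L and a step S. It is built on np.lib.stride_tricks.as_strided; the body is an assumption for illustration, not necessarily the original helper.

```python
import numpy as np
import pandas as pd

def strided_app(a, L, S):
    """Hypothetical helper: read-only 2D view of windows of length L, step S, over 1D array a."""
    a = np.ascontiguousarray(a)
    if a.ndim != 1 or a.size < L:
        raise ValueError("expected a 1D array with at least L elements")
    nrows = (a.size - L) // S + 1   # number of complete windows
    n = a.strides[0]                # byte step between consecutive elements
    return np.lib.stride_tricks.as_strided(
        a, shape=(nrows, L), strides=(S * n, n), writeable=False
    )

df = pd.DataFrame({'A': range(10)})
rows_to_select = [4, 8]
W = 3

w = strided_app(df['A'].to_numpy(copy=False), L=W, S=1)
# The window ending at row r is at index r - W + 1
out_ar = w[np.asarray(rows_to_select) - W + 1].mean(axis=1)
print(out_ar)  # [3. 7.]
```

On NumPy 1.20 or later, np.lib.stride_tricks.sliding_window_view(a, W) should give the same windows for S=1 without hand-written strides.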
Extend to all reduction operations
Any NumPy reduction function that accepts an axis argument (np.min, np.max, np.sum, np.mean and so on) can be used with this method, like so -
def rolling_selected_rows(s, rows, W, func):
    # Get sliding windows as a 2D view
    w = view_as_windows(s.to_numpy(copy=False), W)
    # Pick the window ending at each selected row and reduce it with func
    out_ar = func(w[np.asarray(rows) - W + 1], axis=1)
    # Output as a Series, if we need it in that format
    out_s = pd.Series(out_ar, index=s.index[rows])
    return out_s
Hence, to get rolling min values for the selected rows for the given sample, it would be -
In [91]: rolling_selected_rows(df['A'], rows_to_select, W=3, func=np.min)
Out[91]:
4 2
8 6
dtype: int64
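One thing to watch for: a selected row smaller than W-1 makes the index r-W+1 negative, which NumPy silently wraps around to the end of the array. The following is a rough sketch of a guarded variant that returns NaN for such rows, the way pandas rolling does, and checks the result against the full rolling computation. The function name rolling_selected_rows_checked is hypothetical, and NumPy's sliding_window_view (NumPy 1.20+) is swapped in for view_as_windows only to keep the example free of the skimage dependency.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def rolling_selected_rows_checked(s, rows, W, func):
    """Hypothetical guarded variant: NaN for rows without a full window."""
    rows = np.asarray(rows)
    w = sliding_window_view(s.to_numpy(copy=False), W)
    valid = rows >= W - 1                  # rows that have W values behind them
    out = np.full(rows.shape, np.nan)      # float output so NaN can be stored
    out[valid] = func(w[rows[valid] - W + 1], axis=1)
    return pd.Series(out, index=s.index[rows])

df = pd.DataFrame({'A': range(10)})
rows = [1, 4, 8]   # row 1 has no full window of length 3

for name, func in [('min', np.min), ('max', np.max), ('mean', np.mean)]:
    fast = rolling_selected_rows_checked(df['A'], rows, 3, func)
    full = getattr(df['A'].rolling(3), name)().iloc[rows]
    # Both approaches should agree, including the NaN at row 1
    assert np.allclose(fast.to_numpy(), full.to_numpy(), equal_nan=True)
```

The output is always float here, which is a design choice to make room for NaN; the unguarded version keeps the input dtype.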
I feel like you can do this with a for loop. As you mentioned, when the DataFrame is large and we only need a couple of values, there is no benefit in running over the whole DataFrame, especially since rolling is considered a memory-costly function.
n = 3
# .loc slices by label and includes both endpoints, so x-n+1:x covers n rows
l = [df['A'].loc[x-n+1:x].mean() for x in rows_to_select]
l
[3.0, 7.0]
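Note that the .loc slice above is label-based, so it relies on the default integer index. If the index holds other labels, a positional version with .iloc may be safer. A small sketch, assuming every selected position is at least n-1 (the letter index here is made up for illustration):

```python
import pandas as pd

# Hypothetical non-integer index to show that positions, not labels, are used
df = pd.DataFrame({'A': range(10)}, index=list('abcdefghij'))
rows_to_select = [4, 8]   # positions, not labels
n = 3

col = df['A']
# .iloc slices exclude the end point, hence x + 1
means = [float(col.iloc[x - n + 1 : x + 1].mean()) for x in rows_to_select]
print(means)  # [3.0, 7.0]
```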