
Pandas rolling apply on selected rows

How can I apply pandas rolling + apply only to selected rows?

import pandas as pd

df = pd.DataFrame({'A':range(10)})

# We want the rolling mean values at rows [4,8]
rows_to_select = [4,8]

# We can calculate rolling values of all rows first, then do the selections
roll_mean = df.A.rolling(3).mean()
result = roll_mean[rows_to_select]

But this is not an option when dealing with a very large dataset where only a subset of the rolling values is needed. Is it possible to do some kind of rolling + selection + apply?

Using sliding-windowed views

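We could create sliding windows as views into the input series to give ourselves a 2D array, in which window i covers rows i through i+W-1 for a window length W. A rolling value that ends at a selected row therefore sits at index row-W+1 of this 2D array, so we simply index it with those positions and compute average values along the second axis. That's the desired output and it's all in a vectorized manner.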

To get those sliding windows, there's an easy built-in in skimage: view_as_windows. We will make use of it.

The implementation would be -

import numpy as np
from skimage.util.shape import view_as_windows

W = 3 # window length

# Get sliding windows
w = view_as_windows(df['A'].to_numpy(copy=False),W)

# Get selected rows of sliding windows. Get mean value.
out_ar = w[np.asarray(rows_to_select)-W+1].mean(1)

# Output as series if we need in that format
out_s = pd.Series(out_ar,index=df.index[rows_to_select])

An alternative to view_as_windows that stays within NumPy is strided_app, a small user-defined helper built on NumPy's as_strided -

w = strided_app(df['A'].to_numpy(copy=False),L=W,S=1)
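strided_app is not defined above. The following is a minimal sketch of what such a helper might look like, assuming it is built on np.lib.stride_tricks.as_strided. The function body is an assumption for illustration and not necessarily the exact helper the answer had in mind.

```python
import numpy as np
import pandas as pd


def strided_app(a, L, S):
    """Read-only 2D view of length-L windows over 1D array `a`, stepping by S.

    Hypothetical helper: the body is an assumed implementation.
    """
    a = np.ascontiguousarray(a)
    nrows = (a.size - L) // S + 1   # number of complete windows
    n = a.strides[0]                # bytes between consecutive elements
    return np.lib.stride_tricks.as_strided(
        a, shape=(nrows, L), strides=(S * n, n), writeable=False
    )


df = pd.DataFrame({'A': range(10)})
rows_to_select = [4, 8]
W = 3

# Same indexing as with view_as_windows: window r-W+1 ends at row r
w = strided_app(df['A'].to_numpy(copy=False), L=W, S=1)
out_ar = w[np.asarray(rows_to_select) - W + 1].mean(1)
```

On NumPy 1.20 or newer, np.lib.stride_tricks.sliding_window_view(a, W) produces the same windows for a step of 1 and avoids hand-written stride arithmetic.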

Extend to all reduction operations

Any NumPy reduction that accepts an axis argument (np.mean, np.min, np.max and so on) can be used with this method, like so -

def rolling_selected_rows(s, rows, W, func):
    # Get sliding windows
    w = view_as_windows(s.to_numpy(copy=False),W)
    
    # Get selected rows of sliding windows. Apply the reduction.
    out_ar = func(w[np.asarray(rows)-W+1],axis=1)
    
    # Output as series if we need in that format
    out_s = pd.Series(out_ar,index=s.index[rows])
    return out_s

Hence, to get rolling min values for the selected rows for the given sample, it would be -

In [91]: rolling_selected_rows(df['A'], rows_to_select, W=3, func=np.min)
Out[91]: 
4    2
8    6
dtype: int64
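One caveat with the indexing above: for a selected row smaller than W-1, the expression rows-W+1 goes negative, and NumPy then wraps around to a window at the end of the array instead of raising. Below is a hedged sketch of a guarded variant that returns NaN for such rows, as pandas rolling does. The name rolling_selected_rows_safe is hypothetical, rows are assumed to be non-negative positions, and np.lib.stride_tricks.sliding_window_view (NumPy 1.20+) is swapped in for view_as_windows to keep the example NumPy-only.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def rolling_selected_rows_safe(s, rows, W, func):
    # Hypothetical variant of rolling_selected_rows; rows are positions >= 0
    rows = np.asarray(rows)
    w = sliding_window_view(s.to_numpy(copy=False), W)

    # Window index for each selected row; negative means an incomplete window
    start = rows - W + 1
    valid = start >= 0

    # NaN for incomplete windows, reduction result for the rest
    out_ar = np.full(len(rows), np.nan)
    if valid.any():
        out_ar[valid] = func(w[start[valid]], axis=1)
    return pd.Series(out_ar, index=s.index[rows])


df = pd.DataFrame({'A': range(10)})
out = rolling_selected_rows_safe(df['A'], [1, 4, 8], W=3, func=np.min)
```

The result is always a float series here, because NaN cannot be stored in an integer array.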

I think you can do this with a for loop. As you mentioned, when the DataFrame is large and we only need a couple of values, there is no benefit in running over the whole DataFrame, especially since rolling can be costly in memory.

n=3
l=[df['A'].loc[x-n+1:x].mean() for x in rows_to_select]
l
[3.0, 7.0]
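The loop above slices with .loc, which is label-based and includes the end label, so it relies on the default integer index. If the selected rows are meant as positions and the index is something else, a positional variant might look like the sketch below. The string index is made up for illustration, and each selected position is assumed to be at least n-1.

```python
import pandas as pd

# Same data as the question, but with a non-integer index (illustrative)
df = pd.DataFrame({'A': range(10)}, index=list('abcdefghij'))
rows_to_select = [4, 8]   # positions, not labels
n = 3

a = df['A']
# iloc slices exclude the end position, hence x + 1
l = [a.iloc[x - n + 1:x + 1].mean() for x in rows_to_select]
```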
