
Optimizing a rolling window over a pandas DataFrame

I have a pandas Series like this:

dist
0.02422
0.03267
0.04208
0.05229
0.06291
...

It has almost 200K rows. Is there a more efficient way to perform the following operation:

df["dist"].rolling(3000).apply(lambda x: x.iloc[-1] - x[x != 0].iloc[0] if x[x != 0].shape[0] else 0).dropna().max()

Practically, I need to compute the difference between the first non-zero value and the last value of each window. The code above works, but I would like to know if there is a more efficient way to do the same operation.
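For example, on a toy series with a window of 3, the last window is [0.2, 0.0, 0.7]: its first non-zero value is 0.2 and its last value is 0.7, so it contributes 0.7 - 0.2 = 0.5, which is also the maximum here:

import pandas as pd

s = pd.Series([0.0, 0.0, 0.5, 0.2, 0.0, 0.7])
result = (s.rolling(3)
           .apply(lambda x: x.iloc[-1] - x[x != 0].iloc[0]
                  if x[x != 0].shape[0] else 0)
           .dropna()
           .max())
print(result)  # 0.5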

Thanks for your help.

The problem with your code is that it performs too many comparisons against 0: every window re-tests the same values. Since the values in the dist column never change, compare them against zero once and reuse the result.

Here's one way to do it:

import numpy as np
import pandas as pd

# The scale of our problem
n = 200_000
window_size = 3_000

# Some simulated data. 20% of `dist` is zero
n_zero = int(n * 0.2)
dist = np.hstack([np.zeros(n_zero), np.random.uniform(0, 1, n - n_zero)])
np.random.shuffle(dist)

df = pd.DataFrame({
    'dist': dist
})

# -----------------

# Convert dist to a numpy array. We do not need a pandas Series here.
dist = df['dist'].to_numpy()

# Find the indexes of all non-zero elements. nz = non-zero
nz_index, = np.nonzero(dist)

# For each row `i` in `dist`, find the first non-zero value within
# the next `window_size` rows. Store NaN if no such value exists.
dist_nz = np.empty_like(dist)
idx = 0
for i in range(len(dist)):
    # Advance idx so that nz_index[idx] is the first non-zero position >= i.
    # idx only ever moves forward, so the whole scan is O(n).
    if idx < len(nz_index) and i > nz_index[idx]:
        idx += 1
    if idx < len(nz_index) and nz_index[idx] - i < window_size:
        dist_nz[i] = dist[nz_index[idx]]
    else:
        dist_nz[i] = np.nan

# Last value of each window minus its first non-zero value. Windows
# with no non-zero value count as 0, matching the original expression.
diff = dist[window_size-1:] - dist_nz[:-window_size+1]
np.nan_to_num(diff, nan=0.0).max()

This completes in 0.3 s, compared to the 2 min 13 s of the original.
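If you want to check that the rewrite matches the original expression and reproduce the timing on your machine, something along these lines should do. This is just a sketch: it reuses `df` and `window_size` from above, `rolling_first_nz_diff` is simply the loop wrapped in a function of my own naming, and the 10,000-row slice size is arbitrary.

import time

import numpy as np

def rolling_first_nz_diff(dist, window_size):
    # Same algorithm as above, wrapped in a function for testing.
    nz_index, = np.nonzero(dist)
    dist_nz = np.empty_like(dist)
    idx = 0
    for i in range(len(dist)):
        if idx < len(nz_index) and i > nz_index[idx]:
            idx += 1
        if idx < len(nz_index) and nz_index[idx] - i < window_size:
            dist_nz[i] = dist[nz_index[idx]]
        else:
            dist_nz[i] = np.nan
    diff = dist[window_size - 1:] - dist_nz[:-window_size + 1]
    return np.nan_to_num(diff, nan=0.0).max()

# Cross-check against the original pandas expression on a slice small
# enough for the slow version to finish quickly.
small = df['dist'].iloc[:10_000]
expected = (small.rolling(window_size)
                 .apply(lambda x: x.iloc[-1] - x[x != 0].iloc[0]
                        if x[x != 0].shape[0] else 0)
                 .dropna()
                 .max())
assert np.isclose(rolling_first_nz_diff(small.to_numpy(), window_size), expected)

# Time the numpy version at full scale.
start = time.perf_counter()
rolling_first_nz_diff(df['dist'].to_numpy(), window_size)
print(f"{time.perf_counter() - start:.2f}s")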
