
Pandas rolling time window custom functions with multiple columns

I have time-series data in a pandas DataFrame that looks like this:

ts                         serial_number    device_tp       tp
2017-09-19T15:00:00.000Z    4ktpjlv     21.7760333333333    17
2017-09-19T14:00:00.000Z    4ktpjlv     19.8849833333333    16
2017-09-19T13:00:00.000Z    4ktpjlv     18.8565818181818    15
2017-09-19T12:00:00.000Z    4ktpjlv     18.7219666666667    13
2017-09-19T11:00:00.000Z    4ktpjlv     18.8341272727273    13
2017-09-19T10:00:00.000Z    4ktpjlv     18.9697833333333    14
2017-09-19T09:00:00.000Z    4ktpjlv     19.0422416666667    14

I'm trying to compute the Pearson correlation coefficient between tp and device_tp and to apply the Dynamic Time Warping algorithm (using fastdtw) over a rolling time window. For each sample, I look back over the last 12 hours and calculate the correlation coefficient and the DTW distance.

I know that pandas has a rolling function; however, it doesn't return a DataFrame but a Series (or an array?). The problem is that both the correlation coefficient and fastdtw need two arguments in order to work: df.tp and df.device_tp.
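To make the limitation concrete, here is a minimal sketch (with made-up sample values) showing that rolling(...).apply calls the function once per column with a 1-D window, so a two-argument function such as pearsonr never sees both columns at the same time:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"device_tp": [21.8, 19.9, 18.9, 18.7],
                   "tp": [17.0, 16.0, 15.0, 13.0]})

windows = []

def probe(w):
    # record what apply actually receives, then return a dummy scalar
    windows.append(np.asarray(w))
    return 0.0

# Even on a DataFrame, apply is invoked column by column with a 1-D array,
# so a function that needs both tp and device_tp cannot see them together.
df.rolling(2).apply(probe, raw=True)
```

Every recorded window is one-dimensional (a single column's values), which is why the two-column computation cannot be expressed directly this way.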

I found another way, using loops, to get what I want:

import pandas as pd
from scipy import stats
from fastdtw import fastdtw

for key, meas in df.iterrows():
    now = meas.ts
    start_date = now - pd.Timedelta(hours=12)
    new_df = df[(df['ts'] >= start_date) & (df['ts'] < now)]
    if new_df.shape[0] > 1:
        # use the windowed slice, not the full DataFrame
        tp = new_df.tp.values
        device_tp = new_df.device_tp.values
        distance, _ = fastdtw(tp, device_tp)
        corr = stats.pearsonr(tp, device_tp)[0]
        # ... Predict flag here
        if flag == 0:
            output = output.append(meas)

But of course it's really not time-efficient! I wonder what a better way to do this would be. I read some things about redefining the rolling function instead of using the pandas built-in one, but I can't really see how to do that.

Thanks for helping!

OK, so an efficient way to get the windowed correlation is df["device_tp"].rolling(12, min_periods=2).corr(other=df["tp"]).
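For illustration, a minimal runnable sketch of that one-liner, using values adapted from the question's data (oldest first):

```python
import pandas as pd

df = pd.DataFrame({
    "device_tp": [19.042, 18.970, 18.834, 18.722, 18.857, 19.885, 21.776],
    "tp":        [14.0, 14.0, 13.0, 13.0, 15.0, 16.0, 17.0],
})

# Each output value correlates the current row with up to 11 preceding ones;
# min_periods=2 produces a result as soon as two samples are available
# instead of waiting for a full 12-row window.
rolling_corr = df["device_tp"].rolling(12, min_periods=2).corr(other=df["tp"])
```

Note that this window is defined by row count, not by time; with one sample per hour, 12 rows correspond to the 12-hour lookback from the question.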

I cannot think of an equally straightforward way to get the DTW distance. One solution that gives me around an 8x speedup is to roll over a pd.Series of indices and use the resulting indices inside Rolling.apply:

import numpy as np
import pandas as pd
from fastdtw import fastdtw

def rolling_dtw(df, win=12, center=False, min_periods=2,
                col1="device_tp", col2="tp"):
    # roll over positional indices so both columns are reachable inside apply
    indices = pd.Series(np.arange(len(df)), index=df.index)
    a = df[col1].values
    b = df[col2].values

    def rolldist(inds):  # calculate DTW for the current window's indices
        inds = inds.astype(int)  # rolling passes floats; cast back to ints
        return fastdtw(a[inds], b[inds])[0]

    return indices.rolling(win, center=center,
                           min_periods=min_periods).apply(rolldist, raw=True)

But that solution also isn't too pretty. Because the window is defined by row count, the samples are assumed to be evenly spaced at 1-hour intervals (so 12 rows equal 12 hours). You would need to adjust it if that is not the case.
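If the samples are not evenly spaced, the same index-rolling trick works with a time-offset window instead of a row count: carry the positional indices on a DatetimeIndex and pass "12h" to rolling. A hypothetical sketch, using a plain absolute-difference sum as a stand-in distance so it runs without fastdtw installed; swap in fastdtw(a[inds], b[inds])[0] for the real thing:

```python
import numpy as np
import pandas as pd

def rolling_dist_time(df, window="12h", ts_col="ts",
                      col1="device_tp", col2="tp", min_periods=2):
    a = df[col1].to_numpy()
    b = df[col2].to_numpy()
    # positional indices carried on a DatetimeIndex enable offset windows;
    # the offset window requires a monotonically increasing index
    pos = pd.Series(np.arange(len(df)), index=pd.DatetimeIndex(df[ts_col]))
    pos = pos.sort_index()

    def dist(inds):
        inds = inds.astype(int)  # rolling passes floats; cast back to ints
        return np.abs(a[inds] - b[inds]).sum()  # stand-in distance metric

    return pos.rolling(window, min_periods=min_periods).apply(dist, raw=True)
```

Because a and b are indexed positionally against the original DataFrame, sorting the index Series does not desynchronize the columns.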
