
How to add a step to pandas data frame rolling window

I have a dataframe that contains time-series data from a gyroscope, sampled at 20 Hz (every 50 ms). I need to use a moving window of 4 seconds to calculate the DTW distance from a 4-second reference signal.

I'm using this code:

df['Gyro_Z_DTW']=df['Gyro_Z'].rolling(window='4s',min_periods=80).apply(DTWDistanceWindowed,raw=False)

where the function DTWDistanceWindowed() is the following:

def DTWDistanceWindowed(entry):
    w=10
    s1=entry
    s2=reference

    DTW={}

    w = max(w, abs(len(s1)-len(s2)))
    print('window = ',w)

    for i in range(-1,len(s1)):
        for j in range(-1,len(s2)):
            DTW[(i, j)] = float('inf') 


    DTW[(-1, -1)] = 0

    for i in range(len(s1)):
        for j in range(max(0, i-w), min(len(s2), i+w)):
            dist= (s1[i]-s2[j])**2
            DTW[(i, j)] = dist + min(DTW[(i-1, j)],DTW[(i, j-1)], DTW[(i-1, j-1)])

    return math.sqrt(DTW[len(s1)-1, len(s2)-1])

# adapted from http://alexminnaar.com/2014/04/16/Time-Series-Classification-and-Clustering-with-Python.html

It works, but I could save some time if the moving window slid by 500 ms each time instead of 50 ms.

Is there a way to do this?

I'm open to suggestions other than rolling if you know a better method.

One way could be to check whether the first index of entry (or any index, really) is a multiple of 500 ms, and return np.nan if it is not. The "complex" calculation then only happens every 500 ms. So the function would be:

def DTWDistanceWindowed(entry):
    if bool(entry.index[0].microsecond%500000):
        return np.nan
    w=10
    s1=entry
    ....# same as your function after

Interestingly, pd.Timestamp (the type of entry.index[0]) has a microsecond attribute but no millisecond attribute, so % 500000 is used.
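To see what this check does in isolation, here is a minimal sketch on a made-up 20 Hz index (the names idx, on_step and shifted are only for illustration, they are not from the question). It also shows a caveat: the check assumes the timestamps actually land on 500 ms boundaries. If the recording starts at some arbitrary offset, no row passes and every result would be NaN.

```python
import pandas as pd

# Hypothetical 20 Hz index, 40 samples covering 0 ms .. 1950 ms
idx = pd.date_range('2020-05-15', freq='50ms', periods=40)

# Vectorised form of `entry.index[0].microsecond % 500000 == 0`
on_step = (idx.microsecond % 500000 == 0)
print(int(on_step.sum()))      # rows that would trigger the calculation
print(idx[on_step])            # their timestamps

# Same index shifted by 7 ms: the timestamps no longer hit a 500 ms boundary
shifted = idx + pd.Timedelta('7ms')
shifted_on_step = (shifted.microsecond % 500000 == 0)
print(int(shifted_on_step.sum()))
```

If your timestamps are not aligned like this, one option would be to test the offset relative to the first timestamp of the frame instead of the absolute microsecond value.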

Edit: now, if you want to speed up the function, you can use numpy arrays like this:

import math
import numpy as np
import pandas as pd

# sample data
np.random.seed(6)
nb = 200
df = pd.DataFrame({'Gyro_Z': np.random.random(nb)},
                  index=pd.date_range('2020-05-15', freq='50ms', periods=nb))
reference = np.random.random(10)

# compute `a` with your original function, to use as the reference result
a = df['Gyro_Z'].rolling(window='4s', min_periods=80).apply(DTWDistanceWindowed, raw=False)

Define the function with numpy:

def DTWDistanceWindowed_np(entry):
    if bool(entry.index[0].microsecond%500000):
        return np.nan
    w=10
    s1=entry.to_numpy()
    l1 = len(s1) # calculate the length of s1 only once
    # definition of s2 and its length
    s2 = np.array(reference) 
    l2 = len(s2)

    w = max(w, abs(l1-l2))

    # create an array of inf and initialise
    DTW=np.full((l1+1,l2+1), np.inf)
    DTW[0, 0] = 0

    # avoid calculate some difference several times
    s1ms2 = (s1[:,None]-s2)**2
    # do the loop same way, note the small change in bounds
    for i in range(1,l1+1):
        for j in range(max(1, i-w), min(l2+1, i+w)):
            DTW[i, j] = s1ms2[i-1,j-1] + min(DTW[i-1, j],DTW[i, j-1], DTW[i-1, j-1])

    return math.sqrt(DTW[l1, l2])

# use it to create b
b = df['Gyro_Z'].rolling(window='4s',min_periods=80).apply(DTWDistanceWindowed_np,raw=False)

# compare every 10th non-NaN row of a with the non-NaN rows of b
print ((b.dropna() == a.dropna()[::10]).all())
# True

Timing:

#original solution
%timeit df['Gyro_Z'].rolling(window='4s',min_periods=80).apply(DTWDistanceWindowed,raw=False)
3.31 s ± 422 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# with numpy and 1 out of 10 rows
%timeit df['Gyro_Z'].rolling(window='4s',min_periods=80).apply(DTWDistanceWindowed_np,raw=False)
41.7 ms ± 9.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So adding the if bool(... check alone already makes it almost 10 times faster, and using numpy makes it roughly another 9 times faster. The speed-up may depend on the size of reference; I have not really checked this.
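If you are willing to leave rolling altogether, the stepping can also be done directly on the numpy array. The following is only a sketch under some assumptions: the sampling is perfectly regular (so 4 s is exactly 80 rows and 500 ms is exactly 10 rows), and signal and reference are random stand-ins for df['Gyro_Z'].to_numpy() and your real reference. The helper dtw_distance is a hypothetical name; it uses the same recurrence as DTWDistanceWindowed_np, just without the index check.

```python
import math
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def dtw_distance(s1, s2, w=10):
    """Windowed DTW distance, same recurrence as DTWDistanceWindowed_np."""
    l1, l2 = len(s1), len(s2)
    w = max(w, abs(l1 - l2))
    dtw = np.full((l1 + 1, l2 + 1), np.inf)
    dtw[0, 0] = 0.0
    cost = (s1[:, None] - s2) ** 2          # all pairwise squared differences
    for i in range(1, l1 + 1):
        for j in range(max(1, i - w), min(l2 + 1, i + w)):
            dtw[i, j] = cost[i - 1, j - 1] + min(dtw[i - 1, j],
                                                 dtw[i, j - 1],
                                                 dtw[i - 1, j - 1])
    return math.sqrt(dtw[l1, l2])

# Stand-in data (assumption): replace with your own signal and reference
rng = np.random.default_rng(6)
signal = rng.random(200)
reference = rng.random(80)

window, step = 80, 10                       # 4 s and 500 ms at 20 Hz
# One row per window position, then keep every 10th window
windows = sliding_window_view(signal, window)[::step]
distances = np.array([dtw_distance(win, reference) for win in windows])

# Row position of the last sample of each kept window,
# useful to write the results back into the dataframe
end_rows = np.arange(window - 1, len(signal), step)
print(distances.shape, end_rows[:3])
```

sliding_window_view needs numpy 1.20 or later. As far as I know, pandas 1.5 and later also accept a step argument in rolling, but only for fixed integer windows such as window=80, not for offset windows such as '4s'.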

Can you resample to 500 ms before applying the rolling function?

df['Gyro_Z'].resample('500ms').max().rolling(window='4s',min_periods=80).apply(DTWDistanceWindowed,raw=False)
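One thing to check before using this: after resampling to 500 ms, a 4 s window only contains 8 rows, so min_periods=80 can never be satisfied. The small sketch below (with a made-up series s, not the real gyroscope data) counts the rows per window and shows the effect of keeping min_periods=80.

```python
import numpy as np
import pandas as pd

# Hypothetical 20 Hz series, 200 samples (10 seconds)
idx = pd.date_range('2020-05-15', freq='50ms', periods=200)
s = pd.Series(np.arange(200, dtype=float), index=idx)

# One value per 500 ms bin
resampled = s.resample('500ms').max()
print(len(resampled))

# Number of rows that fall inside each 4 s window after resampling
counts = resampled.rolling(window='4s').count()
print(int(counts.max()))

# With min_periods=80 no window ever has enough rows, so the result is all NaN
strict = resampled.rolling(window='4s', min_periods=80).sum()
print(bool(strict.isna().all()))
```

So with this approach min_periods would have to be lowered (to 8, presumably), and the function would then compare 8 resampled values against the reference rather than the 80 raw samples, which is not the same calculation as in the question.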
