Python: sliding window of variable width

I'm writing a Python program that processes data generated during experiments, and it needs to estimate the slope of the data. I've written a piece of code that does this quite nicely, but it's horribly slow (and I'm not very patient). Let me explain how this code works:

1) It grabs a small piece of data of size dx (starting with 3 datapoints)

2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)|) is larger than a certain minimum value (40x std. dev. of noise)

3) If the difference is large enough, it calculates the slope using OLS regression. If the difference is too small, it increases dx and redoes the loop with this new dx

4) This continues for all the datapoints

[See updated code further down]

For a dataset of about 100k measurements, this takes about 40 minutes, whereas the rest of the program (which does more processing than just this bit) takes about 10 seconds. I am certain there is a much more efficient way of doing these operations. Could you please help me out?

Thanks

EDIT:

OK, so I've solved the problem by using only binary searches, limiting the number of allowed steps to 200. I thank everyone for their input, and I selected the answer that helped me most.

FINAL UPDATED CODE:

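import numpy as np
import pywt as wt  # assumption: `wt` in the original refers to PyWavelets

def slope(self, data, time):
    # Estimate the noise level from the detail coefficients of a db3 wavelet transform.
    (wave1, wave2) = wt.dwt(data, "db3")
    std = 2*np.std(wave2)
    e = std/0.05   # minimum change required across the window (20*std)
    de = 5*std     # tolerance: any diff in [e, e + de) is accepted
    N = len(data)
    slopes = np.ones(shape=(N,))
    # Extend both ends by point reflection so windows near the edges stay in range.
    data2 = np.concatenate((-data[::-1]+2*data[0], data, -data[::-1]+2*data[N-1]))
    time2 = np.concatenate((-time[::-1]+2*time[0], time, -time[::-1]+2*time[N-1]))
    # n indexes the padded arrays; slopes[0] is never written and keeps its initial value 1.
    for n in range(N+1, 2*N):
        # Binary search (at most 200 steps) for a half-width mid-N whose diff lands in [e, e + de).
        left = N+1
        right = 2*N
        for i in range(200):
            mid = int(0.5*(left+right))
            diff = np.abs(data2[n-mid+N]-data2[n+mid-N])
            if diff >= e:
                if diff < e + de:
                    break
                right = mid - 1
                continue
            left = mid + 1
        leftlim = n - mid + N
        rightlim = n + mid - N
        # Regress on at most about 20 evenly spaced samples from the window.
        step = int(0.05*(rightlim-leftlim)+1)
        y = data2[leftlim:rightlim:step]
        x = time2[leftlim:rightlim:step]
        xavg = np.average(x)
        yavg = np.average(y)
        xlen = len(x)
        # Closed-form OLS slope: (sum(xy) - n*xavg*yavg) / (sum(xx) - n*xavg^2)
        slopes[n-N] = (np.dot(x,y)-xavg*yavg*xlen)/(np.dot(x,x)-xavg*xavg*xlen)
    return slopes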

How to optimize this will depend on some properties of your data, but here are some ideas:

  1. Have you tried profiling the code? Using one of the Python profilers can give you useful information about what's taking the most time. Often, a piece of code you've just written will have one dominant bottleneck, and it's not always obvious which part it is; profiling lets you figure that out and attack the main bottleneck first.

  2. Do you know what typical values of i are? If you have some idea, you can speed things up by starting with i greater than 0 (as @vhallac noted), or by increasing i by larger amounts: if you often see big values for i, increase i by 2 or 3 at a time; if the distribution of i has a long tail, try doubling it each time; etc.

  3. Do you need all the data when doing the least squares regression? If that function call is the bottleneck, you may be able to speed it up by using only some of the data in the range. Suppose, for instance, that at a particular point you need i to be 200 to see a large enough (above-noise) change in the data. You may not need all 400 points to get a good estimate of the slope: just using 10 or 20 points, evenly spaced in the start:end range, may be sufficient, and might speed up the code a lot.
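Ideas 2 and 3 can be sketched roughly as follows. This is only an illustration, not the asker's code: the function names, the `n_points` parameter, and the use of `np.polyfit` for the regression are my own assumptions, and the half-width `i` is assumed to mean that the window spans `n - i` to `n + i`.

```python
import numpy as np

def half_width_by_doubling(data, n, threshold, start=1):
    """Grow the half-width i geometrically until the change across the
    window [n - i, n + i] reaches `threshold` (hypothetical helper)."""
    i = max(1, start)
    max_i = min(n, len(data) - 1 - n)  # largest half-width that stays inside the array
    while i < max_i and abs(data[n + i] - data[n - i]) < threshold:
        i *= 2
    return min(i, max_i)

def subsampled_slope(time, data, n, i, n_points=20):
    """OLS slope over [n - i, n + i] using roughly n_points evenly spaced samples."""
    step = max(1, (2 * i + 1) // n_points)
    x = time[n - i:n + i + 1:step]
    y = data[n - i:n + i + 1:step]
    # A degree-1 polyfit returns [slope, intercept].
    return np.polyfit(x, y, 1)[0]

# Noise-free straight line, just to show the calls.
time = np.arange(1000.0)
data = 2.0 * time + 1.0
i = half_width_by_doubling(data, n=500, threshold=100.0)
print(i, subsampled_slope(time, data, n=500, i=i))
```

Doubling can overshoot the smallest sufficient window by up to a factor of two, so it trades some locality for far fewer threshold checks. Points right at the array edges (where the half-width comes back as 0) would need separate handling.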

Your comments suggest that you need a better method to estimate i_{k+1} given i_k. With no knowledge of the values in data, that leads to this naive algorithm:

At each iteration for n, leave i at its previous value, and check whether abs(data[start]-data[end]) is less than e. If it is, start from that previous value and find the new one by incrementing it by 1, as you do now. If it is greater than or equal to e, do a binary search on i to find the appropriate value. You could possibly do a binary search forwards as well, but finding a good candidate upper limit without knowledge of data can prove difficult. This algorithm won't perform worse than your current estimation method.
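A minimal sketch of that warm-started search, under a few assumptions of mine: `start` and `end` are taken to be `n - i` and `n + i`, the function names are hypothetical, and the binary search treats the difference as non-decreasing in i (if it is not, the search still returns a valid-looking index but not necessarily the smallest one).

```python
import numpy as np

def diff_at(data, n, i):
    """Absolute change across the window [n - i, n + i]."""
    return abs(data[n + i] - data[n - i])

def next_half_width(data, n, i_prev, e):
    """Find the half-width for point n, starting from the previous point's value."""
    max_i = min(n, len(data) - 1 - n)       # keep the window inside the array
    i = min(max(1, i_prev), max_i)
    if diff_at(data, n, i) < e:
        # Too small: keep growing by 1, as the original loop does.
        while i < max_i and diff_at(data, n, i) < e:
            i += 1
        return i
    # Large enough: binary search in [1, i] for the smallest half-width reaching e.
    lo, hi = 1, i
    while lo < hi:
        mid = (lo + hi) // 2
        if diff_at(data, n, mid) >= e:
            hi = mid
        else:
            lo = mid + 1
    return lo

data = 2.0 * np.arange(1000.0)
print(next_half_width(data, n=500, i_prev=3, e=100.0))    # grows upward from 3
print(next_half_width(data, n=500, i_prev=200, e=100.0))  # shrinks via binary search
```

In a loop over n, the value returned for one point would be passed as `i_prev` for the next.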

If you know that data is fairly smooth (no sudden jumps, and hence a smooth plot for all i values) and monotonically increasing, you can replace the binary search with a backwards search that decrements i by 1 each time instead.

I work with Python for similar analyses, and have a few suggestions to make. I didn't look at the details of your code, just at your problem statement:

1) It grabs a small piece of data of size dx (starting with 3 datapoints)

2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)|) is larger than a certain minimum value (40x std. dev. of noise)

3) If the difference is large enough, it calculates the slope using OLS regression. If the difference is too small, it increases dx and redoes the loop with this new dx

4) This continues for all the datapoints

I think the most obvious reason for the slow execution is the looping nature of your code, when you could perhaps use NumPy's vectorized (array-based) operations instead.

For step 1, instead of taking pairs of points, you can directly compute data[3:] - data[:-3] and get all the differences in a single array operation;

For step 2, you can use the result of array-based tests like numpy.argwhere(data > threshold) instead of testing every element inside a loop;
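As a small illustration of those two array operations (the function name and the toy data are mine, not from the question), all differences at an offset of dx samples and the positions where they exceed a threshold can be computed without a Python loop:

```python
import numpy as np

def large_changes(data, dx, threshold):
    """Return all differences data[k + dx] - data[k] and the indices k
    where the absolute difference exceeds `threshold`."""
    data = np.asarray(data, dtype=float)
    diffs = data[dx:] - data[:-dx]           # one array operation, length len(data) - dx
    idx = np.argwhere(np.abs(diffs) > threshold).ravel()
    return diffs, idx

data = np.array([0, 0, 0, 0, 5, 10, 15, 15, 15, 15], dtype=float)
diffs, idx = large_changes(data, dx=3, threshold=8)
print(diffs)
print(idx)
```

For the centred difference |y(x+dx)-y(x-dx)| from the problem statement, the same idea applies with data[2*dx:] - data[:-2*dx].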

Step 3 sounds conceptually wrong to me. You say that if the difference is too small, it will increase dx. But if the difference is small, the resulting slope would be small because it IS actually small. In that case a small value is the right result, and artificially increasing dx to get a "better" result might not be what you want. Well, it might actually be what you want, but you should consider this. I would suggest that you calculate the slope for a fixed dx across the whole data, and then use the resulting array of slopes to select your regions of interest (for example, using data_slope[numpy.argwhere(data_slope > minimum_slope)]).
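One way the fixed-dx idea could look, as a sketch only: `rolling_slope` is a hypothetical helper that computes the OLS slope in every window of 2*dx + 1 points from running sums built with np.convolve, and the selection step uses a boolean mask in place of argwhere. The test signal and the value of minimum_slope are made up for illustration.

```python
import numpy as np

def rolling_slope(time, data, dx):
    """OLS slope in every window of 2*dx + 1 consecutive points.
    Entry k of the result is the slope centred on sample k + dx."""
    w = 2 * dx + 1
    x = np.asarray(time, dtype=float)
    x = x - x.mean()                 # shifting x leaves slopes unchanged, reduces round-off
    y = np.asarray(data, dtype=float)
    box = np.ones(w)
    sx = np.convolve(x, box, mode="valid")       # windowed sum of x
    sy = np.convolve(y, box, mode="valid")       # windowed sum of y
    sxy = np.convolve(x * y, box, mode="valid")  # windowed sum of x*y
    sxx = np.convolve(x * x, box, mode="valid")  # windowed sum of x*x
    return (w * sxy - sx * sy) / (w * sxx - sx * sx)

# Flat segment followed by a ramp with slope 3.
time = np.arange(50.0)
data = np.where(time < 25, 0.0, 3.0 * (time - 25))
dx = 2
data_slope = rolling_slope(time, data, dx)

minimum_slope = 1.0
steep = np.flatnonzero(np.abs(data_slope) > minimum_slope) + dx  # indices into `data`
print(steep)
```

The running-sum formula can lose precision when the time values are large compared with the window width, which is why the sketch centres the time axis first.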

Hope this helps!
