简体   繁体   中英

Fast conditional overlapping windowing (framing) of numpy array

I have a huge list of numpy arrays (1 dimensional), which are time series for different events. Each point has a label, and I want to window the numpy arrays based on its label. The labels I have is 0, 1, and 2. Each window has a fixed size M.

The label of each window will be the biggest label available in the window. So if a window consists of both 0 an 1 labeled datapoints, the label will be 1 for the whole window.

But the problem is that, the windowing is not label agnostic. Because of class imbalance, I want to only do overlapped windowing in case of labels 1 and 2.

So far I have written this code:

# conditional framing
data = []
start_cursor = 0
while start_cursor < arr.size:
  end_cursor = start_cursor + window_size
  data.append(
    {
      "frame": arr[start_cursor:end_cursor],
      "label": y[start_cursor:end_cursor].max(),
    }
  )
  start_cursor = end_cursor
  if np.any(y[start_cursor, end_cursor] != 0):
    start_cursor = start_cursor - overlap_size        

But this is clearly too verbose and just plain inefficient, especially because I will call this while loop on my huge list of separate arrays.

EDIT: to explain the problem more. Imagine you are going to window a signal with fixed length M. If there only exists 0 label points in the window, there will be no overlap between adjacent windows. But if there exists labels 1 and 2, there will be an overlap between two signals with percentage p%.

I think this does what you are asking to do. The visualization for checking isn't great, but it helps you see how the windowing works. Hopefully I understood your question right and this is what you are trying to do. Anytime there is a 1 or 2 in the time series (rather than a 0) the window steps forward some fraction of the full window length (here 50%). 窗口方法

To examine how to do this, start with a sample time series:

import matplotlib.pylab as plt
import numpy as np

N = 5000 # time series length

# create some sort of data set to work with
x = np.zeros(N)
# add a few 1s and 2s to the list (though really they are the same for the windowing)
y = np.random.random(N)
x[y < 0.01] = 1
x[y < 0.005] = 2

# assign a window length
M = 50 # window length
overlap = 0.5 # assume 50% overlap
M_overlap = int(M * (1-overlap))

My approach is to sum the window of interest over your time series. If the sum ==0 , there is no overlap between windows and if it is >0 then there is overlap. The question, then, becomes how should we calculate these sums efficiently? I compare two approaches. The first is simply to walk through the time series and the second is to use convolution (which is much faster). For the first one, I also explore different ways of assessing window size after summation.

Summation (slow version)

def window_sum1():
    # start of windows in list windows
    windows = [0,]
    while windows[-1] + M < N:
        check = sum(x[windows[-1]:windows[-1]+M]) == 0
        windows.append(windows[-1] + M_overlap + (M - M_overlap) * check)
        if windows[-1] + M > N:
            windows.pop()
            break
    # plotting stuff for checking
    return(windows)
Niter = 10**4
print(timeit.timeit(window_sum1, number = Niter))
# 29.201083058

So this approach went through 10,000 time series of length 5000 in about 30 seconds. But the line windows.append(windows[-1] + M_overlap + (M - M_overlap) * check) can be streamlined in an if statement.

Summation (fast version, 33% faster than slow version)

def window_sum2():
    # start of windows in list windows
    windows = [0,]
    while windows[-1] + M < N:
        check = sum(x[windows[-1]:windows[-1]+M]) == 0
        if check:
            windows.append(windows[-1] + M)
        else:
            windows.append(windows[-1] + M_overlap)
        if windows[-1] + M > N:
            windows.pop()
            break
    # plotting stuff for checking
    return(windows)
print(timeit.timeit(window_sum2, number = Niter))
# 20.456240447000003

We see a 1/3 reduction in time with the if statement.

Convolution (85% faster than fast summation)

We can use signal processing to get a lot faster, by convolving the time series with the window of interest using numpy.convolve . (Disclaimer: I got the idea from the accepted answer to this question .) Of course, it also makes sense to adopt the faster window size assessment from above.

def window_conv():
    a = np.convolve(x,np.ones(M,dtype=int),'valid')
    windows = [0,]
    while windows[-1] + M < N:
        if a[windows[-1]]:
            windows.append(windows[-1] + M_overlap)
        else:
            windows.append(windows[-1] + M)
        if windows[-1] + M > N:
            windows.pop()
            break
    return(windows)
print(timeit.timeit(window_conv, number = Niter))
#3.3695770570000008

Sliding window

The last thing I will add is that, as shown in one of the comments of this question , as of numpy 1.20 there is a function called sliding_window_view . I still have numpy 1.19 running and was not able to test it to see if it's faster than convolution.

At first, I think you should revise this line if np.any(y[start_cursor, end_cursor] != 0): to if np.any(y[start_cursor:end_cursor] != 0):

Any way, I think we can revise your code at some points.

Firstly, you can revise this part :

 if np.any(y[start_cursor: end_cursor] != 0):
        start_cursor = start_cursor - overlap_size

before these lines you have calculated y[start_cursor:end_cursor].max() so you know is there any label that is bigger than 0 or not. so this is a better one:

  if data[-1]['label'] != 0):
            start_cursor -= overlap_size

Although, the better way is that you set y[start_cursor:end_cursor].max() into the value for using for setting 'label' and checking "if expression"

Secondly, You used "append" for data. It is so inefficient. the best way is to allocate frames with zero (you have fix size for your frame and you know maximum number of frame is maxNumFrame=np.ceil((arr.size-overlap_size)/(window_size-overlap_size)) . So, you should initialize frames=np.zeros((maxNumFrame,window_size)) at the first step, then you change frames in the "while" or if you want to use your customized structure, you can initialize your list with zero value, then change values in "while"

Thirdly, the best way is that in a while you calculate "start_cursor" and y and set them into the array of tuple or 2 arrays. ("end_cursor" is redundant) After that, make the frames by using "map" in one of the ways that I said. (In one array or your customized structure)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM