Pandas: increase speed of rolling window (apply a custom function)

I'm using this code to apply a function (funcX) to my DataFrame with a rolling window. The main issue is that this DataFrame (data) is very large, and I'm looking for a faster way to do this task.

import numpy as np

def funcX(x):
    x = np.sort(x)
    xd = np.delete(x, 25)
    med = np.median(xd)
    return (np.abs(x - med)).mean() + med

med_out = data.var1.rolling(window = 51, center = True).apply(funcX, raw = True)

The only reason for using this function is that the median is computed after removing the middle value of the window, so the result differs from simply calling .median() on the rolling window.

To be efficient, a sliding-window algorithm must reuse the result of one window when computing the next, overlapping window.

Here, let med0 be the median of the window x, med the median of x \ {med0} (the window with its middle element removed), xl the elements below med and xg the elements above med in the sorted window (med0 excluded). Then funcX(x) can be written as:

<|x-med|> + med = [sum(xg) - sum(xl) + |med0-med|] / windowsize + med

So the idea is to maintain a buffer holding the sorted current window, together with sum(xg) and sum(xl). With Numba just-in-time compilation, this gives very good performance.
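The closed form for funcX given above can be checked numerically before building on it. The sketch below uses only the standard library; the helper names funcx_direct and funcx_closed_form are made up for this illustration and are not part of the answer's code.

```python
import random
import statistics

def funcx_direct(window):
    """Reference: mean absolute deviation from med, plus med (mirrors funcX)."""
    xs = sorted(window)
    half = len(xs) // 2
    rest = xs[:half] + xs[half + 1:]          # drop the middle element
    med = statistics.median(rest)
    return sum(abs(v - med) for v in xs) / len(xs) + med

def funcx_closed_form(window):
    """Same value from the two half sums and the |med0 - med| term."""
    xs = sorted(window)
    n = len(xs)
    half = n // 2
    med0 = xs[half]                           # median of the full window
    med = (xs[half - 1] + xs[half + 1]) / 2   # median without the middle element
    sum_xl = sum(xs[:half])                   # elements below the middle
    sum_xg = sum(xs[half + 1:])               # elements above the middle
    return (sum_xg - sum_xl + abs(med0 - med)) / n + med

random.seed(0)
window = [random.random() for _ in range(51)]
direct = funcx_direct(window)
closed = funcx_closed_form(window)
print(abs(direct - closed) < 1e-12)
```

The two halves contain the same number of elements, so the med terms cancel and only the two sums and the distance of the middle element to med remain.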

First, the buffer management:

init sorts the first window and computes the left (xls) and right (xgs) sums.

import numpy as np
import numba
windowsize = 51 #odd, >1
halfsize = windowsize//2

@numba.njit
def init(firstwindow):
    buffer = np.sort(firstwindow)
    xls = buffer[:halfsize].sum()
    xgs = buffer[-halfsize:].sum()   
    return buffer,xls,xgs

shift is the linear part. It updates the buffer while keeping it sorted. np.searchsorted finds the insertion and deletion positions in O(log(windowsize)); moving the elements in between is O(windowsize). It's technical since xin<xout and xout<xin are not symmetrical situations.

@numba.njit
def shift(buffer,xin,xout):
    i_in = np.searchsorted(buffer,xin) 
    i_out = np.searchsorted(buffer,xout)
    if xin <= xout :
        buffer[i_in+1:i_out+1] = buffer[i_in:i_out] 
        buffer[i_in] = xin                        
    else:
        buffer[i_out:i_in-1] = buffer[i_out+1:i_in]                      
        buffer[i_in-1] = xin
    return i_in, i_out
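The same remove-then-insert idea can be sketched without Numba, using the standard bisect module on a plain list. This is only an illustration of keeping a window sorted while it slides, not the answer's implementation (which does both steps with a single slice move); the name slide is hypothetical.

```python
from bisect import bisect_left, insort

def slide(sorted_window, x_in, x_out):
    """Replace x_out by x_in in a sorted list, keeping it sorted."""
    i_out = bisect_left(sorted_window, x_out)  # position of the leaving value
    del sorted_window[i_out]                   # O(window) shift
    insort(sorted_window, x_in)                # binary search + O(window) shift
    return sorted_window

data = [5.0, 1.0, 4.0, 2.0, 3.0, 9.0, 0.5]
size = 5
window = sorted(data[:size])
for i in range(size, len(data)):
    slide(window, data[i], data[i - size])
    # The buffer must always equal the sorted current window.
    assert window == sorted(data[i - size + 1:i + 1])
```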

update updates the buffer and the sums of the left and right parts. Here again, xin<xout and xout<xin are not symmetrical situations, so each side of the median is handled by its own case analysis.

@numba.njit
def update(buffer,xls,xgs,xin,xout):
    xl,x0,xg = buffer[halfsize-1:halfsize+2]
    i_in,i_out = shift(buffer,xin,xout)

    if i_out < halfsize:
        xls -= xout
        if i_in <= halfsize:
            xls += xin
        else:    
            xls += x0
    elif i_in < halfsize:
        xls += xin - xl

    if i_out > halfsize:
        xgs -= xout
        if i_in > halfsize:
            xgs += xin
        else:    
            xgs += x0
    elif i_in > halfsize+1:
        xgs += xin - xg

    return buffer, xls, xgs

func computes the same value as the original funcX, using only the buffer and the two sums, in O(1).

@numba.njit       
def func(buffer,xls,xgs):
    med0 = buffer[halfsize]
    med  = (buffer[halfsize-1] + buffer[halfsize+1])/2
    if med0 > med:
        return (xgs-xls+med0-med) / windowsize + med
    else:               
        return (xgs-xls+med-med0) / windowsize + med    

med is the driver function. Its overall cost is O(data.size * windowsize).

@numba.njit
def med(data):
    res = np.full_like(data, np.nan)
    state = init(data[:windowsize])
    res[halfsize] = func(*state)
    for i in range(windowsize, data.size):
        xin,xout = data[i], data[i - windowsize]
        state = update(*state, xin, xout)
        res[i-halfsize] = func(*state)
    return res 
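For validating med on smaller inputs, a vectorized NumPy reference may be handy. This is an alternative baseline rather than part of the algorithm above; it assumes NumPy 1.20 or later (for sliding_window_view) and a 1-D float array, and the name rolling_reference is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_reference(values, windowsize=51):
    """Vectorized funcX over every full window; edges are left as NaN."""
    halfsize = windowsize // 2
    # One row per full window, each row sorted (np.sort returns a copy).
    windows = np.sort(sliding_window_view(values, windowsize), axis=1)
    # Median after dropping the middle element of each sorted window.
    med = (windows[:, halfsize - 1] + windows[:, halfsize + 1]) / 2
    out = np.full(values.shape, np.nan)
    out[halfsize:values.size - halfsize] = (
        np.abs(windows - med[:, None]).mean(axis=1) + med
    )
    return out

rng = np.random.default_rng(0)
sample = rng.random(500)
ref = rolling_reference(sample)
```

It materializes an array of shape (number of windows, windowsize), so it is better suited to checking results than to replacing the incremental version on very large data.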

Performance:

import pandas
data=pandas.DataFrame(np.random.rand(10**5))

%time res1=data[0].rolling(window = windowsize, center = True).apply(funcX, raw = True)
Wall time: 10.8 s

res2=med(data[0].values)

np.allclose((res1-res2)[halfsize:-halfsize],0)
Out[112]: True

%timeit res2=med(data[0].values)
40.4 ms ± 462 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is roughly 250× faster with window size 51: an hour becomes about 15 seconds.
