Pandas: increase speed of rolling window (apply a custom function)
I'm using this code to apply a function (funcX) to my data frame over a rolling window. The main issue is that the data frame (data) is very large, and I'm looking for a faster way to do this.
import numpy as np

def funcX(x):
    x = np.sort(x)
    xd = np.delete(x, 25)  # drop the middle value of the sorted 51-element window
    med = np.median(xd)
    return (np.abs(x - med)).mean() + med

med_out = data.var1.rolling(window=51, center=True).apply(funcX, raw=True)
The only reason for using this function is that the median is computed after removing the middle value, so the result differs from simply calling .median() on the rolling window.
To be efficient, a rolling-window algorithm must reuse the result of one window when computing the next, overlapping one.
Here, with med0 the median of the window x, med the median of x \ med0 (the window without its middle value), xl the elements before med0 and xg the elements after med0 in sorted order, funcX(x) can be written as:
<|x-med|> + med = [sum(xg) - sum(xl) + |med0-med|] / windowsize + med
The med terms cancel because xl and xg contain the same number of elements, and the middle value contributes |med0-med|.
So the idea is to maintain a buffer holding the current window in sorted order, together with sum(xg) and sum(xl). With Numba just-in-time compilation, this gives very good performance.
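The partial-sum form of funcX can be checked numerically on a single window. The following is only a sketch using the standard library; the helper names funcx_direct and funcx_from_sums are made up for this illustration, and it assumes the |med0-med| term enters with a plus sign, which is the sign used in func further down.

```python
import random
import statistics

def funcx_direct(window):
    """Direct evaluation, mirroring the original funcX with plain lists."""
    x = sorted(window)
    half = len(x) // 2
    xd = x[:half] + x[half + 1:]           # drop the middle element
    med = statistics.median(xd)            # mean of the two central values of xd
    return sum(abs(v - med) for v in x) / len(x) + med

def funcx_from_sums(window):
    """Same value computed from the two partial sums and the two medians."""
    x = sorted(window)
    n = len(x)
    half = n // 2
    med0 = x[half]                         # median of the full window
    med = (x[half - 1] + x[half + 1]) / 2  # median once med0 is removed
    sum_xl = sum(x[:half])                 # elements before med0
    sum_xg = sum(x[half + 1:])             # elements after med0
    return (sum_xg - sum_xl + abs(med0 - med)) / n + med

random.seed(0)
window = [random.random() for _ in range(51)]
direct = funcx_direct(window)
from_sums = funcx_from_sums(window)
print(abs(direct - from_sums) < 1e-12)
```

Once the window is kept sorted and the two sums are kept up to date, the second form needs only a handful of arithmetic operations per window.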
First, the buffer management:
init sorts the first window and computes the left (xls) and right (xgs) sums.
import numpy as np
import numba

windowsize = 51  # odd, > 1
halfsize = windowsize // 2

@numba.njit
def init(firstwindow):
    buffer = np.sort(firstwindow)
    xls = buffer[:halfsize].sum()   # sum of the halfsize smallest values
    xgs = buffer[-halfsize:].sum()  # sum of the halfsize largest values
    return buffer, xls, xgs
shift is the linear part. It updates the buffer while keeping it sorted. np.searchsorted finds the insertion and deletion positions in O(log(windowsize)). It's fiddly, since xin<xout and xout<xin are not symmetrical situations.
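@numba.njit
def shift(buffer, xin, xout):
    # Replace xout by xin in the sorted buffer, in place.
    i_in = np.searchsorted(buffer, xin)
    i_out = np.searchsorted(buffer, xout)
    if xin <= xout:
        # xin lands at or before xout: move buffer[i_in:i_out] one slot right.
        buffer[i_in+1:i_out+1] = buffer[i_in:i_out]
        buffer[i_in] = xin
    else:
        # xin lands after xout: move buffer[i_out+1:i_in] one slot left.
        buffer[i_out:i_in-1] = buffer[i_out+1:i_in]
        buffer[i_in-1] = xin
    return i_in, i_out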
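The same "remove one value, insert another, stay sorted" step can be cross-checked against a plain-Python version built on the standard bisect module. This is only an illustrative sketch, not the Numba code: replace_in_sorted is a hypothetical helper, and it works on a list rather than a NumPy array.

```python
import bisect
import random

def replace_in_sorted(buffer, xin, xout):
    """Remove one occurrence of xout from the sorted list, then insert xin."""
    i_out = bisect.bisect_left(buffer, xout)  # binary search for the leaving value
    del buffer[i_out]                         # linear: shifts the tail left
    bisect.insort_left(buffer, xin)           # binary search, then linear shift

random.seed(1)
data = [random.random() for _ in range(200)]
size = 51
buffer = sorted(data[:size])

always_sorted = True
for i in range(size, len(data)):
    replace_in_sorted(buffer, data[i], data[i - size])
    # After each step the buffer must equal the sorted current window.
    if buffer != sorted(data[i - size + 1:i + 1]):
        always_sorted = False
print(always_sorted)
```

The list version moves elements twice per step, whereas shift moves only the slice between the two positions, but both keep the cost per step linear in the window size at worst.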
update updates the buffer and the sums of the left and right parts. Here too, the code is fiddly because xin<xout and xout<xin are not symmetrical situations.
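@numba.njit
def update(buffer, xls, xgs, xin, xout):
    # Values around the middle before the buffer is modified.
    xl, x0, xg = buffer[halfsize-1:halfsize+2]
    i_in, i_out = shift(buffer, xin, xout)
    # Left sum: xout leaves the left part, or xin pushes xl out of it.
    if i_out < halfsize:
        xls -= xout
        if i_in <= halfsize:
            xls += xin
        else:
            xls += x0
    elif i_in < halfsize:
        xls += xin - xl
    # Right sum: xout leaves the right part, or xin pushes xg out of it.
    if i_out > halfsize:
        xgs -= xout
        if i_in > halfsize:
            xgs += xin
        else:
            xgs += x0
    elif i_in > halfsize+1:
        xgs += xin - xg
    return buffer, xls, xgs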
func is equivalent to the original funcX, applied to the buffer. It runs in O(1).
@numba.njit
def func(buffer, xls, xgs):
    med0 = buffer[halfsize]
    med = (buffer[halfsize-1] + buffer[halfsize+1]) / 2
    if med0 > med:
        return (xgs - xls + med0 - med) / windowsize + med
    else:
        return (xgs - xls + med - med0) / windowsize + med
med is the top-level function. It runs in O(data.size * windowsize).
@numba.njit
def med(data):
    res = np.full_like(data, np.nan)
    state = init(data[:windowsize])
    res[halfsize] = func(*state)
    for i in range(windowsize, data.size):
        xin, xout = data[i], data[i - windowsize]
        state = update(*state, xin, xout)
        res[i-halfsize] = func(*state)  # result is stored at the window center
    return res
Performance:
import pandas
data=pandas.DataFrame(np.random.rand(10**5))
%time res1=data[0].rolling(window = windowsize, center = True).apply(funcX, raw = True)
Wall time: 10.8 s
res2=med(data[0].values)
np.allclose((res1-res2)[halfsize:-halfsize],0)
Out[112]: True
%timeit res2=med(data[0].values)
40.4 ms ± 462 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It's about 250x faster with window size = 51. An hour becomes roughly 15 seconds.
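If Numba is not available, a different technique can give a NumPy-only version: sort every window at once with sliding_window_view (NumPy 1.20 or later). This is not the incremental algorithm above and it has not been timed here. The helper name med_vectorized is made up for this sketch, and it assumes a 1-D float array without NaNs. It trades memory for simplicity, since it materializes one sorted row per window.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def funcX(x):
    # The per-window function from the question (window of 51), used as reference.
    x = np.sort(x)
    xd = np.delete(x, 25)
    med = np.median(xd)
    return (np.abs(x - med)).mean() + med

def med_vectorized(values, windowsize=51):
    """Hypothetical helper: centered rolling funcX without a Python-level loop."""
    half = windowsize // 2
    # One sorted row per window: (n - windowsize + 1) x windowsize floats in memory.
    windows = np.sort(sliding_window_view(values, windowsize), axis=1)
    # Median after dropping the middle value = mean of its two neighbours.
    med = (windows[:, half - 1] + windows[:, half + 1]) / 2
    out = np.full(values.shape, np.nan)
    out[half:values.size - half] = np.abs(windows - med[:, None]).mean(axis=1) + med
    return out

rng = np.random.default_rng(0)
values = rng.random(300)
result = med_vectorized(values)
# Compare against funcX applied window by window.
expected = np.array([funcX(values[i:i + 51]) for i in range(values.size - 50)])
matches = np.allclose(result[25:-25], expected)
print(matches)
```

For 10**5 points and a window of 51 the intermediate array holds about 5 million floats, so this approach only suits data that fits comfortably in memory.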