DataFrame representation of a rolling window
I want a dataframe representation of a rolling window. Instead of performing some operation on a rolling window, I want a dataframe where the window is represented in another dimension. This could be a pd.Panel, an np.array, or a pd.DataFrame with a pd.MultiIndex.
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(10, 3).round(2),
columns=['A', 'B', 'C'],
index=list('abcdefghij'))
print(df)
A B C
a 0.44 0.41 0.46
b 0.47 0.46 0.02
c 0.85 0.82 0.78
d 0.76 0.93 0.83
e 0.88 0.93 0.72
f 0.12 0.15 0.20
g 0.44 0.10 0.28
h 0.61 0.09 0.84
i 0.74 0.87 0.69
j 0.38 0.23 0.44
For window = 2, I'd expect the result to be:
0 1
A B C A B C
a 0.44 0.41 0.46 0.47 0.46 0.02
b 0.47 0.46 0.02 0.85 0.82 0.78
c 0.85 0.82 0.78 0.76 0.93 0.83
d 0.76 0.93 0.83 0.88 0.93 0.72
e 0.88 0.93 0.72 0.12 0.15 0.20
f 0.12 0.15 0.20 0.44 0.10 0.28
g 0.44 0.10 0.28 0.61 0.09 0.84
h 0.61 0.09 0.84 0.74 0.87 0.69
i 0.74 0.87 0.69 0.38 0.23 0.44
I'm not set on this particular layout, but this is the information I want. I'm looking for the most efficient way to get at it.
I've experimented with using shift in varying ways, but it feels clunky. This is what I use to produce the output above:
print(pd.concat([df, df.shift(-1)], axis=1, keys=[0, 1]).dropna())
We could use NumPy to get views into those sliding windows with its esoteric strided tricks. If you are using this new dimension for some reduction like matrix multiplication, this would be ideal. If, for some reason, you want a 2D output, we need a reshape at the end, which will create a copy.
Thus, the implementation would look something like this:
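from numpy.lib.stride_tricks import as_strided as strided

def get_sliding_window(df, W, return2D=0):
    a = df.values
    s0, s1 = a.strides
    m, n = a.shape
    # Window k starts at row k, so the window axis reuses the row stride s0.
    out = strided(a, shape=(m-W+1, W, n), strides=(s0, s0, s1))
    if return2D == 1:
        # Flattening each window into a single row forces a copy.
        return out.reshape(a.shape[0]-W+1, -1)
    else:
        return out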
Sample run for 2D/3D output:
In [68]: df
Out[68]:
A B
0 0.44 0.41
1 0.46 0.47
2 0.46 0.02
3 0.85 0.82
4 0.78 0.76
In [70]: get_sliding_window(df, 3,return2D=1)
Out[70]:
array([[ 0.44, 0.41, 0.46, 0.47, 0.46, 0.02],
[ 0.46, 0.47, 0.46, 0.02, 0.85, 0.82],
[ 0.46, 0.02, 0.85, 0.82, 0.78, 0.76]])
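If the goal is the MultiIndex layout from the question rather than a bare array, the 2D result can be wrapped back into a DataFrame. The following is only a sketch: windows_as_frame is a hypothetical helper name, not part of pandas or NumPy, and it assumes a DataFrame with a single numeric dtype.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided

def windows_as_frame(df, W):
    """Hypothetical helper: 2D sliding-window layout with MultiIndex columns."""
    a = df.to_numpy()
    m, n = a.shape
    s0, s1 = a.strides
    # Same strided view as get_sliding_window; the reshape to 2D makes a copy.
    flat = as_strided(a, shape=(m - W + 1, W, n),
                      strides=(s0, s0, s1)).reshape(m - W + 1, -1)
    # Outer level: position inside the window; inner level: original column.
    cols = pd.MultiIndex.from_product([range(W), df.columns])
    # Each window is labelled by the index of its first row.
    return pd.DataFrame(flat, index=df.index[:m - W + 1], columns=cols)

df = pd.DataFrame([[0.44, 0.41], [0.46, 0.47], [0.46, 0.02],
                   [0.85, 0.82], [0.78, 0.76]], columns=['A', 'B'])
out = windows_as_frame(df, 3)
print(out)
```

Selecting out[1] then gives the second row of every window as an ordinary DataFrame with the original column names.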
Here's what the 3D view output looks like:
In [69]: get_sliding_window(df, 3,return2D=0)
Out[69]:
array([[[ 0.44, 0.41],
[ 0.46, 0.47],
[ 0.46, 0.02]],
[[ 0.46, 0.47],
[ 0.46, 0.02],
[ 0.85, 0.82]],
[[ 0.46, 0.02],
[ 0.85, 0.82],
[ 0.78, 0.76]]])
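As a rough illustration of the reduction use case mentioned earlier, the 3D view can be reduced along its window axis (axis 1) without ever materializing the windows. This sketch restates get_sliding_window in its 3D-only form so that it runs on its own; the random data and the weights are made up for the example.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided

def get_sliding_window(df, W):
    a = df.to_numpy()
    s0, s1 = a.strides
    m, n = a.shape
    return as_strided(a, shape=(m - W + 1, W, n), strides=(s0, s0, s1))

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 3)), columns=list('ABC'))
W = 3
win = get_sliding_window(df, W)          # shape (8, 3, 3): (window, position, column)

# Plain reduction over the window axis: a rolling mean.
rolling_mean = win.mean(axis=1)          # shape (8, 3)

# Weighted reduction (illustrative weights), contracting the window axis.
weights = np.array([0.2, 0.3, 0.5])
weighted = np.einsum('wkn,k->wn', win, weights)

# Cross-check the unweighted case against pandas' own rolling mean.
expected = df.rolling(W).mean().dropna().to_numpy()
print(np.allclose(rolling_mean, expected))
```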
Let's time the 3D view output for various window sizes:
In [331]: df = pd.DataFrame(np.random.rand(1000, 3).round(2))
In [332]: %timeit get_3d_shfted_array(df,2) # @Yakym Pirozhenko's soln
10000 loops, best of 3: 47.9 µs per loop
In [333]: %timeit get_sliding_window(df,2)
10000 loops, best of 3: 39.2 µs per loop
In [334]: %timeit get_3d_shfted_array(df,5) # @Yakym Pirozhenko's soln
10000 loops, best of 3: 89.9 µs per loop
In [335]: %timeit get_sliding_window(df,5)
10000 loops, best of 3: 39.4 µs per loop
In [336]: %timeit get_3d_shfted_array(df,15) # @Yakym Pirozhenko's soln
1000 loops, best of 3: 258 µs per loop
In [337]: %timeit get_sliding_window(df,15)
10000 loops, best of 3: 38.8 µs per loop
Let's verify that we are indeed getting views:
In [338]: np.may_share_memory(get_sliding_window(df,2), df.values)
Out[338]: True
The almost constant timings of get_sliding_window across window sizes show the huge benefit of getting a view instead of copying.
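A note that is not part of the original answer: NumPy 1.20 and later also ship numpy.lib.stride_tricks.sliding_window_view, which builds the same kind of view while checking the shape for you and returning a read-only array by default. The sketch below assumes such a NumPy version. The function appends the window axis last, so a transpose is needed to match the (windows, W, columns) layout used above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame(np.arange(15, dtype=float).reshape(5, 3), columns=list('ABC'))
a = df.to_numpy()
W = 2

# Window axis is appended at the end: shape (m - W + 1, n, W).
v = sliding_window_view(a, window_shape=W, axis=0)

# Reorder to (m - W + 1, W, n) to match get_sliding_window; still a view.
win = v.transpose(0, 2, 1)

print(win.shape)
print(np.shares_memory(win, a))   # no data was copied
print(win.flags.writeable)        # read-only unless writeable=True is requested
```

The read-only default guards against a common pitfall of strided views: the windows overlap in memory, so writing to one element would change several windows at once.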
Disclaimers:
First, I would not call the method you provide clunky. It is readable, and you can easily generalize it to any window size with a list comprehension. At the same time, this is a somewhat open-ended question that may have many solutions, including your own.
/Disclaimers
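For reference, the list-comprehension generalization mentioned in the disclaimer might look like the sketch below. The name shifted_windows is hypothetical, and the seed is arbitrary, so the data will not match the question's table.

```python
import numpy as np
import pandas as pd

def shifted_windows(df, window):
    # One shifted copy of the frame per position inside the window.
    shifted = [df.shift(-i) for i in range(window)]
    # keys become the outer column level: 0, 1, ..., window - 1.
    out = pd.concat(shifted, axis=1, keys=list(range(window)))
    # The last (window - 1) rows are incomplete windows. Caution: dropna
    # would also drop complete windows if df itself contains NaN values.
    return out.dropna()

np.random.seed(0)
df = pd.DataFrame(np.random.rand(10, 3).round(2),
                  columns=['A', 'B', 'C'], index=list('abcdefghij'))
res = shifted_windows(df, 3)
print(res)
```

If the data can contain genuine NaN values, slicing with out.iloc[:len(df) - window + 1] instead of calling dropna keeps those rows.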
Here is one other method that I think fits your description:
Use np.dstack on df.values. One benefit over your current approach is construction speed.
import pandas as pd
import numpy as np
from io import StringIO
df = pd.read_csv(StringIO(
'''
A B C
a 0.44 0.41 0.46
b 0.47 0.46 0.02
c 0.85 0.82 0.78
d 0.76 0.93 0.83
e 0.88 0.93 0.72
f 0.12 0.15 0.20
g 0.44 0.10 0.28
h 0.61 0.09 0.84
i 0.74 0.87 0.69
j 0.38 0.23 0.44
'''), sep=r' +')
window = 2
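def get_3d_shfted_array(df, window=window):
    rows = df.values
    m = len(rows)
    # Stack the shifted slices on a new last axis: shape (m - window + 1, n, window).
    # Pass a list: recent NumPy versions reject a generator here.
    res = np.dstack([rows[i:m - window + 1 + i] for i in range(window)])
    return res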
# 100000 loops, best of 3: 15.5 µs per loop
res = get_3d_shfted_array(df)
zero = res[...,0]
one = res[...,1]
# current method
def get_multiindexed_array(df, window=window):
    return pd.concat([df, df.shift(-1)], axis=1, keys=[0, 1]).dropna()
# 1000 loops, best of 3: 928 µs per loop