
Dataframe representation of a rolling window

I want a dataframe representation of a rolling window. Instead of performing some operation on a rolling window, I want a dataframe where the window is represented in another dimension. This could be a pd.Panel, an np.array, or a pd.DataFrame with a pd.MultiIndex.

Setup

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(10, 3).round(2), 
                  columns=['A', 'B', 'C'],
                  index=list('abcdefghij'))

print(df)

      A     B     C
a  0.44  0.41  0.46
b  0.47  0.46  0.02
c  0.85  0.82  0.78
d  0.76  0.93  0.83
e  0.88  0.93  0.72
f  0.12  0.15  0.20
g  0.44  0.10  0.28
h  0.61  0.09  0.84
i  0.74  0.87  0.69
j  0.38  0.23  0.44

Expected Output

For a window = 2, I'd expect the result to be:

      0                 1            
      A     B     C     A     B     C
a  0.44  0.41  0.46  0.47  0.46  0.02
b  0.47  0.46  0.02  0.85  0.82  0.78
c  0.85  0.82  0.78  0.76  0.93  0.83
d  0.76  0.93  0.83  0.88  0.93  0.72
e  0.88  0.93  0.72  0.12  0.15  0.20
f  0.12  0.15  0.20  0.44  0.10  0.28
g  0.44  0.10  0.28  0.61  0.09  0.84
h  0.61  0.09  0.84  0.74  0.87  0.69
i  0.74  0.87  0.69  0.38  0.23  0.44

I'm not set on this particular layout, but this is the information I want. I'm looking for the most efficient way to get it.

What I've done so far

I've experimented with using shift in varying ways, but it feels clunky. This is what I use to produce the output above:

print(pd.concat([df, df.shift(-1)], axis=1, keys=[0, 1]).dropna())

We could use NumPy to get views into those sliding windows with its esoteric strided tricks. If you are using this new dimension for some reduction like matrix multiplication, this would be ideal. If for some reason you want a 2D output, we need a reshape at the end, which can force a copy.

Thus, the implementation would look something like this -

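from numpy.lib.stride_tricks import as_strided as strided

def get_sliding_window(df, W, return2D=0):
    a = df.values
    s0, s1 = a.strides              # byte steps along rows and columns
    m, n = a.shape
    # Window i starts at row i, so the first two axes both step by s0.
    out = strided(a, shape=(m-W+1, W, n), strides=(s0, s0, s1))
    if return2D == 1:
        # Flatten each window into a single row of length W*n.
        return out.reshape(m-W+1, -1)
    else:
        return out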
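
On NumPy 1.20 or later, the same windows can be had without computing strides by hand. This is only a sketch, assuming numpy.lib.stride_tricks.sliding_window_view is available; the helper name get_sliding_window_swv is made up for illustration and is not part of the answer above. That function puts the window axis last, so a transpose is needed to match the (m-W+1, W, n) layout.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def get_sliding_window_swv(df, W):
    # Hypothetical helper, not part of the original answer.
    a = df.to_numpy()
    # Sliding along axis 0 yields shape (m-W+1, n, W): the window axis comes last.
    windows = sliding_window_view(a, W, axis=0)
    # Reorder to (m-W+1, W, n) so that windows[i] holds rows i .. i+W-1.
    return windows.transpose(0, 2, 1)

df = pd.DataFrame(np.arange(10.0).reshape(5, 2), columns=['A', 'B'])
out = get_sliding_window_swv(df, 3)
print(out.shape)   # (3, 3, 2)
print(out[1])      # rows 1, 2 and 3 of df
```

The result is a read-only view by default, which avoids accidental writes through overlapping windows.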

Sample run for 2D/3D output -

In [68]: df
Out[68]: 
      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

In [70]: get_sliding_window(df, 3,return2D=1)
Out[70]: 
array([[ 0.44,  0.41,  0.46,  0.47,  0.46,  0.02],
       [ 0.46,  0.47,  0.46,  0.02,  0.85,  0.82],
       [ 0.46,  0.02,  0.85,  0.82,  0.78,  0.76]])
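
The 2D array drops the row and column labels. If the labelled layout from the question is wanted, one way to put them back is sketched below. The helper name label_windows is hypothetical, and the flattened array is built here with np.hstack purely as a self-contained stand-in for the 2D output of a sliding-window function.

```python
import numpy as np
import pandas as pd

def label_windows(flat, df, W):
    # Hypothetical helper: wraps a flattened (m-W+1, W*n) array in a DataFrame
    # whose columns are (window position, original column) pairs.
    columns = pd.MultiIndex.from_product([range(W), df.columns])
    index = df.index[:len(df) - W + 1]
    return pd.DataFrame(flat, index=index, columns=columns)

df = pd.DataFrame(np.arange(10.0).reshape(5, 2), columns=['A', 'B'],
                  index=list('abcde'))
W = 2
a = df.to_numpy()
m = len(df)
# Stand-in for the flattened windows: slice k holds rows k .. k+m-W.
flat = np.hstack([a[k:m - W + 1 + k] for k in range(W)])

wide = label_windows(flat, df, W)
print(wide)
```

Row 'b' of the result then holds rows 'b' and 'c' of the original frame side by side, under the top-level column keys 0 and 1.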

Here's what the 3D view output looks like -

In [69]: get_sliding_window(df, 3,return2D=0)
Out[69]: 
array([[[ 0.44,  0.41],
        [ 0.46,  0.47],
        [ 0.46,  0.02]],

       [[ 0.46,  0.47],
        [ 0.46,  0.02],
        [ 0.85,  0.82]],

       [[ 0.46,  0.02],
        [ 0.85,  0.82],
        [ 0.78,  0.76]]])

Let's time the 3D view output for various window sizes -

In [331]: df = pd.DataFrame(np.random.rand(1000, 3).round(2))

In [332]: %timeit get_3d_shfted_array(df,2) # @Yakym Pirozhenko's soln
10000 loops, best of 3: 47.9 µs per loop

In [333]: %timeit get_sliding_window(df,2)
10000 loops, best of 3: 39.2 µs per loop

In [334]: %timeit get_3d_shfted_array(df,5) # @Yakym Pirozhenko's soln
10000 loops, best of 3: 89.9 µs per loop

In [335]: %timeit get_sliding_window(df,5)
10000 loops, best of 3: 39.4 µs per loop

In [336]: %timeit get_3d_shfted_array(df,15) # @Yakym Pirozhenko's soln
1000 loops, best of 3: 258 µs per loop

In [337]: %timeit get_sliding_window(df,15)
10000 loops, best of 3: 38.8 µs per loop

Let's verify that we are indeed getting views -

In [338]: np.may_share_memory(get_sliding_window(df,2), df.values)
Out[338]: True

The almost constant timings with get_sliding_window, even across various window sizes, suggest the huge benefit of getting a view instead of copying.
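
To make the "view" part concrete, here is a small sketch on a plain NumPy array rather than the DataFrame above. A change to the base array shows up in every window that contains that row, because no data is copied. Passing writeable=False is my own addition here, to guard against writes through overlapping windows.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(10.0).reshape(5, 2)
W = 2
m, n = a.shape
s0, s1 = a.strides
# writeable=False guards against accidental writes through overlapping windows.
view = as_strided(a, shape=(m - W + 1, W, n), strides=(s0, s0, s1),
                  writeable=False)

a[1, 0] = -1.0          # change row 1 of the base array
print(view[0, 1, 0])    # -1.0: row 1 seen as the second row of window 0
print(view[1, 0, 0])    # -1.0: row 1 seen as the first row of window 1
```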

Disclaimers:

First, I would not call the method you provide clunky. It is readable, and you can easily generalize it with a list comprehension to any window size. At the same time, this is a somewhat open-ended question that may have many solutions, including your own.

/Disclaimers

Here is one other method that I think qualifies under your description:

Use np.dstack on df.values. One benefit over the existing approach is construction speed.

import pandas as pd
import numpy as np
from io import StringIO

df = pd.read_csv(StringIO(
'''
      A     B     C
a  0.44  0.41  0.46
b  0.47  0.46  0.02
c  0.85  0.82  0.78
d  0.76  0.93  0.83
e  0.88  0.93  0.72
f  0.12  0.15  0.20
g  0.44  0.10  0.28
h  0.61  0.09  0.84
i  0.74  0.87  0.69
j  0.38  0.23  0.44
'''), sep=r'\s+')  # whitespace-delimited; the first column becomes the index


window = 2

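def get_3d_shfted_array(df, window=window):
    rows = df.values
    m = len(rows)
    # Slice i holds rows i .. i+m-window, so every slice has m-window+1 rows.
    # np.dstack needs a sequence (a list here), not a generator.
    res  = np.dstack([rows[i:m-window+1+i] for i in range(window)])
    return res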
# 100000 loops, best of 3: 15.5 µs per loop

res  = get_3d_shfted_array(df)
zero = res[...,0]
one  = res[...,1]


# current method, generalized to any window size
def get_multiindexed_array(df, window=window):
    frames = [df.shift(-i) for i in range(window)]
    return pd.concat(frames, axis=1, keys=list(range(window))).dropna()
# 1000 loops, best of 3: 928 µs per loop
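
For comparison with the strided answer, note that np.dstack puts the window axis last, while the strided version puts it in the middle. A short sketch of converting between the two layouts follows, using a small made-up array instead of the DataFrame above.

```python
import numpy as np

rows = np.arange(12.0).reshape(6, 2)
window = 3
m = len(rows)

# Depth-stacked layout: shape (m-window+1, n, window);
# res[..., k] is the frame shifted by k rows.
res = np.dstack([rows[k:m - window + 1 + k] for k in range(window)])
# Move the window axis to the middle: shape (m-window+1, window, n).
by_window = np.moveaxis(res, -1, 1)

print(res.shape)        # (4, 2, 3)
print(by_window.shape)  # (4, 3, 2)
print(np.array_equal(by_window[0], rows[:window]))  # True
```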
