How to get a pandas dataframe whose columns are the subsequent n elements of another dataframe's column?

Question

A very simple example, just for understanding.

I have the following pandas dataframe:

import pandas as pd
df = pd.DataFrame({'A':pd.Series([1, 2, 13, 14, 25, 26, 37, 38])})
df 
        A
    0   1
    1   2
    2  13
    3  14
    4  25
    5  26
    6  37
    7  38

Set n = 3

First example

How can I get a new dataframe df1 (in an efficient way) like the following:

   D1  D2  D3     T
0   1   2  13    14
1   2  13  14    25
2  13  14  25    26
3  14  25  26    37
4  25  26  37    38

Hint: think of the first n columns as the data (Dx) and the last column as the target (T). In the first example, the target (e.g. 25) depends on the preceding n elements (2, 13, 14).

Second example

What if the target is some elements ahead (e.g. +3)?

   D1  D2  D3     T
0   1   2  13    26
1   2  13  14    37
2  13  14  25    38

Thank you for your help,
Gilberto

PS: If you think the title can be improved, please suggest how to modify it.

Update

Thanks to @Divakar and this post, the rolling function can be defined as:

import numpy as np
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

a = np.arange(1000000000)
b = rolling(a, 4)

In less than 1 second!
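The speed comes from the fact that no data is copied: the result only reinterprets the existing buffer. As a minimal sketch, and assuming NumPy 1.20 or later is available (an assumption, since the post does not state a version), the same windows can be obtained with `sliding_window_view`, which checks the shape for you and returns a read-only view:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20

a = np.array([1, 2, 13, 14, 25, 26, 37, 38])

# One row per window of 4 consecutive elements; no data is copied.
windows = sliding_window_view(a, 4)

print(windows.shape)                 # number of windows x window length
print(np.shares_memory(a, windows))  # the windows reuse a's buffer
print(windows.flags.writeable)       # the view is read-only by default
```

With `as_strided`, a wrong shape or stride can silently read past the end of the buffer, so the checked variant is worth considering when the NumPy version allows it.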

Answer 1

Let's see how we can solve it with NumPy tools. Assume you have the column data as a NumPy array; let's call it a. For such sliding-window operations, NumPy offers a very efficient tool in strides, because the result is a view into the input array and no copy is made.
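To make the "view, not copy" point concrete, here is a small sketch (the variable names `step` and `windows` are illustrative, not from the original answer). Changing one element of the input is visible in every window that covers it:

```python
import numpy as np

a = np.array([1, 2, 13, 14, 25, 26, 37, 38])
step = a.strides[0]  # bytes between consecutive elements of a

# Windows of length 3; both axes advance by one element.
# writeable=False guards against accidental writes through the view.
windows = np.lib.stride_tricks.as_strided(
    a, shape=(a.size - 2, 3), strides=(step, step), writeable=False
)

a[1] = 99          # modify the input array in place
print(windows[0])  # first window now contains 99
print(windows[1])  # so does the second window
```

Because overlapping windows share memory, call `.copy()` on the result before handing it to code that might modify it.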

Let's apply this directly to the sample data, starting with case #1:

In [29]: a  # Input data
Out[29]: array([ 1,  2, 13, 14, 25, 26, 37, 38])

In [30]: m = a.strides[0] # Get strides

In [31]: n = 3 # parameter

In [32]: nrows = a.size - n # Get number of rows in o/p

In [33]: a2D = np.lib.stride_tricks.as_strided(a,shape=(nrows,n+1),strides=(m,m))

In [34]: a2D
Out[34]: 
array([[ 1,  2, 13, 14],
       [ 2, 13, 14, 25],
       [13, 14, 25, 26],
       [14, 25, 26, 37],
       [25, 26, 37, 38]])

In [35]: np.may_share_memory(a,a2D) 
Out[35]: True    # a2D is a view into a
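The question asks for a dataframe rather than an array. A minimal sketch of that last step, assuming the column names D1..Dn and T from the question and copying the view so the dataframe owns its data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 13, 14, 25, 26, 37, 38]})
n = 3

a = df['A'].to_numpy()
m = a.strides[0]  # byte step between consecutive elements

# Each row holds n data values followed by the next value as target.
a2D = np.lib.stride_tricks.as_strided(
    a, shape=(a.size - n, n + 1), strides=(m, m), writeable=False
)

cols = [f'D{i}' for i in range(1, n + 1)] + ['T']
df1 = pd.DataFrame(a2D.copy(), columns=cols)  # copy: detach from the view
print(df1)
```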

Case #2 is similar, with an additional parameter for the Target column:

In [36]: n2 = 3 # Additional param

In [37]: nrows = a.size - n - n2 + 1

In [38]: part1 = np.lib.stride_tricks.as_strided(a,shape=(nrows,n),strides=(m,m))

In [39]: part1 # These are D1, D2, D3, etc.
Out[39]: 
array([[ 1,  2, 13],
       [ 2, 13, 14],
       [13, 14, 25]])

In [43]: part2 = a[n+n2-1:] # This is target col

In [44]: part2
Out[44]: array([26, 37, 38])
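Both cases can be folded into one helper. The function name `make_supervised` and its `ahead` parameter are hypothetical names chosen for this sketch; `ahead=1` reproduces the first example and `ahead=3` the second:

```python
import numpy as np
import pandas as pd

def make_supervised(values, n, ahead=1):
    """Build D1..Dn from n consecutive values and T from the value
    `ahead` positions after the last data value (hypothetical helper)."""
    a = np.ascontiguousarray(values)
    nrows = a.size - n - ahead + 1
    if nrows <= 0:
        raise ValueError("not enough values for the requested n and ahead")
    m = a.strides[0]
    data = np.lib.stride_tricks.as_strided(
        a, shape=(nrows, n), strides=(m, m), writeable=False
    )
    out = pd.DataFrame(data.copy(), columns=[f'D{i}' for i in range(1, n + 1)])
    out['T'] = a[n + ahead - 1:]  # same length as nrows by construction
    return out

a = np.array([1, 2, 13, 14, 25, 26, 37, 38])
print(make_supervised(a, n=3, ahead=3))
```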

Answer 2

I found another method: view_as_windows

import numpy as np
from skimage.util.shape import view_as_windows
window_shape = (4, )

aa = np.arange(1000000000) # 1 billion!
bb = view_as_windows(aa, window_shape)
bb

array([[        0,         1,         2,         3],
       [        1,         2,         3,         4],
       [        2,         3,         4,         5],
       ..., 
       [999999994, 999999995, 999999996, 999999997],
       [999999995, 999999996, 999999997, 999999998],
       [999999996, 999999997, 999999998, 999999999]])

Around 1 second.

What do you think?

2 answers

Answer 1: score 2, accepted, 2016-10-24 12:02:41

Answer 2: score 0, 2016-10-25 11:44:00
