How to get a pandas DataFrame whose columns are the subsequent n elements of another DataFrame column?
A very simple example, just for understanding.
I have the following pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'A':pd.Series([1, 2, 13, 14, 25, 26, 37, 38])})
df
    A
0   1
1   2
2  13
3  14
4  25
5  26
6  37
7  38
Set n = 3.
How can I get a new DataFrame df1 (in an efficient way), like the following?
   D1  D2  D3   T
0   1   2  13  14
1   2  13  14  25
2  13  14  25  26
3  14  25  26  37
4  25  26  37  38
Hint: think of the first n columns as the data (Dx) and the last column as the target (T). In the first example, the target (e.g. 25) depends on the preceding n elements (2, 13, 14).
What if the target is some elements further ahead (e.g. +3)?
   D1  D2  D3   T
0   1   2  13  26
1   2  13  14  37
2  13  14  25  38
Thank you for your help,
Gilberto
P.S. If you think the title can be improved, please suggest how to modify it.
Thanks to @Divakar and this post, the rolling function can be defined as:
import numpy as np
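
def rolling(a, window):
    # Assumes a is a 1-D, contiguous array, so neighbouring elements are a.itemsize bytes apart.
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

a = np.arange(1000000000)  # 1 billion elements
b = rolling(a, 4)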
In less than 1 second!
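The speed comes from the fact that as_strided only builds a view and copies no data. Here is a minimal sketch of that behaviour on the sample data from the question. The names step, windows, shares and hits are only illustrative, and writeable=False is my own precaution against accidentally writing through the overlapping windows:

```python
import numpy as np

a = np.array([1, 2, 13, 14, 25, 26, 37, 38])
step = a.strides[0]  # bytes between consecutive elements of a

# Read-only windowed view: 5 rows of 4 elements, each row starting one element later.
windows = np.lib.stride_tricks.as_strided(
    a, shape=(a.size - 4 + 1, 4), strides=(step, step), writeable=False
)

# The view uses the same memory as a, so nothing was copied.
shares = np.shares_memory(a, windows)

# A change in a is visible in every window that covers that position.
a[3] = -1
hits = int((windows == -1).sum())
print(shares, hits)
```

Because the rows overlap in memory, call .copy() on the result before modifying it or passing it to code that might write to it.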
Let's see how we can solve it with NumPy tools. So, let's imagine you have the column data as a NumPy array; let's call it a. For such sliding-window operations, NumPy has a very efficient tool in strides, as they give views into the input array without actually making copies.
Let's directly use the methods with the sample data and start with case #1:
In [29]: a # Input data
Out[29]: array([ 1, 2, 13, 14, 25, 26, 37, 38])
In [30]: m = a.strides[0] # Get strides
In [31]: n = 3 # parameter
In [32]: nrows = a.size - n # Get number of rows in o/p
In [33]: a2D = np.lib.stride_tricks.as_strided(a,shape=(nrows,n+1),strides=(m,m))
In [34]: a2D
Out[34]:
array([[ 1, 2, 13, 14],
[ 2, 13, 14, 25],
[13, 14, 25, 26],
[14, 25, 26, 37],
[25, 26, 37, 38]])
In [35]: np.may_share_memory(a,a2D)
Out[35]: True # a2D is a view into a
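To go from a2D to the df1 asked for in the question, one option is to wrap the array in a DataFrame. This is only a sketch: the column names D1..Dn and T are taken from the question, and the .copy() is an assumption on my side, made so that pandas gets an ordinary array rather than overlapping memory.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 13, 14, 25, 26, 37, 38]})
n = 3

a = df['A'].to_numpy()
m = a.strides[0]

# Same windowed view as in case #1, created read-only.
a2D = np.lib.stride_tricks.as_strided(
    a, shape=(a.size - n, n + 1), strides=(m, m), writeable=False
)

# D1..Dn for the data columns, T for the target column.
columns = [f'D{i}' for i in range(1, n + 1)] + ['T']
df1 = pd.DataFrame(a2D.copy(), columns=columns)
print(df1)
```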
Case #2 would be similar, with an additional parameter for the Target column:
In [36]: n2 = 3 # Additional param
In [37]: nrows = a.size - n - n2 + 1
In [38]: part1 = np.lib.stride_tricks.as_strided(a,shape=(nrows,n),strides=(m,m))
In [39]: part1 # These are D1, D2, D3, etc.
Out[39]:
array([[ 1, 2, 13],
[ 2, 13, 14],
[13, 14, 25]])
In [43]: part2 = a[n+n2-1:] # This is target col
In [44]: part2
Out[44]: array([26, 37, 38])
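Both cases can be folded into one helper. The sketch below is not from the original answer: windowed_frame and horizon are hypothetical names, and it swaps as_strided for numpy.lib.stride_tricks.sliding_window_view, which assumes NumPy 1.20 or newer. horizon=1 reproduces case #1 and horizon=3 reproduces case #2.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def windowed_frame(values, n, horizon=1):
    """Return a DataFrame with n data columns (D1..Dn) and a target column T.

    The target sits `horizon` positions after the last data element.
    """
    a = np.asarray(values)
    nrows = a.size - n - horizon + 1
    if nrows <= 0:
        raise ValueError("input is too short for this n and horizon")
    data = sliding_window_view(a, n)[:nrows]   # read-only view, no copy yet
    target = a[n + horizon - 1:]               # same slice as part2 above
    out = pd.DataFrame(data.copy(), columns=[f'D{i}' for i in range(1, n + 1)])
    out['T'] = target
    return out

sample = [1, 2, 13, 14, 25, 26, 37, 38]
case1 = windowed_frame(sample, n=3, horizon=1)
case2 = windowed_frame(sample, n=3, horizon=3)
print(case2)
```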
I found another method: view_as_windows (from scikit-image)
import numpy as np
from skimage.util.shape import view_as_windows
window_shape = (4, )
aa = np.arange(1000000000) # 1 billion!
bb = view_as_windows(aa, window_shape)
bb
array([[ 0, 1, 2, 3],
[ 1, 2, 3, 4],
[ 2, 3, 4, 5],
...,
[999999994, 999999995, 999999996, 999999997],
[999999995, 999999996, 999999997, 999999998],
[999999996, 999999997, 999999998, 999999999]])
Around 1 second.
What do you think?