Create an array of arrays from a dataframe with multiple multivariate time series in Python

Question

I need to create an array of arrays from the dataframe:

HR    sBP   dBP  T     ID
101   51    81   37.1  P1.1
102   52    82   37.2  P1.1
103   53    83   37.3  P1.1
104   54    84   37.4  P1.1
105   55    85   37.5  P1.1
210   65    90   36.1  P1.2
210   65    90   36.2  P1.2
210   65    90   36.3  P1.2
210   65    90   36.4  P1.2
210   65    90   36.5  P1.2
...
100   50    75   37    Pm.n
100   50    75   37    Pm.n
...
100   50    60   37.0  P1500.6
100   50    60   37.0  P1500.6
100   50    60   37.0  P1500.6
100   50    60   37.0  P1500.6
100   50    60   37.0  P1500.6

where each chunk is a multivariate time series with HR, sBP, dBP and T° as variables, and the ID variable is the label for each subseries of data from each patient. The chunks for each patient are of variable length. I need to end up with an array like this:

array([[[101,    51,    81,    37.1],
        [102,    52,    82,    37.2],
        [103,    53,    83,    37.3],
        [104,    54,    84,    37.4],
        [105,    55,    85,    37.5]],

       [[210,    65,    90,    36.1],
        [210,    65,    90,    36.2],
        [210,    65,    90,    36.3],
        [210,    65,    90,    36.4],
        [210,    65,    90,    36.5]],

      ...

       [[100,    50,    60,    37.0], 
        [100,    50,    60,    37.0],
        [100,    50,    60,    37.0],  
        [100,    50,    60,    37.0],
        [100,    50,    60,    37.0]]])

With array.shape = (number of unique IDs, length of arrays, number of dimensions)

My code looks like this:

df_grp = df.groupby('ID')

for name, gp in df_grp:
    if name == 'P1.1':
        arr = gp.drop(columns = ['ID']).to_numpy().reshape(-1,4)  

    else:
        temp_arr = gp.drop(columns = ['ID']).to_numpy().reshape(-1,4)  
        arr = np.append(arr, temp_arr, axis=0)

But it gives me an array like this:

array ([[101,    51,    81,    37.1],
        [102,    52,    82,    37.2],
        [103,    53,    83,    37.3],
        [104,    54,    84,    37.4],
        [105,    55,    85,    37.5],
        [210,    65,    90,    36.1],
        [210,    65,    90,    36.2],
        [210,    65,    90,    36.3],
        [210,    65,    90,    36.4],
        [210,    65,    90,    36.5],

      ...

        [100,    50,    60,    37.0], 
        [100,    50,    60,    37.0],
        [100,    50,    60,    37.0],  
        [100,    50,    60,    37.0],
        [100,    50,    60,    37.0]])

With array.shape = (number of rows in df, number of dimensions). The result is the same with or without reshape, and also with squeeze. I need the array in the format described above so that I can use it with the tslearn package for multivariate time series clustering. Any help is greatly appreciated.

Answer 1

I think you are looking for this:

arr = df.set_index('ID').groupby('ID').apply(pd.DataFrame.to_numpy).to_numpy()

Similar to your solution, first groupby and then use to_numpy to convert each group to an array. Note that NumPy arrays must be rectangular, so groups with different numbers of rows (i.e. different ID lengths) cannot be combined into one regular 3-D array. This code therefore returns a 1-D array of arrays, with one 2-D array per ID.
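
A small self-contained sketch of the same idea, on a made-up frame with three IDs (the values and the names `features`, `groups` and `arr` are illustrative, not from the question). It assumes you want the groups in order of first appearance, so it passes `sort=False`; by default `groupby` sorts the group keys.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values): two IDs with 3 rows each and one ID with 2 rows.
df = pd.DataFrame({
    "HR":  [101, 102, 103, 210, 210, 210, 100, 100],
    "sBP": [51, 52, 53, 65, 65, 65, 50, 50],
    "dBP": [81, 82, 83, 90, 90, 90, 75, 75],
    "T":   [37.1, 37.2, 37.3, 36.1, 36.2, 36.3, 37.0, 37.0],
    "ID":  ["P1.1"] * 3 + ["P1.2"] * 3 + ["P2.1"] * 2,
})

features = ["HR", "sBP", "dBP", "T"]

# One 2-D array per ID, kept in order of first appearance.
groups = [g[features].to_numpy() for _, g in df.groupby("ID", sort=False)]

# Pack the ragged list into a 1-D object array, one element per ID.
arr = np.empty(len(groups), dtype=object)
for i, g in enumerate(groups):
    arr[i] = g

print(arr.shape)               # (3,)
print([a.shape for a in arr])  # [(3, 4), (3, 4), (2, 4)]
```

Filling a pre-allocated object array element by element avoids relying on NumPy to infer a ragged layout from a nested list.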

Output:

[array([[101. ,  51. ,  81. ,  37.1],
        [102. ,  52. ,  82. ,  37.2],
        [103. ,  53. ,  83. ,  37.3],
        [104. ,  54. ,  84. ,  37.4],
        [105. ,  55. ,  85. ,  37.5]])
  array([[210. ,  65. ,  90. ,  36.1],
        [210. ,  65. ,  90. ,  36.2],
        [210. ,  65. ,  90. ,  36.3],
        [210. ,  65. ,  90. ,  36.4],
        [210. ,  65. ,  90. ,  36.5]])
 ...
  array([[100.,  50.,  75.,  37.],
        [100.,  50.,  75.,  37.]])
 ...
  array([[100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.]])]
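
Because the groups have different lengths, one common workaround for getting a regular 3-D array is to pad the shorter series with NaN up to the longest length. The following is a minimal NumPy-only sketch of that idea; `pad_to_3d` is a hypothetical helper written for this example, not a library function. As far as I know, tslearn's `to_time_series_dataset` utility does a similar NaN padding for variable-length series, but check the documentation for your installed version.

```python
import numpy as np

def pad_to_3d(series_list):
    """Pad 2-D arrays of shape (n_i, d) with NaN into one (n_series, max_len, d) array."""
    max_len = max(s.shape[0] for s in series_list)
    n_dims = series_list[0].shape[1]
    out = np.full((len(series_list), max_len, n_dims), np.nan)
    for i, s in enumerate(series_list):
        # Copy the real observations; the remaining rows stay NaN.
        out[i, : s.shape[0], :] = s
    return out

# Illustrative ragged input: three series with 5, 2 and 3 time steps.
ragged = [np.ones((5, 4)), np.ones((2, 4)), np.ones((3, 4))]
padded = pad_to_3d(ragged)
print(padded.shape)  # (3, 5, 4)
```

Whether NaN padding is appropriate depends on the clustering method used downstream, so treat it as one option rather than the required format.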

If all IDs have the same number of rows, you can stack the NumPy array arr above to get a single array:

np.stack(arr)

[[[101.   51.   81.   37.1]
  [102.   52.   82.   37.2]
  [103.   53.   83.   37.3]
  [104.   54.   84.   37.4]
  [105.   55.   85.   37.5]]

 [[210.   65.   90.   36.1]
  [210.   65.   90.   36.2]
  [210.   65.   90.   36.3]
  [210.   65.   90.   36.4]
  [210.   65.   90.   36.5]]
...
 [[100.   50.   60.   37. ]
  [100.   50.   60.   37. ]
  [100.   50.   60.   37. ]
  [100.   50.   60.   37. ]
  [100.   50.   60.   37. ]]]
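
Since `np.stack` raises an error when the per-ID arrays differ in shape, it can help to check the shapes first. A minimal sketch, where `stack_if_regular` is a hypothetical helper name and the inputs are made-up arrays:

```python
import numpy as np

def stack_if_regular(arrays):
    """Return a 3-D array if all inputs share one shape; otherwise return None."""
    shapes = {a.shape for a in arrays}
    if len(shapes) == 1:
        return np.stack(list(arrays))
    return None

same = [np.zeros((5, 4)), np.ones((5, 4))]
mixed = [np.zeros((5, 4)), np.ones((2, 4))]

print(stack_if_regular(same).shape)  # (2, 5, 4)
print(stack_if_regular(mixed))       # None
```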

Solution 1 (2, accepted) 2020-06-10 01:13:44