I need to create an array of arrays from the dataframe:
HR sBP dBP T ID
101 51 81 37.1 P1.1
102 52 82 37.2 P1.1
103 53 83 37.3 P1.1
104 54 84 37.4 P1.1
105 55 85 37.5 P1.1
210 65 90 36.1 P1.2
210 65 90 36.2 P1.2
210 65 90 36.3 P1.2
210 65 90 36.4 P1.2
210 65 90 36.5 P1.2
...
100 50 75 37 Pm.n
100 50 75 37 Pm.n
...
100 50 60 37.0 P1500.6
100 50 60 37.0 P1500.6
100 50 60 37.0 P1500.6
100 50 60 37.0 P1500.6
100 50 60 37.0 P1500.6
where each chunk is a multivariate time series with HR, sBP, dBP and T° as variables, and the ID
variable is the label for each subseries of data from each patient. The chunks for each patient are of variable length. I need to end up with an array like this:
array([[[101, 51, 81, 37.1],
        [102, 52, 82, 37.2],
        [103, 53, 83, 37.3],
        [104, 54, 84, 37.4],
        [105, 55, 85, 37.5]],

       [[210, 65, 90, 36.1],
        [210, 65, 90, 36.2],
        [210, 65, 90, 36.3],
        [210, 65, 90, 36.4],
        [210, 65, 90, 36.5]],

       ...

       [[100, 50, 60, 37.0],
        [100, 50, 60, 37.0],
        [100, 50, 60, 37.0],
        [100, 50, 60, 37.0],
        [100, 50, 60, 37.0]]])
With array.shape = (number of unique IDs, length of arrays, number of dimensions)
My code looks like this:
df_grp = df.groupby('ID')
for name, gp in df_grp:
    if name == 'P1.1':
        arr = gp.drop(columns=['ID']).to_numpy().reshape(-1, 4)
    else:
        temp_arr = gp.drop(columns=['ID']).to_numpy().reshape(-1, 4)
        arr = np.append(arr, temp_arr, axis=0)
But it gives me an array like this
array([[101, 51, 81, 37.1],
       [102, 52, 82, 37.2],
       [103, 53, 83, 37.3],
       [104, 54, 84, 37.4],
       [105, 55, 85, 37.5],
       [210, 65, 90, 36.1],
       [210, 65, 90, 36.2],
       [210, 65, 90, 36.3],
       [210, 65, 90, 36.4],
       [210, 65, 90, 36.5],
       ...
       [100, 50, 60, 37.0],
       [100, 50, 60, 37.0],
       [100, 50, 60, 37.0],
       [100, 50, 60, 37.0],
       [100, 50, 60, 37.0]])
With array.shape = (number of rows in df, number of dimensions). The result is the same with or without reshape, and likewise with squeeze. I need the array in the aforementioned format so I can use it in the tslearn package for multivariate time series clustering. Any help is greatly appreciated.
I think you are looking for this:
arr = df.set_index('ID').groupby('ID').apply(pd.DataFrame.to_numpy).to_numpy()
Similar to your solution, this first groups by 'ID' and then uses to_numpy to convert each group to an array. Please note that NumPy cannot represent a non-rectangular (ragged) result as a regular ndarray if the groups have different shapes (i.e. different lengths per ID). In that case this code returns an object array of per-ID arrays, which is the array of arrays you are looking for.
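For instance, running the one-liner on a small frame built from the sample above (a minimal sketch with abbreviated data, two patients of different lengths) shows the object-array result:

```python
import numpy as np
import pandas as pd

# Abbreviated sample: P1.1 has 3 rows, P1.2 has 2 rows
df = pd.DataFrame({
    'HR':  [101, 102, 103, 210, 210],
    'sBP': [51, 52, 53, 65, 65],
    'dBP': [81, 82, 83, 90, 90],
    'T':   [37.1, 37.2, 37.3, 36.1, 36.2],
    'ID':  ['P1.1', 'P1.1', 'P1.1', 'P1.2', 'P1.2'],
})

# Each group becomes one 2-D array; the outer result is an object array
arr = df.set_index('ID').groupby('ID').apply(pd.DataFrame.to_numpy).to_numpy()

print(arr.dtype)     # object
print(arr[0].shape)  # (3, 4)
print(arr[1].shape)  # (2, 4)
```

Because the two groups have different lengths, the outer array stays an object array of 2-D float arrays rather than a single 3-D block.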
output:
[array([[101. ,  51. ,  81. ,  37.1],
        [102. ,  52. ,  82. ,  37.2],
        [103. ,  53. ,  83. ,  37.3],
        [104. ,  54. ,  84. ,  37.4],
        [105. ,  55. ,  85. ,  37.5]])
 array([[210. ,  65. ,  90. ,  36.1],
        [210. ,  65. ,  90. ,  36.2],
        [210. ,  65. ,  90. ,  36.3],
        [210. ,  65. ,  90. ,  36.4],
        [210. ,  65. ,  90. ,  36.5]])
 ...
 array([[100.,  50.,  75.,  37.],
        [100.,  50.,  75.,  37.]])
 ...
 array([[100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.]])]
If all 'ID's have the same number of rows, you can stack the numpy array arr above to get a single 3-D array:
np.stack(arr)
[[[101.   51.   81.   37.1]
  [102.   52.   82.   37.2]
  [103.   53.   83.   37.3]
  [104.   54.   84.   37.4]
  [105.   55.   85.   37.5]]

 [[210.   65.   90.   36.1]
  [210.   65.   90.   36.2]
  [210.   65.   90.   36.3]
  [210.   65.   90.   36.4]
  [210.   65.   90.   36.5]]

 ...

 [[100.   50.   60.   37. ]
  [100.   50.   60.   37. ]
  [100.   50.   60.   37. ]
  [100.   50.   60.   37. ]
  [100.   50.   60.   37. ]]]
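If the IDs have different lengths, np.stack will raise an error. A common workaround for variable-length series is to pad the shorter ones with NaN up to the longest length, which is also how tslearn's to_time_series_dataset utility builds its 3-D dataset. A minimal NumPy sketch of that padding, assuming arr is a ragged list/object array of 2-D per-ID arrays as above:

```python
import numpy as np

# Hypothetical ragged input: two series of different lengths
arr = [np.array([[101., 51., 81., 37.1],
                 [102., 52., 82., 37.2],
                 [103., 53., 83., 37.3]]),
       np.array([[210., 65., 90., 36.1],
                 [210., 65., 90., 36.2]])]

max_len = max(a.shape[0] for a in arr)  # longest series
n_dims = arr[0].shape[1]                # number of variables (4 here)

# Allocate a NaN-filled 3-D block and copy each series into it
padded = np.full((len(arr), max_len, n_dims), np.nan)
for i, a in enumerate(arr):
    padded[i, :a.shape[0], :] = a

print(padded.shape)  # (2, 3, 4)
```

The trailing rows of the shorter series are NaN, which tslearn's clustering routines can handle as variable-length markers.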