I need to create an array of arrays from the dataframe:
HR sBP dBP T ID
101 51 81 37.1 P1.1
102 52 82 37.2 P1.1
103 53 83 37.3 P1.1
104 54 84 37.4 P1.1
105 55 85 37.5 P1.1
210 65 90 36.1 P1.2
210 65 90 36.2 P1.2
210 65 90 36.3 P1.2
210 65 90 36.4 P1.2
210 65 90 36.5 P1.2
...
100 50 75 37 Pm.n
100 50 75 37 Pm.n
...
100 50 60 37.0 P1500.6
100 50 60 37.0 P1500.6
100 50 60 37.0 P1500.6
100 50 60 37.0 P1500.6
100 50 60 37.0 P1500.6
where each chunk is a multivariate time series with HR, sBP, dBP and T° as variables, and the ID
variable is the label for each subseries of data from each patient. The chunks for each patient are of variable length. I need to end up with an array like this:
array([[[101, 51, 81, 37.1],
        [102, 52, 82, 37.2],
        [103, 53, 83, 37.3],
        [104, 54, 84, 37.4],
        [105, 55, 85, 37.5]],

       [[210, 65, 90, 36.1],
        [210, 65, 90, 36.2],
        [210, 65, 90, 36.3],
        [210, 65, 90, 36.4],
        [210, 65, 90, 36.5]],

       ...

       [[100, 50, 60, 37.0],
        [100, 50, 60, 37.0],
        [100, 50, 60, 37.0],
        [100, 50, 60, 37.0],
        [100, 50, 60, 37.0]]])
With array.shape = (number of unique IDs, length of arrays, number of dimensions)
My code looks like this:
df_grp = df.groupby('ID')
for name, gp in df_grp:
    if name == 'P1.1':
        arr = gp.drop(columns=['ID']).to_numpy().reshape(-1, 4)
    else:
        temp_arr = gp.drop(columns=['ID']).to_numpy().reshape(-1, 4)
        arr = np.append(arr, temp_arr, axis=0)
But it gives me an array like this
array([[101, 51, 81, 37.1],
       [102, 52, 82, 37.2],
       [103, 53, 83, 37.3],
       [104, 54, 84, 37.4],
       [105, 55, 85, 37.5],
       [210, 65, 90, 36.1],
       [210, 65, 90, 36.2],
       [210, 65, 90, 36.3],
       [210, 65, 90, 36.4],
       [210, 65, 90, 36.5],
       ...
       [100, 50, 60, 37.0],
       [100, 50, 60, 37.0],
       [100, 50, 60, 37.0],
       [100, 50, 60, 37.0],
       [100, 50, 60, 37.0]])
With array.shape = (number of rows in df, number of dimensions). The result is the same with or without reshape, and likewise with squeeze. I need the array in the aforementioned format so I can use it in the tslearn package for multivariate time series clustering. Any help is greatly appreciated.
I think you are looking for this:
arr = df.set_index('ID').groupby('ID').apply(pd.DataFrame.to_numpy).to_numpy()
Similar to your solution, this first groups by 'ID' and then uses to_numpy to convert each group to an array. Please note that NumPy cannot represent a non-rectangular (ragged) result as a regular ndarray if the groups have different shapes (i.e. different lengths per ID). In that case this code returns an object array of per-ID arrays, which is the array of arrays you are looking for.
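For instance, running the one-liner on a small frame built from the sample above (a minimal sketch with abbreviated data, two patients of different lengths) shows the object-array result:

```python
import numpy as np
import pandas as pd

# Abbreviated sample: P1.1 has 3 rows, P1.2 has 2 rows
df = pd.DataFrame({
    'HR':  [101, 102, 103, 210, 210],
    'sBP': [51, 52, 53, 65, 65],
    'dBP': [81, 82, 83, 90, 90],
    'T':   [37.1, 37.2, 37.3, 36.1, 36.2],
    'ID':  ['P1.1', 'P1.1', 'P1.1', 'P1.2', 'P1.2'],
})

# Each group becomes one 2-D array; the outer result is an object array
arr = df.set_index('ID').groupby('ID').apply(pd.DataFrame.to_numpy).to_numpy()

print(arr.dtype)     # object
print(arr[0].shape)  # (3, 4)
print(arr[1].shape)  # (2, 4)
```

Because the two groups have different lengths, the outer array stays an object array of 2-D float arrays rather than a single 3-D block.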
output:
[array([[101. ,  51. ,  81. ,  37.1],
        [102. ,  52. ,  82. ,  37.2],
        [103. ,  53. ,  83. ,  37.3],
        [104. ,  54. ,  84. ,  37.4],
        [105. ,  55. ,  85. ,  37.5]])
 array([[210. ,  65. ,  90. ,  36.1],
        [210. ,  65. ,  90. ,  36.2],
        [210. ,  65. ,  90. ,  36.3],
        [210. ,  65. ,  90. ,  36.4],
        [210. ,  65. ,  90. ,  36.5]])
 ...
 array([[100.,  50.,  75.,  37.],
        [100.,  50.,  75.,  37.]])
 ...
 array([[100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.],
        [100.,  50.,  60.,  37.]])]
If all 'ID's have the same number of rows, you can stack the numpy array arr above to get a single 3-D array:
np.stack(arr)
[[[101.   51.   81.   37.1]
  [102.   52.   82.   37.2]
  [103.   53.   83.   37.3]
  [104.   54.   84.   37.4]
  [105.   55.   85.   37.5]]

 [[210.   65.   90.   36.1]
  [210.   65.   90.   36.2]
  [210.   65.   90.   36.3]
  [210.   65.   90.   36.4]
  [210.   65.   90.   36.5]]

 ...

 [[100.   50.   60.   37. ]
  [100.   50.   60.   37. ]
  [100.   50.   60.   37. ]
  [100.   50.   60.   37. ]
  [100.   50.   60.   37. ]]]
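If the IDs have different lengths, np.stack will raise an error. A common workaround for variable-length series is to pad the shorter ones with NaN up to the longest length, which is also how tslearn's to_time_series_dataset utility builds its 3-D dataset. A minimal NumPy sketch of that padding, assuming arr is a ragged list/object array of 2-D per-ID arrays as above:

```python
import numpy as np

# Hypothetical ragged input: two series of different lengths
arr = [np.array([[101., 51., 81., 37.1],
                 [102., 52., 82., 37.2],
                 [103., 53., 83., 37.3]]),
       np.array([[210., 65., 90., 36.1],
                 [210., 65., 90., 36.2]])]

max_len = max(a.shape[0] for a in arr)  # longest series
n_dims = arr[0].shape[1]                # number of variables (4 here)

# Allocate a NaN-filled 3-D block and copy each series into it
padded = np.full((len(arr), max_len, n_dims), np.nan)
for i, a in enumerate(arr):
    padded[i, :a.shape[0], :] = a

print(padded.shape)  # (2, 3, 4)
```

The trailing rows of the shorter series are NaN, which tslearn's clustering routines can handle as variable-length markers.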