How to reshape a Pandas 2D MultiIndex DataFrame into a 3D NumPy array faster?

Question

I have the following code that works well:

import pandas as pd
import numpy as np

X = pd.DataFrame({'CaseID':[1,1,2,2],
              'col1':  [1,2,1,2],
              'col2':  [1,1,2,2]})
X.set_index(['CaseID','col1'], inplace=True) #MultiIndex

Unique_Cases = X.index.levels[0]
print(Unique_Cases)
#[1, 2]

D = [X.loc[Case].values for Case in Unique_Cases]
print(np.array(D).shape)
#(2, 2, 1)

But the problem is that I have 50 million records and it takes a long time (10 hours). Is there a faster way to turn a 2D pandas DataFrame into a 3D NumPy array?

Clarification:

len(X.loc[Case])

is not always the same length.

Answer 1

The solution is np.split:

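# Assumes CaseID is still a regular column (i.e. before set_index)
# and that the rows of X are sorted by CaseID.
case_counts = X.CaseID.value_counts().to_frame('counts').sort_index()
case_counts['count_cumsum'] = case_counts.counts.cumsum()
# drop the last row: np.split only needs the interior split points
case_counts.drop(case_counts.tail(1).index, inplace=True)
# `cat` is not defined in the original; presumably the list of columns to extract
cat_values = X[cat].values
cat_values = np.split(cat_values, case_counts.count_cumsum)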
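
For reference, here is a self-contained sketch of the same np.split idea on toy data with groups of unequal length. The column list `cols` is a hypothetical stand-in for the undefined `cat`, and the NaN-padding step at the end is an addition that is not part of the original answer: it is one possible way to get a rectangular 3D array when the cases differ in length.

```python
import numpy as np
import pandas as pd

# Toy data: case 1 has 3 rows, case 2 has 2 rows (unequal lengths).
X = pd.DataFrame({'CaseID': [1, 1, 1, 2, 2],
                  'col1':   [1, 2, 3, 1, 2],
                  'col2':   [1, 1, 1, 2, 2]})

# np.split cuts at row positions, so each case's rows must be contiguous.
X = X.sort_values('CaseID', kind='stable')

# Rows per case, in the same sorted order as the rows of X.
counts = X.groupby('CaseID', sort=True).size().to_numpy()

# Interior split points: cumulative counts without the final total.
split_points = np.cumsum(counts)[:-1]

cols = ['col1', 'col2']  # hypothetical stand-in for `cat`
values = X[cols].to_numpy()

# One 2D array per case; no Python-level .loc lookup per case.
blocks = np.split(values, split_points)
print([b.shape for b in blocks])

# Optional (assumption, not in the original answer): pad with NaN
# to obtain a dense array of shape (n_cases, max_len, n_cols).
n_cases, max_len = len(counts), counts.max()
padded = np.full((n_cases, max_len, len(cols)), np.nan)
case_idx = np.repeat(np.arange(n_cases), counts)        # which case each row belongs to
starts = np.repeat(np.cumsum(counts) - counts, counts)  # first row position of that case
row_idx = np.arange(len(X)) - starts                    # position of each row within its case
padded[case_idx, row_idx] = values
print(padded.shape)
```

The split itself is a single pass over the underlying array, which is why it tends to scale better than indexing with X.loc[Case] once per case.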

Answer 1 | 0 | 2019-03-04 14:03:41
