Vectorizing a Pandas DataFrame

Question

Assume the following simplified framework:

I have a Pandas dataframe of parameters indexed by a two-level MultiIndex (100 instances by 4 classes, so 400 rows), with 4 feature columns:

import numpy as np
import pandas as pd

iterables = [list(range(100)), [0,1,2,3]]
index = pd.MultiIndex.from_product(iterables, names=['instances', 'classes'])
columns = ['a', 'b', 'c', 'd']
np.random.seed(42)
parameters = pd.DataFrame(np.random.randint(1, 2000, size=(len(index), len(columns))), index=index, columns=columns)

parameters

instances classes   a     b     c     d                     
0         0        1127  1460   861  1295
          1        1131  1096  1725  1045
          2        1639   122   467  1239
          3         331  1483    88  1397
1         0        1124   872  1688   131
...                 ...   ...   ...   ...
98        3        1321  1750   779  1431
99        0        1793   814  1637  1429
          1        1370  1646   420  1206
          2         983   825  1025  1855
          3        1974   567   371   936

Let df be a dataframe that, for each instance and each feature (column), reports the observed class.

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 3, size=(100, len(columns))), index=list(range(100)), 
                  columns=columns)
    a  b  c  d
0   2  0  2  2
1   0  0  2  1
2   2  2  2  2
3   0  2  1  0
4   1  1  1  1
.. .. .. .. ..
95  1  2  0  1
96  2  1  2  1
97  0  0  1  2
98  0  0  0  1
99  1  2  2  2

I would like to create a third dataframe (let's call it new_df) of shape (100, 4) containing the values from the parameters dataframe, selected according to the classes observed in df.

For example, in the first row of df, column a, I observe class 2, so the value I am interested in is the one for class 2 in the first instance of the parameters dataframe, namely 1639, which will populate the first row, column a of new_df. Following the same method, the first observation for column b is class 0, so in the first row, column b of new_df I would like to see 1460, and so on.
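The lookup rule can be sketched for the first instance alone. The values below are copied from the four rows of instance 0 in the parameters printout and from the first row of df (2, 0, 2, 2); the names params0, observed0 and row0 are only illustrative.

```python
import pandas as pd

# The four class rows of instance 0, copied from the parameters printout above.
index = pd.MultiIndex.from_product([[0], [0, 1, 2, 3]],
                                   names=['instances', 'classes'])
params0 = pd.DataFrame(
    [[1127, 1460, 861, 1295],
     [1131, 1096, 1725, 1045],
     [1639, 122, 467, 1239],
     [331, 1483, 88, 1397]],
    index=index, columns=['a', 'b', 'c', 'd'])

# Observed classes for instance 0 (first row of df).
observed0 = {'a': 2, 'b': 0, 'c': 2, 'd': 2}

# For each column, take the value stored at (instance 0, observed class).
row0 = {col: int(params0.loc[(0, cls), col]) for col, cls in observed0.items()}
print(row0)  # {'a': 1639, 'b': 1460, 'c': 467, 'd': 1239}
```

This matches the first row of the loop's output further down.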

With a for loop I can obtain the desired result:

new_df = pd.DataFrame(0, index=list(range(100)), columns=columns) # initialize the df
for i in range(len(df)):
    for c in df.columns:
        # single .loc calls avoid chained indexing, which may assign to a temporary copy
        new_df.loc[i, c] = parameters.loc[(i, df.loc[i, c]), c]

new_df

    a     b      c    d
0   1639  1460   467  1239
1   1124   872   806   344
2   1083   511  1706  1500
3    958  1155  1268   563
4     14   242   777  1370
..   ...   ...   ...   ...
95  1435  1316  1709   755
96   346   712   363   815
97  1234   985   683  1348
98   127  1130  1009  1014
99  1370   825  1025  1855

However, the original dataset contains millions of rows and hundreds of columns, so a for loop is infeasible.

Is there a way to vectorize this problem to avoid the for loops, at least over one dimension?

Answer 1

Reshape both DataFrames into a long format with stack, merge them, then reshape back to the wide format with unstack. The renaming is only there so that the columns can be referenced and aligned in the merge.

(df.rename_axis(index='instances', columns='cols').stack().to_frame('classes')
   .merge(parameters.rename_axis(columns='cols').stack().rename('vals'),
          on=['instances', 'classes', 'cols'])
   .unstack(-1)['vals']
   .rename_axis(index=None, columns=None)
)

       a     b     c     d
0   1639  1460   467  1239
1   1124   872   806   344
2   1083   511  1706  1500
3    958  1155  1268   563
4     14   242   777  1370
..   ...   ...   ...   ...
95  1435  1316  1709   755
96   346   712   363   815
97  1234   985   683  1348
98   127  1130  1009  1014
99  1370   825  1025  1855
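As an alternative to the merge, the same selection can be sketched in NumPy with np.take_along_axis. This is not part of the answer above. It assumes that every instance has all classes, that the class labels are 0..n_classes-1, and that the rows of df are in the same instance order as parameters. The data below is a small stand-in (3 instances, generated with a different random generator), not the question's 100-instance data.

```python
import numpy as np
import pandas as pd

# Small stand-in data (assumption: 3 instances instead of 100).
n_instances, n_classes = 3, 4
columns = ['a', 'b', 'c', 'd']
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_classes)], names=['instances', 'classes'])
rng = np.random.default_rng(0)
parameters = pd.DataFrame(
    rng.integers(1, 2000, size=(len(index), len(columns))),
    index=index, columns=columns)
df = pd.DataFrame(
    rng.integers(0, n_classes, size=(n_instances, len(columns))),
    columns=columns)

# View parameters as a 3D array with axes (instance, class, feature).
# Sorting the index guarantees the row order that the reshape relies on.
cube = parameters.sort_index().to_numpy().reshape(
    n_instances, n_classes, len(columns))

# For each (instance, feature), pick the entry at the observed class (axis 1).
picked = np.take_along_axis(cube, df.to_numpy()[:, None, :], axis=1)[:, 0, :]
new_df = pd.DataFrame(picked, index=df.index, columns=df.columns)

# Cross-check against explicit label-based lookups.
expected = pd.DataFrame(
    [[parameters.loc[(i, df.loc[i, c]), c] for c in columns] for i in df.index],
    index=df.index, columns=columns)
assert np.array_equal(new_df.to_numpy(), expected.to_numpy())
```

This avoids building the long-format intermediate frames, at the cost of relying on positional alignment rather than on labels. If instances or classes can be missing, the merge is the safer choice.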

Answer 1
3, accepted, 2021-01-25 17:33:10