Vectorization Pandas DataFrame
Assume the following simplified framework:
I have a Pandas dataframe of parameters that is effectively 3D: a MultiIndex over instances and classes, with 100 instances, 4 classes per instance and 4 features (columns):
import numpy as np
import pandas as pd

iterables = [list(range(100)), [0,1,2,3]]
index = pd.MultiIndex.from_product(iterables, names=['instances', 'classes'])
columns = ['a', 'b', 'c', 'd']
np.random.seed(42)
parameters = pd.DataFrame(np.random.randint(1, 2000, size=(len(index), len(columns))), index=index, columns=columns)
parameters
instances classes a b c d
0 0 1127 1460 861 1295
1 1131 1096 1725 1045
2 1639 122 467 1239
3 331 1483 88 1397
1 0 1124 872 1688 131
... ... ... ... ...
98 3 1321 1750 779 1431
99 0 1793 814 1637 1429
1 1370 1646 420 1206
2 983 825 1025 1855
3 1974 567 371 936
Let df be a dataframe that, for each instance and each feature (column), reports the observed class.
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 3, size=(100, len(columns))), index=list(range(100)),
columns=columns)
a b c d
0 2 0 2 2
1 0 0 2 1
2 2 2 2 2
3 0 2 1 0
4 1 1 1 1
.. .. .. .. ..
95 1 2 0 1
96 2 1 2 1
97 0 0 1 2
98 0 0 0 1
99 1 2 2 2
I would like to create a third dataframe (let's call it new_df) of shape (100, 4) containing the values from the dataframe parameters, selected according to the observed classes in the dataframe df.
For example, in the first row of df, for the first column (a), I observe class 2, so the value I am interested in is the one for class 2 in the first instance of the parameters dataframe, namely 1639, which will populate the first row and first column of new_df. Following this method, the first observation for column "b" is class 0, so in the first row, column b of new_df I would like to observe 1460, and so on.
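To make the lookup rule concrete, here is a minimal sketch on hypothetical toy data (toy_parameters and toy_df are illustrative names with hand-written values, not the frames from the question). It selects a single cell by combining the instance number with the class observed for that feature:

```python
import pandas as pd

# Hypothetical toy data: 2 instances x 3 classes, 2 features.
index = pd.MultiIndex.from_product(
    [[0, 1], [0, 1, 2]], names=['instances', 'classes']
)
toy_parameters = pd.DataFrame(
    {'a': [10, 11, 12, 20, 21, 22], 'b': [30, 31, 32, 40, 41, 42]},
    index=index,
)
# Observed class per instance (row) and feature (column).
toy_df = pd.DataFrame({'a': [2, 0], 'b': [0, 1]})

# Instance 0, feature 'a': the observed class is 2,
# so the wanted value sits in row (0, 2) of column 'a'.
observed_class = toy_df.loc[0, 'a']
value = toy_parameters.loc[(0, observed_class), 'a']
print(value)  # 12
```

The same rule applied to every (instance, feature) pair produces the full result frame.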
With a for loop I can obtain the desired result:
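new_df = pd.DataFrame(0, index=list(range(100)), columns=columns) # initialize the df
for i in range(len(df)):
    for c in df.columns:
        # row (instance i, observed class) of parameters, column c;
        # a single .loc call avoids chained assignment
        new_df.loc[i, c] = parameters.loc[(i, df.loc[i, c]), c]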
new_df
a b c d
0 1639 1460 467 1239
1 1124 872 806 344
2 1083 511 1706 1500
3 958 1155 1268 563
4 14 242 777 1370
.. ... ... ... ...
95 1435 1316 1709 755
96 346 712 363 815
97 1234 985 683 1348
98 127 1130 1009 1014
99 1370 825 1025 1855
However, the original dataset contains millions of rows and hundreds of columns, so a for loop is not feasible.
Is there a way to vectorize such a problem in order to avoid for loops (at least over one dimension)?
Reshape both DataFrames into a long format using stack, then perform the merge and reshape back to the wide format with unstack. There's a bunch of renaming just so we can reference and align the columns in the merge.
(df.rename_axis(index='instances', columns='cols').stack().to_frame('classes')
.merge(parameters.rename_axis(columns='cols').stack().rename('vals'),
on=['instances', 'classes', 'cols'])
.unstack(-1)['vals']
.rename_axis(index=None, columns=None)
)
a b c d
0 1639 1460 467 1239
1 1124 872 806 344
2 1083 511 1706 1500
3 958 1155 1268 563
4 14 242 777 1370
.. ... ... ... ...
95 1435 1316 1709 755
96 346 712 363 815
97 1234 985 683 1348
98 127 1130 1009 1014
99 1370 825 1025 1855
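As a possible alternative to the merge, the lookup could also be sketched directly in NumPy with np.take_along_axis. This is not part of the answer above; it assumes that parameters is the full, sorted product of instances and classes (as built with MultiIndex.from_product), that df has one row per instance in the same order, and that both frames share the same column order. The data below is regenerated with a different random generator for a self-contained example, so the values differ from those shown in the question.

```python
import numpy as np
import pandas as pd

columns = ['a', 'b', 'c', 'd']
n_instances, n_classes = 100, 4

# Example data with the same layout as in the question (values are arbitrary).
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_classes)], names=['instances', 'classes']
)
rng = np.random.default_rng(0)
parameters = pd.DataFrame(
    rng.integers(1, 2000, size=(len(index), len(columns))),
    index=index, columns=columns,
)
df = pd.DataFrame(
    rng.integers(0, n_classes, size=(n_instances, len(columns))),
    columns=columns,
)

# View the parameters as a (instances, classes, features) array.
# Assumption: rows are ordered instance-major, class-minor, with no gaps.
values = parameters.to_numpy().reshape(n_instances, n_classes, len(columns))

# For each (instance, feature) pick the entry along the class axis.
# The class indices need a length-1 class axis to line up with `values`.
class_idx = df.to_numpy()[:, None, :]
picked = np.take_along_axis(values, class_idx, axis=1)[:, 0, :]

new_df = pd.DataFrame(picked, index=df.index, columns=df.columns)
```

This avoids building the long-format intermediate, at the cost of relying on the row ordering of parameters; if that ordering is not guaranteed, calling parameters.sort_index() first, or using the merge-based approach, would be the safer choice.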