How can I speed up data labelling for a large pandas DataFrame?

I have a large pandas DataFrame that looks roughly like this:

  Identity  periods      one        two       three     Label
0   one      1       -0.462407    0.022811  -0.277357
1   one      1       -0.617588    1.667191  -0.370436
2   one      2       -0.604699    0.635473  -0.556088
3   one      2       -0.852943    1.087415  -0.784377
4   two      3        0.421453    2.390097   0.176333
5   two      3       -0.447321   -1.215280  -0.187156
6   two      4        0.398953   -0.334095  -1.194132
7   two      4       -0.324348   -0.842357   0.970825

I need to categorise the data according to groupings in the various columns. For example, one of my categorisation criteria is to give each group in the Identity column a label if it has between x and y periods in the periods column.

The code I use for this categorisation looks like this, and generates the final column:

for i in df['Identity'].unique():
    # Label every row of this identity if its maximum period is between 2 and 5
    if 2 <= df[df['Identity'] == i]['periods'].max() <= 5:
        df.loc[df['Identity'] == i, 'Label'] = 'label 1'

I have also tried a version using

df.groupby('Identity').apply().

But this is no quicker.

My data is approximately 2.8 million rows at the moment, with about 900 unique identities. The code takes about 5 minutes to run, which suggests to me that it is the code inside the loop that is slow, rather than the looping itself.

You can speed this up by using only vectorized pandas operations instead of explicit loops or .apply(), which typically also runs a relatively slow Python-level loop internally.

Use .groupby() and .transform() to broadcast the max() of periods within each group back to every row, which gives a Series to build a mask from. Then use .loc[] with the mask for the condition 2 <= max <= 5 to set the label on the rows that satisfy it.
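To illustrate the broadcasting step, here is a small sketch contrasting a plain per-group reduction with .transform(). The toy frame only mirrors the Identity and periods columns from the question, and the names per_group_max, row_max and mask are made up for this illustration.

```python
import pandas as pd

# Toy frame mirroring the question's Identity/periods columns (assumed values).
df = pd.DataFrame({
    "Identity": ["one", "one", "one", "one", "two", "two", "two", "two"],
    "periods": [1, 1, 2, 2, 3, 3, 4, 4],
})

# Plain reduction: one value per group, indexed by Identity.
per_group_max = df.groupby("Identity")["periods"].max()

# transform: the same per-group maximum repeated on every row of the group,
# so the result shares df's index and can be used directly as a row mask.
row_max = df.groupby("Identity")["periods"].transform("max")

# Boolean mask for the condition 2 <= max <= 5, one entry per row of df.
mask = (row_max >= 2) & (row_max <= 5)

print(per_group_max)
print(row_max)
```

Because row_max is aligned with the original rows, no merge or loop is needed to map the group result back onto the frame.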

This assumes that all rows of the same Identity group get the same label whenever the maximum period within the group satisfies 2 <= max <= 5.

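# Per-row copy of each Identity group's maximum period
m = df.groupby('Identity')['periods'].transform('max')
# Label only the rows whose group maximum lies in [2, 5]
df.loc[(m >= 2) & (m <= 5), 'Label'] = 'label 1'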


print(df)

  Identity  periods       one       two     three    Label
0      one        1 -0.462407  0.022811 -0.277357  label 1
1      one        1 -0.617588  1.667191 -0.370436  label 1
2      one        2 -0.604699  0.635473 -0.556088  label 1
3      one        2 -0.852943  1.087415 -0.784377  label 1
4      two        3  0.421453  2.390097  0.176333  label 1
5      two        3 -0.447321 -1.215280 -0.187156  label 1
6      two        4  0.398953 -0.334095 -1.194132  label 1
7      two        4 -0.324348 -0.842357  0.970825  label 1
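If you want to check on your own machine that the vectorized version matches the original loop, a rough sketch along these lines may help. The synthetic data, the sizes, and the helper names label_with_loop and label_vectorized are all assumptions made for illustration, and Series.between(2, 5) is used as an equivalent spelling of the two-sided comparison (it is inclusive on both ends by default). Timings will vary, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data (assumed shape): ~900 identities, far fewer rows than 2.8 million.
rng = np.random.default_rng(0)
n_rows, n_ids = 100_000, 900
ids = rng.integers(0, n_ids, size=n_rows)
cap = rng.integers(1, 10, size=n_ids)  # per-identity upper bound on periods
periods = (rng.random(n_rows) * cap[ids]).astype(int) + 1  # values in 1..cap
df = pd.DataFrame({"Identity": ids, "periods": periods})


def label_with_loop(frame):
    """Loop from the question: two full boolean scans per identity."""
    out = frame.copy()
    for i in out["Identity"].unique():
        rows = out["Identity"] == i
        if 2 <= out.loc[rows, "periods"].max() <= 5:
            out.loc[rows, "Label"] = "label 1"
    return out


def label_vectorized(frame):
    """groupby/transform version: one grouped pass plus one masked assignment."""
    out = frame.copy()
    m = out.groupby("Identity")["periods"].transform("max")
    out.loc[m.between(2, 5), "Label"] = "label 1"
    return out


start = time.perf_counter()
loop_result = label_with_loop(df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vec_result = label_vectorized(df)
vec_seconds = time.perf_counter() - start

# Compare which rows ended up labelled in each version.
loop_labelled = loop_result["Label"].eq("label 1")
vec_labelled = vec_result["Label"].eq("label 1")
print("same rows labelled:", loop_labelled.equals(vec_labelled))
print(f"loop: {loop_seconds:.3f}s, vectorized: {vec_seconds:.3f}s")
```

The loop rescans the whole frame for every identity, so its cost grows with rows times identities, whereas the transform version makes a single grouped pass.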
