
Pandas: Dataframe itertuples boolean series groupby optimization

I'm new to Python. I have a data frame (DF), for example:

id  type
1   A
1   B
2   C
2   B

I would like to add a column, for example A_flag, computed per id group: it should be 1 for every row of an id that has at least one row with type A, and 0 otherwise. In the end I want this data frame (DF):

id  type  A_flag
1   A     1
1   B     1
2   C     0
2   B     0

I can do this in two steps:

  • DF['A_flag_tmp'] = [1 if x.type=='A' else 0 for x in DF.itertuples()]
  • DF['A_flag'] = DF.groupby(['id'])['A_flag_tmp'].transform(np.max)

It works, but it's very slow for a big data frame. Is there any way to optimize this case? Thanks for the help.

Replace the slow row-by-row loop with vectorized code: generate the boolean Series of your first step with a pandas built-in method, e.g.

df['type'].eq('A')

Then group that boolean Series by id for the second step, as follows:

df['A_flag'] = df['type'].eq('A').groupby(df['id']).transform('max').astype(int)

Result

print(df)


   id type  A_flag
0   1    A       1
1   1    B       1
2   2    C       0
3   2    B       0
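
To check that the vectorized one-liner gives the same flags as the original two-step loop, here is a minimal sketch that rebuilds the sample data from the question and runs both. The variable names tmp, slow and fast are only illustrative and do not come from the original posts.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"id": [1, 1, 2, 2], "type": ["A", "B", "C", "B"]})

# Original approach: Python-level loop over rows, then a per-id maximum.
tmp = pd.Series(
    [1 if row.type == "A" else 0 for row in df.itertuples()],
    index=df.index,
)
slow = tmp.groupby(df["id"]).transform("max")

# Vectorized approach: build the boolean Series in one call,
# take the per-id maximum, then convert True/False to 1/0.
fast = df["type"].eq("A").groupby(df["id"]).transform("max").astype(int)

df["A_flag"] = fast
print(df)
```

Both versions should produce the same column. The gain comes from building the flag without a Python-level loop over the rows.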

In general, if you have more complicated conditions, you can define them in a vectorized way too, e.g. define the boolean Series m by:

m = (df['type'].eq('A') & df['type1'].gt(1)) | (df['type2'] != 0)

Then use it in step 2 as follows:

m.groupby(df['id']).transform('max').astype(int)    
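
As a rough illustration of the combined condition, the sketch below uses made-up values for type1 and type2 (the question only shows id and type, so this data is purely hypothetical) and stores the result in a column called flag, which is also just an example name. Since & binds tighter than | in Python, the explicit parentheses only make the grouping visible.

```python
import pandas as pd

# Hypothetical data: the type1 and type2 values are invented for illustration.
df = pd.DataFrame({
    "id":    [1, 1, 2, 2, 3, 3],
    "type":  ["A", "B", "C", "B", "A", "C"],
    "type1": [2, 0, 5, 0, 1, 0],
    "type2": [0, 0, 0, 0, 0, 3],
})

# Row-level condition: (type is A and type1 > 1) or type2 is non-zero.
m = (df["type"].eq("A") & df["type1"].gt(1)) | (df["type2"] != 0)

# An id gets flag 1 if at least one of its rows satisfies the condition.
df["flag"] = m.groupby(df["id"]).transform("max").astype(int)
print(df)
```

Here id 1 is flagged through its first row (type A with type1 greater than 1), id 3 through its last row (non-zero type2), and id 2 has no matching row.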
