Pandas: Dataframe itertuples boolean series groupby optimization
I'm new to Python. I have a data frame (DF), for example:
id | type
---|---
1 | A
1 | B
2 | C
2 | B
I would like to add a column, for example A_flag, computed per id group: it should be 1 for every row of an id that has at least one row of type A, and 0 otherwise. In the end I want this data frame (DF):
id | type | A_flag
---|---|---
1 | A | 1
1 | B | 1
2 | C | 0
2 | B | 0
I can do this in two steps:
DF['A_flag_tmp'] = [1 if x.type=='A' else 0 for x in DF.itertuples()]
DF['A_flag'] = DF.groupby(['id'])['A_flag_tmp'].transform(np.max)
It works, but it is very slow for a big data frame. Is there any way to optimize this case? Thanks for your help.
Replace the slow iterative code with fast vectorized code. For the first step, generate a boolean series with Pandas built-in functions instead of looping over itertuples(), e.g.
df['type'].eq('A')
Then you can chain it into the groupby statement for the second step, as follows:
df['A_flag'] = df['type'].eq('A').groupby(df['id']).transform('max').astype(int)
Result:
print(df)
id type A_flag
0 1 A 1
1 1 B 1
2 2 C 0
3 2 B 0
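As a quick sanity check, here is a minimal self-contained sketch that rebuilds the sample frame from the question and compares the original two-step approach with the vectorized one. The variable names `slow` and `fast` are only for illustration; they are not part of the original post.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"id": [1, 1, 2, 2], "type": ["A", "B", "C", "B"]})

# Original approach: Python-level loop over rows, then a grouped max
tmp = pd.Series(
    [1 if row.type == "A" else 0 for row in df.itertuples()],
    index=df.index,
)
slow = tmp.groupby(df["id"]).transform("max")

# Vectorized approach: boolean series, grouped max, cast to int
fast = df["type"].eq("A").groupby(df["id"]).transform("max").astype(int)

# Both approaches should give the same flags
print(fast.tolist())
print(slow.tolist() == fast.tolist())
```

The speed difference comes from the first step: `eq('A')` compares the whole column at once, while the list comprehension builds one Python tuple per row.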
In general, if you have more complicated conditions, you can also define them in a vectorized way, e.g. define the boolean series m by:
m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)
Then use it in step 2 as follows:
m.groupby(df['id']).transform('max').astype(int)
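To make the combined condition concrete, the sketch below runs it on made-up data. The columns `type1` and `type2` and all of their values are assumptions for illustration only; the question's data frame has just `id` and `type`. Note that `&` binds more tightly than `|`, so the mask reads as (type is A and type1 > 1) or (type2 is nonzero); the explicit parentheses here only make that visible.

```python
import pandas as pd

# Hypothetical data: type1 and type2 are invented for this example
df = pd.DataFrame({
    "id":    [1, 1, 2, 2, 3, 3],
    "type":  ["A", "B", "C", "B", "A", "C"],
    "type1": [2, 0, 5, 1, 1, 0],
    "type2": [0, 0, 0, 0, 0, 7],
})

# Row-wise boolean mask, built entirely from vectorized comparisons
m = (df["type"].eq("A") & df["type1"].gt(1)) | df["type2"].ne(0)

# A group gets flag 1 if any of its rows satisfies the mask
df["flag"] = m.groupby(df["id"]).transform("max").astype(int)

print(df)
```

Here id 1 is flagged through the first condition, id 3 through the second, and id 2 matches neither.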