
Pandas groupby create new column based on a condition

In the table below, I want to produce the column new area for each group defined by the address-related fields X, Y, Z (group by X, Y, Z). Within a group, if the value in the code column is A, count that area only once, then add the areas for the other codes.

So for the first group (222 North St, Seattle, WA), the new area should be 100 (A) + 200 (B) + 300 (C) = 600. Note that I can't simply take the sum, since A appears twice. I want only one area for value A to be counted in the sum, not all of them.


To reproduce the table:

import pandas as pd

df = pd.DataFrame()
df['X'] = ['222 North St','222 North St','222 North St','222 North St','115 John St','115 John St','115 John St']
df['Y'] = ['Seattle','Seattle','Seattle','Seattle','Chicago','Chicago','Chicago']
df['Z'] = ['WA','WA','WA','WA','IL','IL','IL']
df['code'] = ['A','B','A','C','A','A','B']
df['area'] = [100,200,100,300,200,200,50]

This works, but I'm not sure it's the most efficient way. Since you didn't specify which row to take when a code appears more than once, I assumed the duplicates hold the same value for area and so dropped them.

import pandas as pd 

df = pd.DataFrame()
df['X'] = ['222 North St','222 North St','222 North St','222 North St','115 John St','115 John St','115 John St']
df['Y'] = ['Seattle','Seattle','Seattle','Seattle','Chicago','Chicago','Chicago']
df['Z'] = ['WA','WA','WA','WA','IL','IL','IL']
df['code'] = ['A','B','A','C','A','A','B']
df['area'] = [100,200,100,300,200,200,50]

df2 = df.drop_duplicates(subset=['X','Y','Z','code']).groupby(['X','Y','Z']).agg({'area':'sum'}).reset_index()
df = pd.merge(df,df2,how='left',on=['X','Y','Z']).rename(columns={'area_x':'area','area_y':'area sum'})
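Note that dropping duplicates on ['X','Y','Z','code'] collapses every repeated code, not just A. If only code A should be counted once, the deduplication has to be restricted to the A rows. A minimal sketch of the difference, using hypothetical data (one address where both A and B are repeated) that is not part of the question:

```python
import pandas as pd

# Hypothetical data (not from the question): A and B are both repeated.
df = pd.DataFrame({
    'X': ['222 North St'] * 4,
    'Y': ['Seattle'] * 4,
    'Z': ['WA'] * 4,
    'code': ['A', 'A', 'B', 'B'],
    'area': [100, 100, 200, 200],
})

keys = ['X', 'Y', 'Z']

# Deduplicating on every code counts B only once as well.
all_codes = df.drop_duplicates(subset=keys + ['code']).groupby(keys)['area'].sum()

# Deduplicating only the A rows keeps both B rows in the sum.
only_a = pd.concat([
    df[df['code'] != 'A'],
    df[df['code'] == 'A'].drop_duplicates(subset=keys + ['code']),
]).groupby(keys)['area'].sum()

print(all_codes.iloc[0])  # 300 = 100 (A) + 200 (B)
print(only_a.iloc[0])     # 500 = 100 (A) + 200 (B) + 200 (B)
```

On the sample data from the question both variants give the same result, because only A is repeated there.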

Also, if you had included the DataFrame construction code in the question yourself, you'd attract more people to try and answer it.

EDIT: to drop duplicates only for code A (starting again from the original df):

# drop duplicates but only for code = A
df_A = df[df['code']=='A'].drop_duplicates(subset=['X','Y','Z','code'])

# groupby and sum now that A only appears once - this creates the 'area sum'
df2 = pd.concat([df[df['code']!='A'],df_A]).groupby(['X','Y','Z']).agg({'area':'sum'}).reset_index()

# merge onto original dataframe
df = pd.merge(df,df2,how='left',on=['X','Y','Z']).rename(columns={'area_x':'area','area_y':'area sum'})
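The same result can probably be reached without the intermediate frame and merge. The following is only a sketch of an alternative, not the method above: it assumes, as the answer does, that repeated A rows within a group carry the same area, and it uses `DataFrame.duplicated` plus `groupby(...).transform('sum')` to write the per-group total straight back onto each row. The name `is_repeat_a` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'X': ['222 North St'] * 4 + ['115 John St'] * 3,
    'Y': ['Seattle'] * 4 + ['Chicago'] * 3,
    'Z': ['WA'] * 4 + ['IL'] * 3,
    'code': ['A', 'B', 'A', 'C', 'A', 'A', 'B'],
    'area': [100, 200, 100, 300, 200, 200, 50],
})

# True for every A row after the first one within its (X, Y, Z) group.
is_repeat_a = (df['code'] == 'A') & df.duplicated(subset=['X', 'Y', 'Z', 'code'])

# Treat repeated A rows as 0, then broadcast the group sum to every row.
df['area sum'] = (
    df['area']
    .where(~is_repeat_a, 0)
    .groupby([df['X'], df['Y'], df['Z']])
    .transform('sum')
)

print(df)
```

With the sample data, every row of the Seattle group gets 600 and every row of the Chicago group gets 250 (200 for A counted once, plus 50 for B).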
