How to group Pandas rows by a function of multiple columns
I have a dataframe with records describing the roof surfaces of buildings. Each building has multiple roof planes, each with an area and a description of its form, e.g.:
import pandas as pd

df = pd.DataFrame([[1000, 12, 'slope'],
                   [1000, 10, 'flat'],
                   [1001, 10, 'slope'],
                   [1001, 15, 'flat'],
                   [1001, 7, 'slope']],
                  index=[1, 2, 3, 4, 5],
                  columns=['building_id', 'area', 'form'])
df
building_id area form
1 1000 12 slope
2 1000 10 flat
3 1001 10 slope
4 1001 15 flat
5 1001 7 slope
I want to combine the rows so that I have one row per building, with the total roof area and the predominant roof form, i.e. the form that has the greatest area for that building, not the form that appears most frequently:
df_out
building_id area form
1 1000 22 slope
2 1001 32 slope
I need something like this:
group_functions={'area' : ['sum'],
'form' : lambda x: find_predominant(x)}
df_out = df.groupby('building_id').agg(group_functions)
But find_predominant needs to be a function of area as well as form: it returns the string 'flat' or 'slope' depending on which has the biggest area for that building_id.
What is the function find_predominant? Or what script will have the same effect?
My suggestion would be to calculate the sum and call the find_predominant function separately, since that will require a call to apply.
g = df.groupby('building_id')
area = g['area'].sum()
form = g.apply(find_predominant)
df_out = pd.concat([area, form], axis=1)
Now, for this to work, please recognise that find_predominant should accept a DataFrame and access the "area" and "form" columns appropriately.
def find_predominant(df):
    ar = df['area']
    fm = df['form']
    ... # Do something with ar and fm
    return result
That may or may not require refactoring on your part.
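As an illustration only, here is one way the body of find_predominant could be filled in. It assumes that "predominant" means the form with the largest summed area within a building, as described in the question; the inner groupby and the explicit column selection before apply are choices made for this sketch, not part of the original suggestion.

```python
import pandas as pd

df = pd.DataFrame(
    [[1000, 12, 'slope'],
     [1000, 10, 'flat'],
     [1001, 10, 'slope'],
     [1001, 15, 'flat'],
     [1001, 7, 'slope']],
    index=[1, 2, 3, 4, 5],
    columns=['building_id', 'area', 'form'],
)

def find_predominant(group):
    # Assumption: sum the area per form within one building,
    # then return the form label with the largest total.
    return group.groupby('form')['area'].sum().idxmax()

g = df.groupby('building_id')
area = g['area'].sum()
# Selecting the two columns explicitly keeps the grouping column out of
# the frame that is passed to find_predominant.
form = g[['area', 'form']].apply(find_predominant).rename('form')
df_out = pd.concat([area, form], axis=1).reset_index()
print(df_out)
```

For the sample data this gives 'slope' for both buildings (12 vs 10 for building 1000, 17 vs 15 for building 1001).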
Edit: Okay, so you don't know what that function is. In that case, let's get rid of it.
Try this.
area = df.groupby('building_id')['area'].sum()
form = (df.groupby(['building_id', 'form'])['area']
          .sum()
          .groupby(level=0)
          .idxmax()
          .str[1])
form.name = 'form'
df_out = pd.concat([area, form], axis=1).reset_index()
print(df_out)
building_id area form
0 1000 22 slope
1 1001 32 slope
This will select, for each building_id, the form with the largest total area (summed over all planes of that form).
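To make the chained expression easier to follow, the same steps can be split into named intermediates. This is only a walk-through of the code above on the sample data; the variable names totals and winners are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1000, 12, 'slope'],
     [1000, 10, 'flat'],
     [1001, 10, 'slope'],
     [1001, 15, 'flat'],
     [1001, 7, 'slope']],
    index=[1, 2, 3, 4, 5],
    columns=['building_id', 'area', 'form'],
)

# Step 1: total area per (building_id, form) pair, indexed by a MultiIndex.
totals = df.groupby(['building_id', 'form'])['area'].sum()

# Step 2: within each building (index level 0), idxmax returns the full
# index label of the largest total, i.e. a (building_id, form) tuple.
winners = totals.groupby(level=0).idxmax()

# Step 3: .str[1] picks the second element of each tuple, the form.
form = winners.str[1]
form.name = 'form'

print(totals)
print(winners)
print(form)
```

Building 1001 shows why the sum matters: its largest single plane is flat (15), but the two sloped planes together cover 17.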
If the form with the maximum summed area is not required, and you simply want the form of the single largest plane, then the solution simplifies.
g = df.groupby('building_id')['area']
area = g.sum()
# g.idxmax() returns index labels, so look the rows up with .loc rather than .iloc
form = df.loc[g.idxmax()].set_index('building_id')['form']
df_out = pd.concat([area, form], axis=1).reset_index()
print(df_out)
building_id area form
0 1000 22 slope
1 1001 32 flat
You can use sort_values and assign the value after agg:
(df.groupby(['building_id', 'form'])['area']
   .sum()
   .sort_values()
   .reset_index(level=1)
   .groupby(level=0)
   .agg({'form': 'last', 'area': 'sum'}))
form area
building_id
1000 slope 22
1001 slope 32
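The approach works because sort_values orders the per-form totals in ascending order and groupby keeps that row order within each building, so 'last' picks the form with the largest total. A variant using named aggregation, sketched below, produces a flat frame in the column order asked for in the question; as_index=False and the keyword-style agg call are choices made for this sketch, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    [[1000, 12, 'slope'],
     [1000, 10, 'flat'],
     [1001, 10, 'slope'],
     [1001, 15, 'flat'],
     [1001, 7, 'slope']],
    index=[1, 2, 3, 4, 5],
    columns=['building_id', 'area', 'form'],
)

# Total area per (building_id, form), kept as ordinary columns.
summed = df.groupby(['building_id', 'form'], as_index=False)['area'].sum()

# After an ascending sort by area, the last row within each building
# is the form with the largest total area.
df_out = (summed.sort_values('area')
                .groupby('building_id', as_index=False)
                .agg(area=('area', 'sum'), form=('form', 'last')))
print(df_out)
```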