
How to group Pandas rows by a function of multiple columns

I have a dataframe with records describing the roof surfaces of buildings. Each building has multiple planes, each with an area and a description of its form, e.g.:

import pandas as pd

df = pd.DataFrame([[1000, 12, 'slope'],
                   [1000, 10, 'flat'],
                   [1001, 10, 'slope'],
                   [1001, 15, 'flat'],
                   [1001, 7, 'slope']],
                  index=[1, 2, 3, 4, 5],
                  columns=['building_id', 'area', 'form'])
df
   building_id  area   form
1         1000    12  slope
2         1000    10   flat
3         1001    10  slope
4         1001    15   flat
5         1001     7  slope

I want to combine the rows so that I have one row per building, with the total roof area and the predominant roof form, i.e. the form that has the greatest area for that building, not the form that appears most frequently:

df_out
   building_id  area   form
1         1000    22  slope
2         1001    32  slope

I need something like this:

group_functions={'area' : ['sum'],
                 'form' : lambda x: find_predominant(x)}
df_out = df.groupby('building_id').agg(group_functions)

But find_predominant needs to be a function of area as well as form: it returns the string 'flat' or 'slope' depending on which has the biggest area for that building_id.

What is the function find_predominant? Or what script will have the same effect?

My suggestion would be to calculate the sum and call the find_predominant function separately, since the latter will require a call to apply.

g = df.groupby('building_id')
area = g['area'].sum()
form = g.apply(find_predominant) 

df_out = pd.concat([area, form], axis=1)

For this to work, find_predominant should accept a DataFrame and access the "area" and "form" columns appropriately.

def find_predominant(df):
    ar = df['area']
    fm = df['form']
    ... # Do something with ar and fm

    return result

That may or may not require refactoring on your part.
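As a minimal sketch of what such a function could look like, assuming "predominant" means the form with the largest summed area within one building: the body of find_predominant below is a hypothetical illustration and not code from the question.

```python
import pandas as pd

df = pd.DataFrame(
    [[1000, 12, 'slope'],
     [1000, 10, 'flat'],
     [1001, 10, 'slope'],
     [1001, 15, 'flat'],
     [1001, 7, 'slope']],
    index=[1, 2, 3, 4, 5],
    columns=['building_id', 'area', 'form'],
)

def find_predominant(group):
    # Hypothetical implementation: total the area per form within one
    # building, then return the form label with the largest total.
    return group.groupby('form')['area'].sum().idxmax()

g = df.groupby('building_id')
area = g['area'].sum()
# Select the two columns explicitly so the grouping column is not passed in.
form = g[['area', 'form']].apply(find_predominant).rename('form')

df_out = pd.concat([area, form], axis=1).reset_index()
print(df_out)
```

Because apply calls a Python function once per group, this is likely to be slower on large frames than an approach built only from groupby reductions.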


Edit: Okay, so you don't know what that function is. In that case, let's get rid of it.

Try this.

area = df.groupby('building_id')['area'].sum()
form = (df.groupby(['building_id', 'form'])['area']
          .sum()
          .groupby(level=0)
          .idxmax()
          .str[1])
form.name = 'form'

df_out = pd.concat([area, form], axis=1).reset_index()
print(df_out)
   building_id  area   form
0         1000    22  slope
1         1001    32  slope

This selects, for each building_id, the form whose summed area is largest.
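To see why the chain works, it may help to inspect the intermediate results one step at a time. The sketch below is the same pipeline unrolled; the names per_form and best are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1000, 12, 'slope'],
     [1000, 10, 'flat'],
     [1001, 10, 'slope'],
     [1001, 15, 'flat'],
     [1001, 7, 'slope']],
    index=[1, 2, 3, 4, 5],
    columns=['building_id', 'area', 'form'],
)

# Step 1: total area per (building_id, form) pair. The result is a Series
# with a two-level MultiIndex.
per_form = df.groupby(['building_id', 'form'])['area'].sum()
print(per_form)

# Step 2: within each building, idxmax returns the index label of the
# largest total, which here is a (building_id, form) tuple.
best = per_form.groupby(level=0).idxmax()
print(best)

# Step 3: .str[1] picks the second element of each tuple, i.e. the form.
form = best.str[1].rename('form')
print(form)
```

The tuple indexing in step 3 relies on the .str accessor also working element-wise on tuples, which is what the answer's code uses.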

If you don't need the form with the largest summed area, and simply want the form of the single largest surface, the solution simplifies.

g = df.groupby('building_id')['area']
area = g.sum()
# idxmax returns index labels, so select rows with .loc rather than .iloc
form = (df.loc[g.idxmax()]
          .set_index('building_id')['form'])

df_out = pd.concat([area, form], axis=1).reset_index()
print(df_out)
   building_id  area   form
0         1000    22  slope
1         1001    32   flat

You can use sort_values and then take the last form per building in agg:

(df.groupby(['building_id','form'])['area']
   .sum()
   .sort_values()
   .reset_index(level=1)
   .groupby(level=0)
   .agg({'form':'last','area':'sum'}))

              form  area
building_id             
1000         slope    22
1001         slope    32
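The reason 'last' works is that the per-form totals are sorted in ascending order first, so within each building the last row is the form with the largest total. Here is a sketch of the same chain unrolled, with the intermediate name totals chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    [[1000, 12, 'slope'],
     [1000, 10, 'flat'],
     [1001, 10, 'slope'],
     [1001, 15, 'flat'],
     [1001, 7, 'slope']],
    index=[1, 2, 3, 4, 5],
    columns=['building_id', 'area', 'form'],
)

# Per-form totals, sorted ascending, with 'form' moved out of the index
# so that it becomes an ordinary column next to 'area'.
totals = (df.groupby(['building_id', 'form'])['area']
            .sum()
            .sort_values()
            .reset_index(level=1))
print(totals)

# Group on the remaining index level (building_id): the last form in each
# group has the largest total, and summing the totals gives the roof area.
df_out = totals.groupby(level=0).agg({'form': 'last', 'area': 'sum'})
print(df_out)
```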
