Pythonic way to apply different aggregate functions to different columns of a pandas dataframe? And to name the columns efficiently?

Question

My issue

In SQL it is very easy to apply different aggregate functions to different columns, e.g.:

select item, sum(a) as [sum of a], avg(b) as [avg of b], min(c) as [min of c]

In pandas, it is not so easy. The solution provided here has since been deprecated:

df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
                                "unemp": {"mean_unemp": "mean"}})

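For reference, the nested-dict renaming above was replaced by named aggregation. The sketch below assumes pandas 0.25 or later; the data values are made up for illustration and only the column names come from the snippet above.

```python
import pandas as pd

# Hypothetical data; only the column names mirror the deprecated snippet.
df = pd.DataFrame({
    "qtr": [1, 1, 2, 2],
    "realgdp": [100.0, 110.0, 120.0, 140.0],
    "unemp": [5.0, 6.0, 4.0, 5.0],
})

# Named aggregation: output_name=(input_column, aggregate_function)
out = df.groupby("qtr").agg(
    mean_gdp=("realgdp", "mean"),
    std_gdp=("realgdp", "std"),
    mean_unemp=("unemp", "mean"),
)
print(out)
```

Each output name sits next to its calculation, much like `sum(a) as [sum of a]` in SQL.
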
My solution

The least worst solution I have managed to find, mostly based on other Stack Overflow questions I can no longer find, is something like the toy example at the bottom, where I:

define a function with all the calculations I need
calculate each column separately, then put them together in a dataframe
apply the function to each group via a lambda function

What I would like to improve: naming columns

If you only have 2 or 3 columns to create, this solution is great.

However, if you have many columns to calculate, naming them becomes fiddly and very error-prone: I have to create a list with the column names, and pass that list as the index of the Series returned by the function.

Now imagine I already have 12 columns and need to add 3 more; there's a good chance I will get confused and add the corresponding column names in the wrong order.

Compare this with SQL, where you assign the name right after defining the calculation: the difference is night and day.

Is there a better way? E.g. a way to assign the name of the column at the same time as I define the calculation?

Why this is not a duplicate question

The focus of the question is specifically on how to name the columns so as to minimise the risk of errors and confusion. There are somewhat similar questions based on now-deprecated functionalities of pandas, or with answers which provide an automatic naming of the columns but, to my knowledge, no question which focuses on this very point.

Toy example

import pandas as pd
import numpy as np

df = pd.DataFrame(columns =['a','b','c','d'], data = np.random.rand(300,4))
df['city'] = np.repeat(['London','New York','Buenos Aires'], 100)

def func(x, df):
    # func() gets called within a lambda function; x is the sub-DataFrame for one group, df is the entire table
    b1 = x['a'].sum()
    b2 = x['a'].sum() / df['a'].sum() if df['a'].sum() !=0 else np.nan
    
    b3 = x['b'].mean()
    
    b4 = ( x['a'] * x['b']).sum() / x['b'].sum() if x['b'].sum() >0 else np.nan
    
    b5 = x['c'].sum()
    b6 = x['d'].sum()
    
    
    cols = ['sum of a',
            '% of a',
            'avg of b',
            'weighted avg of a, weighted by b', 
            'sum of c',
            'sum of d']
    

    return pd.Series( [b1, b2, b3, b4, b5, b6] , index = cols ) 

out = df.groupby('city').apply(lambda x: func(x,df))

Answer 1

I am not an expert, but what I usually do is use a dictionary like this:

import pandas as pd
import numpy as np

df = pd.DataFrame(columns =['a','b','c','d'], data = np.random.rand(300,4))
df['city'] = np.repeat(['London','New York','Buenos Aires'], 100)

def func(x, df):
    # func() gets called within a lambda function; x is the sub-DataFrame for one group, df is the entire table
    s_dict = {}

    s_dict['sum of a'] = x['a'].sum()
    s_dict['% of a'] = x['a'].sum() / df['a'].sum() if df['a'].sum() !=0 else np.nan
    
    s_dict['avg of b'] = x['b'].mean()
    
    s_dict['weighted avg of a, weighted by b'] = ( x['a'] * x['b']).sum() / x['b'].sum() if x['b'].sum() >0 else np.nan
    
    s_dict['sum of c'] = x['c'].sum()
    s_dict['sum of d'] = x['d'].sum()
    
    return pd.Series( s_dict  ) 

out = df.groupby('city').apply(lambda x: func(x,df))
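
As a possible alternative to the dictionary approach, named aggregation (assuming pandas 0.25 or later) also keeps each name next to its calculation. Names containing spaces can be passed by unpacking a dict. The helper names `a_times_b`, `_sum_ab` and `_sum_b` are made up for this sketch, and the ratio columns are derived after the groupby rather than inside it:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((300, 4)), columns=["a", "b", "c", "d"])
df["city"] = np.repeat(["London", "New York", "Buenos Aires"], 100)

# Helper column so the weighted average becomes a ratio of two sums.
tmp = df.assign(a_times_b=df["a"] * df["b"])

# Each output name is paired with (input_column, aggregate_function).
out = tmp.groupby("city").agg(**{
    "sum of a": ("a", "sum"),
    "avg of b": ("b", "mean"),
    "sum of c": ("c", "sum"),
    "sum of d": ("d", "sum"),
    "_sum_ab": ("a_times_b", "sum"),  # temporary, dropped below
    "_sum_b": ("b", "sum"),           # temporary, dropped below
})

# Ratios are computed from the aggregated sums; NaN where the denominator is not usable.
total_a = df["a"].sum()
out["% of a"] = out["sum of a"] / total_a if total_a != 0 else np.nan
out["weighted avg of a, weighted by b"] = (
    out["_sum_ab"] / out["_sum_b"].where(out["_sum_b"] > 0)
)
out = out.drop(columns=["_sum_ab", "_sum_b"])
print(out)
```

This avoids `apply` with a Python function per group, at the cost of two temporary columns. Whether it reads better than the dictionary version is a matter of taste.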

Answer 1
1 Accepted 2021-02-14 15:38:21
