Pythonic way to apply different aggregate functions to different columns of a pandas dataframe? And to name the columns efficiently?

My issue

In SQL it is very easy to apply different aggregate functions to different columns, e.g.:

select item, sum(a) as [sum of a], avg(b) as [avg of b], min(c) as [min of c]

In pandas, not so much. The solution provided here has since been deprecated:

df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
                                "unemp": {"mean_unemp": "mean"}})

My solution

The least worst solution I have managed to find, mostly based on other Stack Overflow questions I can no longer find, is something like the toy example at the bottom, where I:

  • define a function with all the calculations I need
  • calculate each column separately, then put the results together in a single object
  • apply the function to each group through a lambda

What I would like to improve: naming columns

If you only have 2 or 3 columns to create, this solution is great.

However, if you have many columns to calculate, naming them becomes fiddly and very error-prone: I have to create a list with the column names, and pass that list as the index of the Series returned by the function.

Now imagine I already have 12 columns and need to add 3 more; there is a real chance I will get confused and add the corresponding column names in the wrong order.

Compare this with SQL, where you assign the name right after defining the calculation: the difference is night and day.

Is there a better way? E.g. a way to assign the name of the column at the same time as I define the calculation?

Why this is not a duplicate question

The focus of the question is specifically on how to name the columns so as to minimise the risk of errors and confusion. There are somewhat similar questions based on now-deprecated pandas functionality, or with answers that name the columns automatically but, to my knowledge, no question that focuses on this very point.

Toy example

import pandas as pd
import numpy as np

df = pd.DataFrame(columns =['a','b','c','d'], data = np.random.rand(300,4))
df['city'] = np.repeat(['London','New York','Buenos Aires'], 100)

def func(x, df):
    # func() is called once per group through the lambda; x is that group's sub-DataFrame, df is the entire table
    b1 = x['a'].sum()
    b2 = x['a'].sum() / df['a'].sum() if df['a'].sum() !=0 else np.nan
    
    b3 = x['b'].mean()
    
    b4 = ( x['a'] * x['b']).sum() / x['b'].sum() if x['b'].sum() >0 else np.nan
    
    b5 = x['c'].sum()
    b6 = x['d'].sum()
    
    
    cols = ['sum of a',
            '% of a',
            'avg of b',
            'weighted avg of a, weighted by b', 
            'sum of c',
            'sum of d']
    

    return pd.Series( [b1, b2, b3, b4, b5, b6] , index = cols ) 

out = df.groupby('city').apply(lambda x: func(x,df))

I am not an expert, but what I usually do is use a dictionary like this:

import pandas as pd
import numpy as np

df = pd.DataFrame(columns =['a','b','c','d'], data = np.random.rand(300,4))
df['city'] = np.repeat(['London','New York','Buenos Aires'], 100)

def func(x, df):
    # func() is called once per group through the lambda; x is that group's sub-DataFrame, df is the entire table
    s_dict = {}

    s_dict['sum of a'] = x['a'].sum()
    s_dict['% of a'] = x['a'].sum() / df['a'].sum() if df['a'].sum() !=0 else np.nan
    
    s_dict['avg of b'] = x['b'].mean()
    
    s_dict['weighted avg of a, weighted by b'] = ( x['a'] * x['b']).sum() / x['b'].sum() if x['b'].sum() >0 else np.nan
    
    s_dict['sum of c'] = x['c'].sum()
    s_dict['sum of d'] = x['d'].sum()
    
    return pd.Series( s_dict  ) 

out = df.groupby('city').apply(lambda x: func(x,df))
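
Another option, offered as a sketch rather than a definitive answer, is pandas named aggregation (available from pandas 0.25 onward), where each output name is written right next to its (column, function) pair, much like SQL's "as". Because the names here contain spaces, the dict is unpacked with **. The helper column a_times_b and the intermediate outputs 'sum of a*b' and 'sum of b' are hypothetical names introduced only for this illustration. The ratios that depend on more than one aggregate are computed after the groupby, and unlike the toy example this sketch does not guard against zero denominators.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
df = pd.DataFrame(rng.random((300, 4)), columns=['a', 'b', 'c', 'd'])
df['city'] = np.repeat(['London', 'New York', 'Buenos Aires'], 100)

out = (
    df.assign(a_times_b=df['a'] * df['b'])  # helper column (hypothetical name)
      .groupby('city')
      .agg(**{
          # output name: (input column, aggregation)
          'sum of a': ('a', 'sum'),
          'avg of b': ('b', 'mean'),
          'sum of c': ('c', 'sum'),
          'sum of d': ('d', 'sum'),
          'sum of a*b': ('a_times_b', 'sum'),  # intermediate, dropped below
          'sum of b': ('b', 'sum'),            # intermediate, dropped below
      })
)

# Columns that combine several aggregates are derived after the groupby.
# No zero-denominator guard here, unlike the toy example.
out['% of a'] = out['sum of a'] / df['a'].sum()
out['weighted avg of a, weighted by b'] = out['sum of a*b'] / out['sum of b']
out = out.drop(columns=['sum of a*b', 'sum of b'])
```

Adding a new column then means adding one line that carries both the name and the calculation, so the two cannot drift out of order.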
