Pythonic way to apply different aggregate functions to different columns of a pandas dataframe? And to name the columns efficiently?

Question

My issue

In SQL, it is very easy to apply different aggregate functions to different columns, e.g.:

select item, sum(a) as [sum of a], avg(b) as [avg of b], min(c) as [min of c]

In pandas, not so much. The previously suggested solution, shown below, has been deprecated:

df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
                                "unemp": {"mean_unemp": "mean"}})
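For reference, pandas 0.25 and later offer named aggregation, which keeps each output name next to its calculation, much like the SQL above. The following is a minimal sketch, assuming a made-up frame with `qtr`, `realgdp` and `unemp` columns; the values are invented for illustration and only the column names come from the snippet above.

```python
import pandas as pd

# Hypothetical data; only the column names come from the deprecated snippet.
gdp = pd.DataFrame({
    "qtr": [1, 1, 2, 2],
    "realgdp": [2710.0, 2778.0, 2775.0, 2785.0],
    "unemp": [5.8, 5.1, 5.3, 5.6],
})

# Each keyword is an output column name;
# each value is an (input column, aggregation) pair.
summary = gdp.groupby("qtr").agg(
    mean_gdp=("realgdp", "mean"),
    std_gdp=("realgdp", "std"),
    mean_unemp=("unemp", "mean"),
)
print(summary)
```

Output names that are not valid Python identifiers (for example names containing spaces) can be supplied by building a dictionary and unpacking it, as in `.agg(**{"sum of a": ("a", "sum")})`.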

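My solution

The least bad solution I have managed to find, mostly based on other Stack Overflow questions I can no longer find, is something like the toy example at the bottom, where I:

define a function with all the calculations I need
calculate each column separately, then put them together in a dataframe
apply the function as a lambda function

What I would like to improve: naming columns

This solution works great if you have only 2 or 3 columns to create.

However, if you have many columns to calculate, naming them becomes tedious and very error-prone: I have to create a list with the column names, and pass that list as the index of the dataframe created by the function.

Now imagine I already have 12 columns and need to add 3 more; chances are I will cause some confusion and add the corresponding column names in the wrong order.

Compare this with SQL, where you assign the name right after defining the calculation: the difference is night and day.

Is there a better way? For example, a way to assign the column name at the same time as I define the calculation?

Why this is not a duplicate

This question focuses on how to name the columns so as to minimise the risk of errors and confusion. There are somewhat similar questions whose answers rely on now-deprecated pandas functionality or name the columns automatically, but, as far as I know, no question focuses on this specific point.

Toy example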

import pandas as pd
import numpy as np

df = pd.DataFrame(columns =['a','b','c','d'], data = np.random.rand(300,4))
df['city'] = np.repeat(['London','New York','Buenos Aires'], 100)

def func(x, df):
    # func() is called from the lambda below; x is the sub-DataFrame for one city, df is the entire table
    b1 = x['a'].sum()
    b2 = x['a'].sum() / df['a'].sum() if df['a'].sum() !=0 else np.nan
    
    b3 = x['b'].mean()
    
    b4 = ( x['a'] * x['b']).sum() / x['b'].sum() if x['b'].sum() >0 else np.nan
    
    b5 = x['c'].sum()
    b6 = x['d'].sum()
    
    
    cols = ['sum of a',
            '% of a',
            'avg of b',
            'weighted avg of a, weighted by b', 
            'sum of c',
            'sum of d']
    

    return pd.Series( [b1, b2, b3, b4, b5, b6] , index = cols ) 

out = df.groupby('city').apply(lambda x: func(x,df))

Answer 1

I'm no expert, but I usually use a dictionary like this:

import pandas as pd
import numpy as np

df = pd.DataFrame(columns =['a','b','c','d'], data = np.random.rand(300,4))
df['city'] = np.repeat(['London','New York','Buenos Aires'], 100)

def func(x, df):
    # func() is called from the lambda below; x is the sub-DataFrame for one city, df is the entire table
    s_dict = {}

    s_dict['sum of a'] = x['a'].sum()
    s_dict['% of a'] = x['a'].sum() / df['a'].sum() if df['a'].sum() !=0 else np.nan
    
    s_dict['avg of b'] = x['b'].mean()
    
    s_dict['weighted avg of a, weighted by b'] = ( x['a'] * x['b']).sum() / x['b'].sum() if x['b'].sum() >0 else np.nan
    
    s_dict['sum of c'] = x['c'].sum()
    s_dict['sum of d'] = x['d'].sum()
    
    return pd.Series( s_dict  ) 

out = df.groupby('city').apply(lambda x: func(x,df))
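If `apply` turns out to be slow on larger frames, one possible alternative is to precompute the row-level product, use named aggregation, and derive the ratios afterwards. This is only a sketch and not part of the answer above; the helper column `a_times_b`, the intermediate output names `sum of a*b` and `sum of b`, and the fixed random seed are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, assumed here for reproducibility
df = pd.DataFrame(rng.random((300, 4)), columns=["a", "b", "c", "d"])
df["city"] = np.repeat(["London", "New York", "Buenos Aires"], 100)

# Names and calculations sit side by side. The dict is unpacked because
# the output names contain spaces and cannot be written as keywords.
out = (
    df.assign(a_times_b=df["a"] * df["b"])  # hypothetical helper column
    .groupby("city")
    .agg(**{
        "sum of a": ("a", "sum"),
        "avg of b": ("b", "mean"),
        "sum of a*b": ("a_times_b", "sum"),  # intermediate, dropped below
        "sum of b": ("b", "sum"),            # intermediate, dropped below
        "sum of c": ("c", "sum"),
        "sum of d": ("d", "sum"),
    })
)

# Ratios are computed on the aggregated frame, with the same guards as func().
total_a = df["a"].sum()
out["% of a"] = out["sum of a"] / total_a if total_a != 0 else np.nan
positive_b = out["sum of b"].where(out["sum of b"] > 0)  # NaN where sum of b <= 0
out["weighted avg of a, weighted by b"] = out["sum of a*b"] / positive_b
out = out.drop(columns=["sum of a*b", "sum of b"])
print(out)
```

The trade-off is that the derived columns are named in a second step, so this version is less compact than the dictionary approach, but every name still appears on the same line as its calculation.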

Answer 1
1 accepted 2021-02-14 15:38:21