Python Pandas: efficiently aggregating different functions on different columns and combining the resulting columns together

So far my approach to the task described in the title is quite straightforward, yet it seems somewhat inefficient/unpythonic. An example of what I usually do is as follows:


The original Pandas DataFrame df has 6 columns: 'open', 'high', 'low', 'close', 'volume', 'new dt'

import pandas as pd

df_gb = df.groupby('new dt')

arr_high = df_gb['high'].max()
arr_low = df_gb['low'].min()
arr_open = df_gb['open'].first()
arr_close = df_gb['close'].last()
arr_volume = df_gb['volume'].sum()

df2 = pd.concat([arr_open,
                 arr_high,
                 arr_low,
                 arr_close,
                 arr_volume], axis='columns')

It may seem efficient enough at first glance, but when I have 20 functions to apply to 20 different columns, it quickly becomes unwieldy and unpythonic.

Is there any way to make it more efficient/pythonic? Thank you in advance.

If you have 20 different functions, you will have to match columns with functions one way or another. The term "pythonic" is subjective, so this may not be the definitive answer, but it is potentially useful. In my opinion your approach is already pythonic, and it spells out clearly what is happening.

# this works as long as the column order matches the order of the functions below;
# you may have to change the ordering here
columns_to_agg = [column for column in df.columns if column != 'new dt']

# if the functions are all methods of pandas.Series, just use strings
agg_methods = ['first', 'max', 'min', 'last', 'sum']

# construct a dictionary and use it as the aggregator
agg_dict = dict(zip(columns_to_agg, agg_methods))
df_gb = df.groupby('new dt', as_index=False).agg(agg_dict)
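To make the dictionary approach concrete, here is a minimal sketch that writes the column-to-function mapping out explicitly instead of relying on column order. The sample values in df are hypothetical, since the original post does not show the data; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    'open':   [1.0, 2.0, 3.0, 4.0],
    'high':   [5.0, 6.0, 7.0, 8.0],
    'low':    [0.5, 0.4, 0.3, 0.2],
    'close':  [2.0, 3.0, 4.0, 5.0],
    'volume': [10, 20, 30, 40],
    'new dt': ['d1', 'd1', 'd2', 'd2'],
})

# Explicit column -> function mapping, so nothing depends on column order.
agg_dict = {
    'open': 'first',
    'high': 'max',
    'low': 'min',
    'close': 'last',
    'volume': 'sum',
}

# One groupby and one agg call replace the five separate Series and pd.concat.
df2 = df.groupby('new dt').agg(agg_dict)
print(df2)
```

With 20 columns the dictionary simply grows to 20 entries, which keeps each column visibly paired with its function.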

If you have a custom function that you want to apply to, say, volume, you could do:


def custom_f(series):
    # count the non-null values in the series
    return pd.notnull(series).sum()

agg_methods = ['first', 'max', 'min', 'last', custom_f]

Everything else stays the same. You could even apply both sum and custom_f to your volume column:

agg_methods = ['first', 'max', 'min', 'last', ['sum', custom_f]]
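As a rough illustration of what the list form produces, the sketch below applies both 'sum' and custom_f to volume. The data is hypothetical, and the final flattening step is an optional extra that is not part of the answer above. Once any entry in the dictionary is a list, the result has two-level column labels of the form (column, function name).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing volume value.
df = pd.DataFrame({
    'close':  [2.0, 3.0, 4.0, 5.0],
    'volume': [10.0, np.nan, 30.0, 40.0],
    'new dt': ['d1', 'd1', 'd2', 'd2'],
})

def custom_f(series):
    # Count the non-null values in each group.
    return pd.notnull(series).sum()

result = df.groupby('new dt').agg({
    'close': 'last',
    'volume': ['sum', custom_f],
})

# The columns are (column, function name) pairs at this point.
print(result.columns.tolist())

# Optional: flatten the two-level labels into single strings.
result.columns = ['_'.join(pair) for pair in result.columns]
print(result)
```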

In [3]: import pandas as pd
In [4]: import numpy as np
In [5]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9],
   ...:                    [np.nan, np.nan, np.nan]], columns=['A', 'B', 'C'])

In [6]: df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})                    
Out[6]: 
        A    B
max   NaN  8.0
min   1.0  2.0
sum  12.0  NaN

To get the functions as columns instead, transpose the result:

In [11]: df.agg({'A' : ['sum'], 'B' : ['min', 'max']}).T
Out[11]: 
   max  min   sum
A  NaN  NaN  12.0
B  8.0  2.0   NaN

To use custom functions, you can do this:

In [12]: df.agg({'A' : ['sum', lambda x: x.mean()], 'B' : ['min', 'max']}).T
Out[12]: 
   <lambda>  max  min   sum
A       4.0  NaN  NaN  12.0
B       NaN  8.0  2.0   NaN
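As an alternative that does not appear in the answers above, named aggregation (available in pandas 0.25 and later) lets each output column be named directly, which also avoids the <lambda> label. This is only a sketch applied to the question's groupby case; the sample values and the mean_close column are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    'open':   [1.0, 2.0, 3.0, 4.0],
    'high':   [5.0, 6.0, 7.0, 8.0],
    'low':    [0.5, 0.4, 0.3, 0.2],
    'close':  [2.0, 3.0, 4.0, 5.0],
    'volume': [10, 20, 30, 40],
    'new dt': ['d1', 'd1', 'd2', 'd2'],
})

# Each keyword is an output column name; each value is (input column, function).
df2 = df.groupby('new dt').agg(
    open=('open', 'first'),
    high=('high', 'max'),
    low=('low', 'min'),
    close=('close', 'last'),
    volume=('volume', 'sum'),
    mean_close=('close', lambda s: s.mean()),  # hypothetical custom aggregation
)
print(df2)
```

The result has flat column names, so no transposing or flattening is needed afterwards.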
