简体   繁体   中英

Pandas: aggregating multiple columns with multiple functions

Pandas in Python and Dplyr in R are both flexible data wrangling tools. For example, in R, with dplyr one can do the following;

custom_func <- function(col1, col2) length(col1) + length(col2)

ChickWeight %>% 
  group_by(Diet) %>% 
  summarise(m_weight = mean(weight), 
            var_time = var(Time), 
            covar = cov(weight, Time),
            odd_stat = custom_func(weight, Time))

Notice how in one statement;

  • I can aggregate over multiple columns in one line.
  • I can apply different functions over these multiple columns in one line.
  • I can use functions that take into account two columns.
  • I can throw in custom functions for any of these.
  • I can declare new column names for these aggregations.

Is such a pattern also possible in pandas? Note that I am interested in doing this in a short statement (so not creating three different dataframes and then joining them).

With pandas groupby.apply() you can run multiple functions in a groupby aggregation. Please note for statistical functions you would need scipy installed. For custom functions will need to run an aggregate like sum() for groupwise data:

def customfct(x,y):
    data = x / y
    return data.mean()

def f(row):  
    row['m_weight'] = row['weight'].mean()
    row['var_time'] = row['Time'].var()
    row['cov'] = row['weight'].cov(row['Time'])
    row['odd_stat'] = customfct(row['weight'], row['Time'])
    return row

aggdf = df.groupby('Diet').apply(f)

Since version, 0.16.0, pandas has offered the assign() method. I'm not sure exactly how it compares to pandas-ply as mentioned by @akrun, but it is part of pandas proper. I think that would handle all of your needs except maybe the groupby part (or possibly I just don't know how to combine with groupby).

Documentation for pandas assign

Inspired by dplyr's mutate verb, DataFrame has a new assign() method. The function signature for assign is simply **kwargs. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a Series or NumPy array), or a function of one argument to be called on the DataFrame. The new values are inserted, and the entire DataFrame (with all original and new columns) is returned.

Looking at your example code, I'm not sure how to do the groupby, but you certainly can do this sort of thing. (Actually, I may post a followup SO question if I can't figure it out.)

df.assign( mean_weight = lambda x: x['weight'].mean(),
           var_time    = lambda x: x['time'].var(),
           covar       = lambda x: np.cov(x.t1,x.t2)[0,0] )

Or, can do it like this:

df.assign( mean_weight = df['weight'].mean(),
           var_time    = df['time'].var(),
           covar       = np.cov(df['t1'],df['t2'])[0,0] )

Alternatively, you could include the groupby like this (mixing the alternate ways below):

df.assign( mean_weight = df['weight'].groupby(df.diet).transform('mean'),
           var_time    = lambda x: x['time'].groupby(x['diet']).transform('mean') )

But you have to do a groupby 4x, not 1x, so that's not real satisfying as a solution...

I'll play around with that syntax a little more and see if I can get it closer to your example (you may want to provide sample data for a better answer).

Alternatively you can do standard groupby techniques to get your answer, but I think it takes multiple statements (can't just do one long chained line) because of the complexity -- some of your assignments can be combined with groupby's agg but I don't think the user function can.

I am bringing datar here for you to translate your R code into python:

>>> from pipda import register_func
>>> from datar.all import f, length, group_by, summarise, mean, var, cov
>>> from datar.datasets import ChickWeight
>>> 
>>> @register_func(None)
... def custom_func(col1, col2):
...     return length(col1) + length(col2)
... 
>>> ChickWeight >> \
...     group_by(f.Diet) >> \
...     summarise(
...         m_weight=mean(f.weight),
...         var_time = var(f.Time),
...         covar = cov(f.weight, f.Time),
...         odd_stat = custom_func(f.weight, f.Time)
...     )
     Diet    m_weight   var_time       covar  odd_stat
  <int64>   <float64>  <float64>   <float64>   <int64>
0       1  102.645455  45.422397  312.189871       440
1       2  122.616667  45.909722  398.564426       240
2       3  142.950000  45.909722  528.827731       240
3       4  135.262712  45.100618  441.868535       236

I am the author of the package. Feel free to submit issues if you have any questions.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM