Pandas Group Weighted Average of Multiple Columns

Say I have the following dataframe:

>>> import numpy as np
>>> import pandas as pd
>>> df=pd.DataFrame({'category':['a','a','b','b'],
... 'var1':np.random.randint(0,100,4),
... 'var2':np.random.randint(0,100,4),
... 'weights':np.random.randint(0,10,4)})
>>> df
  category  var1  var2  weights
0        a    37    36        7
1        a    47    20        1
2        b    33     7        6
3        b    16     6        8

I can calculate the weighted average of 'var1' like this:

>>> Grouped=df.groupby('category')
>>> GetWeightAvg=lambda g: np.average(g['var1'], weights=g['weights'])
>>> Grouped.apply(GetWeightAvg)
category
a    38.250000
b    23.285714
dtype: float64
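
For reference, np.average with weights computes sum(w * x) / sum(w). The sketch below checks the group 'a' value by hand, assuming the values printed in the sample frame above (the names var1, weights, manual and via_numpy are only illustrative):

```python
import numpy as np

# Values for category 'a', copied from the sample frame printed above.
var1 = np.array([37, 47])
weights = np.array([7, 1])

# Weighted average written out: (37*7 + 47*1) / (7 + 1)
manual = (var1 * weights).sum() / weights.sum()

# The same quantity via NumPy.
via_numpy = np.average(var1, weights=weights)

print(manual, via_numpy)
```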

However, I am wondering whether I can write the function so that the column (or columns) to average is specified when I apply it to the grouped object, rather than having 'var1' hard-coded in the function.

For example, I can get an unweighted average of both columns like this:

>>> Grouped[['var1','var2']].mean()
          var1  var2
category            
a         42.0  28.0
b         24.5   6.5

Is there a parallel way to do that with weighted averages?

With g = df.groupby('category'), you can apply a function that returns both weighted averages:

In [11]: g.apply(lambda x: pd.Series(np.average(x[["var1", "var2"]], weights=x["weights"], axis=0), ["var1", "var2"]))
Out[11]:
               var1       var2
category
a         38.250000  34.000000
b         23.285714   6.428571

You could write this slightly more cleanly as a function:

In [21]: def weighted(x, cols, w="weights"):
             return pd.Series(np.average(x[cols], weights=x[w], axis=0), cols)

In [22]: g.apply(weighted, ["var1", "var2"])
Out[22]:
               var1       var2
category
a         38.250000  34.000000
b         23.285714   6.428571
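
As a self-contained sketch of the same approach, the script below rebuilds the sample frame from the values printed in the question (the original used random numbers, so they are typed in here as an assumption) and selects the needed columns on the groupby before calling apply. The column selection is an addition, not part of the answer above; it keeps the grouping column out of the frames passed to weighted.

```python
import numpy as np
import pandas as pd

# Fixed copy of the sample frame printed in the question
# (the original was generated with random values).
df = pd.DataFrame({
    "category": ["a", "a", "b", "b"],
    "var1": [37, 47, 33, 16],
    "var2": [36, 20, 7, 6],
    "weights": [7, 1, 6, 8],
})

def weighted(x, cols, w="weights"):
    # Weighted mean of each column in `cols`, using column `w` as the weights.
    return pd.Series(np.average(x[cols], weights=x[w], axis=0), cols)

# Select only the columns the function needs before applying it.
g = df.groupby("category")[["var1", "var2", "weights"]]
result = g.apply(weighted, ["var1", "var2"])
print(result)
```

The extra positional argument to apply is forwarded to weighted as cols, so the columns to average are chosen at call time.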

Following up on Andy's solution, I wanted to use one of the index levels of a MultiIndex as my weights.

import numpy as np
import pandas as pd

np.random.seed(1)
arrays = [list('AAABBB'), [0.01,0.02,0.03,0.07,0.09,0.11]]
tups = list(zip(*arrays))
x = pd.MultiIndex.from_tuples(tups)
df = pd.DataFrame(index=x,data= np.random.randint(10,100,(6,6)),columns = list('STUVWX'))
df.index.names = ['bin','prob']

             S   T   U   V   W   X
bin prob                          
A   0.0100  47  22  82  19  85  15
    0.0200  89  74  26  11  86  81
    0.0300  16  35  60  30  28  94
B   0.0700  21  38  39  24  60  78
    0.0900  97  97  96  23  19  17
    0.1100  73  71  32  67  11  10

Adapting the function to use one of the index levels as the weights:

def weighted(x, w="weights"):
    return pd.Series(np.average(x, weights=x.index.get_level_values(w), axis=0),index= x.columns)

and calling

df.groupby(level=['bin']).apply(weighted,  "prob")

which gives:

          S       T       U       V       W       X
bin                                                
A   45.5000 45.8333 52.3333 21.8333 56.8333 76.5000
B   67.5185 71.1111 55.1481 41.1852 26.3704 29.9630
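
A runnable sketch of the index-level variant is below. It assumes the S and T values from the table above, typed in by hand so the result does not depend on the random seed, and it drops the default for w since the level name is always passed explicitly here.

```python
import numpy as np
import pandas as pd

# Two columns from the table above, entered by hand (assumed values)
# so the example is deterministic.
index = pd.MultiIndex.from_arrays(
    [list("AAABBB"), [0.01, 0.02, 0.03, 0.07, 0.09, 0.11]],
    names=["bin", "prob"],
)
df = pd.DataFrame(
    {"S": [47, 89, 16, 21, 97, 73], "T": [22, 74, 35, 38, 97, 71]},
    index=index,
)

def weighted(x, w):
    # Use the values of index level `w` as the weights for every column.
    weights = x.index.get_level_values(w)
    return pd.Series(np.average(x, weights=weights, axis=0), index=x.columns)

result = df.groupby(level="bin").apply(weighted, "prob")
print(result)
```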

