
Pandas dataframe summing with multiple groupby

I have the following dataframe:

import pandas as pd

df2 = pd.DataFrame({'season': [1, 1, 1, 2, 2, 2, 3, 3],
                    'value': [-2, 3, 1, 5, 8, 6, 7, 5],
                    'avail': [3, 3, 3, 8, 8, 4, 25, 25],
                    'test2': [4, 5, 7, 8, 9, 10, 11, 12]},
                   index=['2020', '2020', '2020', '2020', '2020', '2021', '2021', '2021'])
df2.index = pd.to_datetime(df2.index)
df2.index = df2.index.year  # keep only the year as the index
print(df2)

      avail  season  test2  value
2020      3       1      4     -2
2020      3       1      5      3
2020      3       1      7      1
2020      8       2      8      5
2020      8       2      9      8
2021      4       2     10      6
2021     25       3     11      7
2021     25       3     12      5

I would like to efficiently compute, for each year, the sum of the 'avail' column. The difficulty is that only one 'avail' value should be counted per season. For instance, for the year 2020 I want 3 + 8 = 11.

Expected result (column 'sum_avail'):

        avail  season  test2  value   sum_avail
2020      3       1      4     -2        11
2020      3       1      5      3        11
2020      3       1      7      1        11 
2020      8       2      8      5        11
2020      8       2      9      8        11
2021      4       2     10      6        29
2021     25       3     11      7        29
2021     25       3     12      5        29  

If I understand correctly, you can use transform with set:

df2.groupby(level=0).avail.transform(lambda x : sum(set(x)))
Out[220]: 
2020    11
2020    11
2020    11
2020    11
2020    11
2021    29
2021    29
2021    29
Name: avail, dtype: int64
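
One caveat worth keeping in mind: set(x) removes duplicates by value, not by season. The sketch below uses a small hypothetical frame (demo, which is not part of the question's data) in which two seasons of the same year happen to share the same avail, and compares the set-based sum with a sum taken over one value per (year, season) pair.

```python
import pandas as pd

# Hypothetical data: seasons 1 and 2 of 2020 both have avail == 3.
demo = pd.DataFrame(
    {"season": [1, 1, 2, 2], "avail": [3, 3, 3, 3]},
    index=[2020, 2020, 2020, 2020],
)

# Deduplicating by value collapses both seasons into a single 3.
by_value = demo.groupby(level=0)["avail"].transform(lambda x: sum(set(x)))

# Deduplicating by (year, season) keeps one avail per season: 3 + 3.
per_season = demo.groupby([demo.index, "season"])["avail"].first()
by_season = per_season.groupby(level=0).sum()

print(by_value.iloc[0])     # 3
print(by_season.loc[2020])  # 6
```

With the question's data the two results coincide, because the avail values differ between seasons within each year.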

You'll need groupby + transform + np.unique:

import numpy as np

# Sum the distinct avail values within each year (index level 0).
df2['sum_avail'] = (
     df2.groupby(level=0).avail.transform(lambda x: np.unique(x).sum()))

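Or, taking the unique values per year and mapping their sums back onto the index (recent pandas versions do not accept transform('unique')):

df2['sum_avail'] = df2.index.map(df2.groupby(level=0).avail.unique().apply(sum))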

df2

      avail  season  test2  value  sum_avail
2020      3       1      4     -2         11
2020      3       1      5      3         11
2020      3       1      7      1         11
2020      8       2      8      5         11
2020      8       2      9      8         11
2021      4       2     10      6         29
2021     25       3     11      7         29
2021     25       3     12      5         29

Here's an approach which takes the first value in each index/season pair and then sums them up:

# Series.sum(level=0) was removed in pandas 2.0; group on the first index level instead.
res = df2.groupby([df2.index, 'season'])['avail'].first().groupby(level=0).sum()
df2.join(res.rename('sum_avail'))

      season  value  avail  test2  sum_avail
2020       1     -2      3      4         11
2020       1      3      3      5         11
2020       1      1      3      7         11
2020       2      5      8      8         11
2020       2      8      8      9         11
2021       2      6      4     10         29
2021       3      7     25     11         29
2021       3      5     25     12         29
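
A variant of the same idea, as a minimal self-contained sketch: keep one row per (year, season) pair with drop_duplicates, sum avail per year, and map the totals back onto the index. The names year, pairs and totals are introduced here for illustration only and do not appear in the answers above.

```python
import pandas as pd

df2 = pd.DataFrame(
    {
        "season": [1, 1, 1, 2, 2, 2, 3, 3],
        "value": [-2, 3, 1, 5, 8, 6, 7, 5],
        "avail": [3, 3, 3, 8, 8, 4, 25, 25],
        "test2": [4, 5, 7, 8, 9, 10, 11, 12],
    },
    index=[2020, 2020, 2020, 2020, 2020, 2021, 2021, 2021],
)

# Turn the year index into a column and keep the first row of each
# (year, season) pair, so every season contributes exactly one avail.
pairs = df2.rename_axis("year").reset_index().drop_duplicates(["year", "season"])

# Total avail per year: 2020 -> 3 + 8, 2021 -> 4 + 25.
totals = pairs.groupby("year")["avail"].sum()

# Broadcast the yearly totals back to every row through the index.
df2["sum_avail"] = df2.index.map(totals)

print(df2["sum_avail"].tolist())  # [11, 11, 11, 11, 11, 29, 29, 29]
```

Like the first() approach, this keys on the season rather than on the avail value, so it still works if two seasons in a year share the same avail.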
