
Pandas multiple index with multiple aggregate functions

Given this sample DataFrame:

+------+--------+------+-------+------+--------+
| NAME |  JOB   | YEAR | MONTH | DAYS | SALARY |
+------+--------+------+-------+------+--------+
| Bob  | Worker | 2013 |    12 |    3 |     17 |
| Mary | Employ | 2013 |    12 |    5 |     23 |
| Bob  | Worker | 2014 |     1 |   10 |    100 |
| Bob  | Worker | 2014 |     1 |   11 |    110 |
| Mary | Employ | 2014 |     1 |   15 |    200 |
| Bob  | Worker | 2014 |     2 |    8 |     80 |
| Mary | Employ | 2014 |     2 |    5 |    190 |
+------+--------+------+-------+------+--------+

Is there an easy way to obtain an output like this without manually creating each part of the pivot?

index=JOB,MAX(YEAR),NAME,SUM(DAYS)  
columns=MONTH  
values=SUM(SALARY)

                                +-----------+-------------+-------------+
                                |     MONTH |           1 |           2 |
    +--------+-----------+------+-----------+-------------+-------------+
    |  JOB   | MAX(YEAR) | NAME | SUM(DAYS) | SUM(SALARY) | SUM(SALARY) |
    +--------+-----------+------+-----------+-------------+-------------+
    | Employ |      2014 | Mary |        20 |         200 |         190 |
    | Worker |      2014 | Bob  |        29 |         210 |          80 |
    +--------+-----------+------+-----------+-------------+-------------+

Starting from:

In [179]: df
Out[179]: 
   NAME     JOB  YEAR  MONTH  DAYS  SALARY
0   Bob  Worker  2013     12     3      17
1  Mary  Employ  2013     12     5      23
2   Bob  Worker  2014      1    10     100
3   Bob  Worker  2014      1    11     110
4  Mary  Employ  2014      1    15     200
5   Bob  Worker  2014      2     8      80
6  Mary  Employ  2014      2     5     190
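
To reproduce the steps that follow, the frame can be rebuilt from the sample table in the question. This is a minimal sketch that assumes pandas is available and imported as pd; the values are copied from the table above.

```python
import pandas as pd

# Rebuild the sample data from the question, one list per column.
df = pd.DataFrame({
    'NAME':   ['Bob', 'Mary', 'Bob', 'Bob', 'Mary', 'Bob', 'Mary'],
    'JOB':    ['Worker', 'Employ', 'Worker', 'Worker', 'Employ', 'Worker', 'Employ'],
    'YEAR':   [2013, 2013, 2014, 2014, 2014, 2014, 2014],
    'MONTH':  [12, 12, 1, 1, 1, 2, 2],
    'DAYS':   [3, 5, 10, 11, 15, 8, 5],
    'SALARY': [17, 23, 100, 110, 200, 80, 190],
})
```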

we can get most of the data we want by summing DAYS and SALARY per (JOB, NAME, MONTH, YEAR) and then moving MONTH out of the index:

result = df.groupby(['JOB', 'NAME', 'MONTH', 'YEAR']).sum().reset_index(['MONTH'])

#                   MONTH  DAYS  SALARY
# JOB    NAME YEAR                     
# Employ Mary 2014      1    15     200
#             2014      2     5     190
#             2013     12     5      23
# Worker Bob  2014      1    21     210
#             2014      2     8      80
#             2013     12     3      17

To this we add the total days per (JOB, NAME, YEAR):

total_days = df.groupby(['JOB', 'NAME', 'YEAR'])[['DAYS']].sum()
total_days.columns = ['SUM(DAYS)']

#                   SUM(DAYS)
# JOB    NAME YEAR           
# Employ Mary 2013          5
#             2014         20
# Worker Bob  2013          3
#             2014         29

result = result.join(total_days)
del result['DAYS']
#                   MONTH  SALARY  SUM(DAYS)
# JOB    NAME YEAR                          
# Employ Mary 2013     12      23          5
#             2014      1     200         20
#             2014      2     190         20
# Worker Bob  2013     12      17          3
#             2014      1     210         29
#             2014      2      80         29

To select the rows associated with max(YEAR) for each (JOB, NAME), we compute

max_year = df.groupby(['JOB', 'NAME'])[['YEAR']].max()
max_year = max_year.set_index('YEAR', drop=False, append=True)

#                   YEAR
# JOB    NAME YEAR      
# Employ Mary 2014  2014
# Worker Bob  2014  2014

so the selection can be expressed as a left join:

result = max_year.join(result)
del result['YEAR']

#                   MONTH  SALARY  SUM(DAYS)
# JOB    NAME YEAR                          
# Employ Mary 2014      1     200         20
#             2014      2     190         20
# Worker Bob  2014      1     210         29
#             2014      2      80         29
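
As a cross-check on the join, the rows belonging to each group's latest year can also be picked directly from df with a boolean mask. This is an alternative sketch, not the approach used above; it assumes the sample data from the question, and latest and latest_rows are names introduced here for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'NAME':   ['Bob', 'Mary', 'Bob', 'Bob', 'Mary', 'Bob', 'Mary'],
    'JOB':    ['Worker', 'Employ', 'Worker', 'Worker', 'Employ', 'Worker', 'Employ'],
    'YEAR':   [2013, 2013, 2014, 2014, 2014, 2014, 2014],
    'MONTH':  [12, 12, 1, 1, 1, 2, 2],
    'DAYS':   [3, 5, 10, 11, 15, 8, 5],
    'SALARY': [17, 23, 100, 110, 200, 80, 190],
})

# Latest YEAR per (JOB, NAME), broadcast back onto every row of df.
latest = df.groupby(['JOB', 'NAME'])['YEAR'].transform('max')

# Keep only the rows that fall in their group's latest year.
latest_rows = df[df['YEAR'] == latest]
```

Because transform returns a Series aligned with df, the comparison needs no index manipulation. The join above achieves the same filtering on the already-aggregated result.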

Now we can move MONTH into a hierarchical column level like this:

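# Add SUM(DAYS) and MONTH to the index so both survive the reshape
result = result.set_index(['SUM(DAYS)', 'MONTH'], append=True)
# Pivot the MONTH index level into the columns
result = result.unstack('MONTH')
# Turn SUM(DAYS) back into an ordinary column
result = result.reset_index(['SUM(DAYS)'])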

which yields

                  SUM(DAYS)  SALARY     
MONTH                             1    2
JOB    NAME YEAR                        
Employ Mary 2014         20     200  190
Worker Bob  2014         29     210   80
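
For comparison, the whole pipeline can be condensed with transform and pivot_table. This is an alternative sketch rather than the method walked through above; it assumes the sample data from the question, and latest and table are names introduced here for illustration. The index order follows the layout requested in the question (JOB, YEAR, NAME, SUM(DAYS)).

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'NAME':   ['Bob', 'Mary', 'Bob', 'Bob', 'Mary', 'Bob', 'Mary'],
    'JOB':    ['Worker', 'Employ', 'Worker', 'Worker', 'Employ', 'Worker', 'Employ'],
    'YEAR':   [2013, 2013, 2014, 2014, 2014, 2014, 2014],
    'MONTH':  [12, 12, 1, 1, 1, 2, 2],
    'DAYS':   [3, 5, 10, 11, 15, 8, 5],
    'SALARY': [17, 23, 100, 110, 200, 80, 190],
})

# Keep only the rows from each (JOB, NAME) group's latest year.
is_latest = df['YEAR'] == df.groupby(['JOB', 'NAME'])['YEAR'].transform('max')
latest = df[is_latest].copy()

# Total days over those rows, repeated on every row of the group.
latest['SUM(DAYS)'] = (
    latest.groupby(['JOB', 'NAME', 'YEAR'])['DAYS'].transform('sum')
)

# One row per group, one summed SALARY column per MONTH.
table = latest.pivot_table(
    index=['JOB', 'YEAR', 'NAME', 'SUM(DAYS)'],
    columns='MONTH',
    values='SALARY',
    aggfunc='sum',
)
```

Here table has MONTH values 1 and 2 as its columns and should hold the same salary sums as the result above. Putting SUM(DAYS) in the pivot index works only because it is constant within each group.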
