
Pandas groupby and aggregate to new columns

I'm trying to spread a column into several columns and sum its contents accordingly, i.e. reshape the dataframe from long to wide. For example, we have a column named Year with values from 2014 to 2016, and a column Sales with an amount. I want to turn Year into the columns 2014, 2015 and 2016, each holding the sum of Sales for that specific year. The original Sales column can either be dropped or show the total sales over all years.

I've tried to come up with a solution using the Pandas groupby(), agg() and transform() functions, to no avail. That is, I can't find a way to create the 2014, 2015 and 2016 columns.

Assume the following dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'CustomerId': [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5],
                   'CustomerName': ['McNulty','McNulty','McNulty',
                                    'Bunk','Bunk','Bunk',
                                    'Joe','Joe','Joe',
                                    'Rawls','Rawls','Rawls',
                                    'Davis','Davis','Davis'],
                   # random amounts, so the numbers differ on every run
                   'Sales': np.random.randint(1000,1500,15),
                   'Year': [2014,2015,2016,2014,2015,2016,2014,2015,2016,
                            2014,2015,2016,2014,2015,2016]})

The expected output should be as follows:

CustomerId CustomerName Sales 2014 2015 2016
1          McNulty      3300  1050 1050 1200
2          Bunk         3500  1100 1200 1200
3          Joe          3900  1300 1300 1300
4          Rawls        3500  1000 1000 1500
5          Davis        3800  1600 1100 1100

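You can use DataFrame.pivot_table with margins=True, which adds a total column named by margins_name. It also appends a totals row at the bottom, which .iloc[:-1] drops: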

df.pivot_table(index=['CustomerId', 'CustomerName'],
               columns=['Year'],
               values='Sales',
               margins=True,
               margins_name='Sales',
               aggfunc='sum').reset_index().iloc[:-1]

[out]

Year CustomerId CustomerName  2014  2015  2016  Sales
0             1      McNulty  1006  1325  1205   3536
1             2         Bunk  1267  1419  1257   3943
2             3          Joe  1348  1217  1323   3888
3             4        Rawls  1091  1390  1330   3811
4             5        Davis  1075  1316  1481   3872
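To make the role of margins=True and .iloc[:-1] more concrete, here is a minimal sketch on a small hand-written frame (the four rows and the names full, totals_row and per_customer are made up for illustration, they are not from the question). It keeps the totals row around before slicing it off, so you can see what is being dropped.

```python
import pandas as pd

# Small illustrative frame (values are made up, not the question's data)
df = pd.DataFrame({
    'CustomerId': [1, 1, 2, 2],
    'CustomerName': ['McNulty', 'McNulty', 'Bunk', 'Bunk'],
    'Sales': [1000, 1100, 1200, 1300],
    'Year': [2014, 2015, 2014, 2015],
})

# margins=True adds a total column AND a total row, both named 'Sales' here
full = df.pivot_table(index=['CustomerId', 'CustomerName'],
                      columns='Year',
                      values='Sales',
                      margins=True,
                      margins_name='Sales',
                      aggfunc='sum').reset_index()

totals_row = full.iloc[-1]     # grand totals over all customers
per_customer = full.iloc[:-1]  # one row per customer, as in the answer

print(per_customer)
print(totals_row)
```

If you would rather keep the grand totals, skip the .iloc[:-1] step and leave the last row in place.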

Alternatively, use pivot_table, flatten the resulting MultiIndex columns, and finally calculate the total with a sum over axis=1:

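# aggfunc defaults to 'mean', so pass 'sum' explicitly to get yearly totals
piv = df.pivot_table(index=['CustomerId', 'CustomerName'], columns='Year',
                     aggfunc='sum').reset_index()

# flatten ('Sales', 2014) -> 'Sales_2014' and ('CustomerId', '') -> 'CustomerId'
piv.columns = [f'{c1}_{c2}'.strip('_') for c1, c2 in piv.columns]

# total over the yearly Sales_* columns
piv['Sales'] = piv.filter(like='Sales').sum(axis=1)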

Output

   CustomerId CustomerName  Sales_2014  Sales_2015  Sales_2016  Sales
0           1      McNulty        1144        1007        1108   3259
1           2         Bunk        1146        1451        1169   3766
2           3          Joe        1455        1070        1351   3876
3           4        Rawls        1263        1004        1422   3689
4           5        Davis        1428        1431        1399   4258
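Since the question started from groupby(), the same reshape can probably also be written as a groupby followed by unstack. The following is only a sketch under that assumption, not something either answer above proposes; the name wide and the seeded random generator are illustrative choices.

```python
import numpy as np
import pandas as pd

# Same shape as the question's frame; seeded so the run is repeatable
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'CustomerId': np.repeat([1, 2, 3, 4, 5], 3),
    'CustomerName': np.repeat(['McNulty', 'Bunk', 'Joe', 'Rawls', 'Davis'], 3),
    'Sales': rng.integers(1000, 1500, 15),
    'Year': [2014, 2015, 2016] * 5,
})

# Sum Sales per customer and year, then move Year from the index to columns
wide = (df.groupby(['CustomerId', 'CustomerName', 'Year'])['Sales']
          .sum()
          .unstack('Year'))

# Row-wise total over the year columns, added before the index is reset
wide['Sales'] = wide.sum(axis=1)

wide = wide.reset_index()
wide.columns.name = None  # drop the leftover 'Year' label on the column axis

print(wide)
```

The year columns keep their integer labels (2014, 2015, 2016) here, so they are selected with wide[2014] rather than wide['2014'].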
