I have the following dataframe in pandas:
df
DAY YEAR REGION VALUE
1 2000 A 12
2 2000 A 10
3 2000 A 13
6 2000 A 15
1 2001 A 3
2 2001 A 40
3 2001 A 83
4 2001 A 95
1 2000 B 124
3 2000 B 102
5 2000 B 131
8 2000 B 150
1 2001 B 30
5 2001 B 4
8 2001 B 8
9 2001 B 12
I would like to create a new data frame such that each row contains a distinct combination of YEAR and REGION. It also contains a column which sums up the VALUE for that YEAR, REGION combination and another column which provides the maximum VALUE for the YEAR, REGION combination. The result should look like:
YEAR REGION SUM_VALUE MAX_VALUE
2000 A 50 15
2001 A 221 95
2000 B 507 150
2001 B 54 30
Here is what I am doing:
new_df = pandas.DataFrame()
for yr in df.YEAR.unique():
    for reg in df.REGION.unique():
        new_df = new_df.append({'YEAR': yr}, ignore_index=True)
        new_df = new_df.append({'REGION': reg}, ignore_index=True)
However, this creates a new row on each append, and it is not very Pythonic because of the extra for loops. Is there a better way to proceed?
Please note that this is a toy dataframe; the actual dataframe has several VALUE columns. The proposed solution should scale without having to manually specify the names of the VALUE columns.
Use groupby on 'YEAR' and 'REGION', then pass a list of functions to agg:
In [9]:
df.groupby(['YEAR','REGION'])['VALUE'].agg(['sum','max']).reset_index()
Out[9]:
YEAR REGION sum max
0 2000 A 50 15
1 2000 B 507 150
2 2001 A 221 95
3 2001 B 54 30
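For anyone who wants to reproduce this, here is a minimal self-contained sketch that rebuilds the toy frame from the question and runs the same groupby/agg call. The trailing sort_values is my own addition, there only to match the row order shown in the question (REGION first, then YEAR); the name result is likewise just illustrative.

```python
import pandas as pd

# Toy data copied from the question.
df = pd.DataFrame({
    'DAY':    [1, 2, 3, 6, 1, 2, 3, 4, 1, 3, 5, 8, 1, 5, 8, 9],
    'YEAR':   [2000] * 4 + [2001] * 4 + [2000] * 4 + [2001] * 4,
    'REGION': ['A'] * 8 + ['B'] * 8,
    'VALUE':  [12, 10, 13, 15, 3, 40, 83, 95, 124, 102, 131, 150, 30, 4, 8, 12],
})

result = (
    df.groupby(['YEAR', 'REGION'])['VALUE']
      .agg(['sum', 'max'])              # one output column per function
      .reset_index()                    # turn YEAR/REGION back into columns
      .sort_values(['REGION', 'YEAR'])  # assumption: order rows as in the question
      .reset_index(drop=True)
)
print(result)
```

Without the sort, groupby orders the rows by the grouping keys as listed, which is why Out[9] shows YEAR first.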
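EDIT:
If you want to name the aggregated columns, use named aggregation (available from pandas 0.25). Passing a dict of new names to agg on a single column, as in agg({'sum_VALUE':'sum','max_VALUE':'max'}), was deprecated in pandas 0.20 and raises a SpecificationError from pandas 1.0 onward:
In [18]:
df.groupby(['YEAR','REGION'])['VALUE'].agg(sum_VALUE='sum', max_VALUE='max').reset_index()
Out[18]:
YEAR REGION sum_VALUE max_VALUE
0 2000 A 50 15
1 2000 B 507 150
2 2001 A 221 95
3 2001 B 54 30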
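The question also asks for something that scales to several VALUE columns without naming each one. One possible approach, sketched below, is to aggregate every column that is not a grouping key and then flatten the resulting two-level column index. The VALUE2 column, the shortened data, and the decision to exclude DAY are assumptions made for illustration; adjust the exclusion list to whatever non-value columns the real frame has.

```python
import pandas as pd

# Hypothetical frame with two value columns; VALUE2 is made up for illustration.
df = pd.DataFrame({
    'DAY':    [1, 2, 1, 2],
    'YEAR':   [2000, 2000, 2001, 2001],
    'REGION': ['A', 'A', 'A', 'A'],
    'VALUE':  [12, 10, 3, 40],
    'VALUE2': [1.5, 2.5, 4.0, 6.0],
})

keys = ['YEAR', 'REGION']
# Treat everything that is neither a grouping key nor DAY as a value column.
value_cols = list(df.columns.difference(keys + ['DAY']))

wide = df.groupby(keys)[value_cols].agg(['sum', 'max'])
# agg with a list yields two-level columns such as ('VALUE', 'sum');
# flatten them into names like SUM_VALUE and MAX_VALUE.
wide.columns = [f'{func.upper()}_{col}' for col, func in wide.columns]
wide = wide.reset_index()
print(wide)
```

Adding another value column to the frame then produces two more output columns with no change to the code.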