简体   繁体   中英

pandas groupby to calculate percentage of groupby columns

I want to calculate the rate_death percentage as below - (new_deaths / population) * 100 after grouping by location and summing new_deaths.

Example: for Afghanistan, rate_death must calculate as ((1+4+10) / 38928341) * 100 And for Albania, it must calculate as ((0+0+1) / 2877800) * 100

Below is the data and approaches which I tried but not working -

df_data
    location    date        new_cases   new_deaths  population
0   Afghanistan 4/25/2020   70          1           38928341
1   Afghanistan 4/26/2020   112         4           38928341
2   Afghanistan 4/27/2020   68          10          38928341
3   Albania    4/25/2020    15          0           2877800
4   Albania    4/26/2020    34          0           2877800
5   Albania    4/27/2020    14          1           2877800
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   location    6 non-null      object
 1   date        6 non-null      object
 2   new_cases   6 non-null      int64 
 3   new_deaths  6 non-null      int64 
 4   population  6 non-null      int64 

Approach 1:

df_res = df_data[['location','new_deaths','population']].groupby(['location']).sum()
location    new_deaths  population      
Afghanistan 15          116785023
Albania     1           8633400
df_res['rate_death'] = (df_res['new_deaths'] / df_res['population'] * 100.0)
location    new_deaths  population  rate_death          
Afghanistan 15          116785023   0.000
Albania     1           8633400     0.000

I know that the population is summing up twice due to the above groupby with 'sum' operation, but still I wonder why is the rate_death not calculating the percentage as expected but rather showing as 0.000

Approach 2: (tried as mentioned in this post - Pandas percentage of total with groupby )

location_population = df_data.groupby(['location', 'population']).agg({'new_deaths': 'sum'})
location = df_data.groupby(['location']).agg({'population': 'mean'})
location_population.div(location, level='location') * 100
location    population  new_deaths  population
        
Afghanistan 38928341    NaN         NaN
Albania     2877800     NaN         NaN

But it is coming as NaN.

Please help if anything wrong in these approaches or how to resolve this. Thanks!

You can do -

df = df.groupby(['location']).agg({'new_deaths': sum, 'population': max})
df['rate_death'] = df['new_deaths'] / df['population'] * 100

Result

             new_deaths  population  rate_death
location
Afghanistan          15    38928341    0.000039
Albania               1     2877800    0.000035

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM