
How to calculate a percentage using two groups within groupby?

Given the following dataframe:

+------+-----+-----+
| year | cat | bin |
+------+-----+-----+
| 2000 | A   |   0 |
| 2000 | A   |   1 |
| 2001 | A   |   0 |
| 2001 | B   |   1 |
| 2001 | B   |   0 |
| 2001 | B   |   1 |
+------+-----+-----+

import pandas as pd

d = {
    'year': [2000, 2000, 2001, 2001, 2001, 2001],
    'cat': ["A", "A", "A", "B", "B", "B"],
    'bin': [0, 1, 0, 1, 0, 1],
}
df = pd.DataFrame(data=d)

I want to create the following table:

+------+-----+------+-------+------+
| year | cat | mean | count | pct  |
+------+-----+------+-------+------+
| 2000 | A   |  0.5 |     2 | 100% |
| 2001 | A   |    0 |     1 | 25%  |
| 2001 | B   | 0.67 |     3 | 75%  |
+------+-----+------+-------+------+

Here pct is the count by cat and year expressed as a percentage of the count by year.

I've got as far as the first two columns with:

df["count"] = 1
df_groupby = df.groupby(["year", "cat"]).agg({"bin": "mean", "count": "sum"})
df_groupby.rename(columns={"bin": "mean"}, inplace=True)

But I can't work out how to create the third column.

Group on year and cat to calculate mean and count, then get the number of rows per year with value_counts and divide the counts per year and cat by those yearly totals to get the percentage:

s = df.groupby(['year', 'cat'])['bin'].agg(['mean', 'count'])
s['pct'] = s['count'].div(df['year'].value_counts(), level=0, axis=0).mul(100)

              mean  count    pct
year cat                        
2000 A    0.500000      2  100.0
2001 A    0.000000      1   25.0
     B    0.666667      3   75.0
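To see what level=0 does in the division above, it can help to look at the two pieces separately. The following is a minimal sketch that rebuilds the sample data from the question; the names per_year_cat, per_year and pct are only illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001, 2001],
    "cat": ["A", "A", "A", "B", "B", "B"],
    "bin": [0, 1, 0, 1, 0, 1],
})

# Rows per (year, cat): a Series with a two-level index.
per_year_cat = df.groupby(["year", "cat"])["bin"].count()

# Rows per year: a Series indexed by year only.
per_year = df["year"].value_counts()

# level=0 matches per_year against the first index level (year) of
# per_year_cat, so each (year, cat) count is divided by its year's total.
pct = per_year_cat.div(per_year, level=0).mul(100)
print(pct)
```

Without level=0 the two indexes have different shapes (two levels versus one), so the division needs to be told which level to align on.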

Use SeriesGroupBy.value_counts and add the new column with DataFrame.join:

s = df.groupby("year")['cat'].value_counts(normalize=True).mul(100)
df1 = df.groupby(["year", "cat"]).agg(mean = ("bin", "mean"),
                                      count = ("bin", "count")).join(s.rename('pct'))

print(df1)
              mean  count    pct
year cat                        
2000 A    0.500000      2  100.0
2001 A    0.000000      1   25.0
     B    0.666667      3   75.0

Or with DataFrame.assign:

s = df.groupby("year")['cat'].value_counts(normalize=True).mul(100)
df1 = df.groupby(["year", "cat"]).agg(mean = ("bin", "mean"),
                                      count = ("bin", "count")).assign(pct = s)
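The table in the question shows pct as text such as 25%, while the value_counts result is numeric. If the string form is wanted, one possible approach is to format the numeric Series at the end. This is only a sketch on the question's sample data; the name labels is hypothetical, and rounding to whole percents is an assumption about the desired display.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001, 2001],
    "cat": ["A", "A", "A", "B", "B", "B"],
    "bin": [0, 1, 0, 1, 0, 1],
})

# Share of each cat within its year, in percent (numeric).
s = df.groupby("year")["cat"].value_counts(normalize=True).mul(100)

# Format for display only: whole percents with a trailing % sign.
labels = s.map("{:.0f}%".format)
print(labels)
```

Keeping the numeric column and formatting only for display avoids losing the ability to sort or do arithmetic on pct.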

To get the percentage, after computing mean and count, divide each count by the sum of the counts within its year:

(df.groupby(['year', 'cat'])
   .agg(['mean', 'count'])
   .droplevel(0, axis=1)   # drop the outer 'bin' column level
   .assign(pct=lambda g: g['count']
                          .div(g.groupby('year')['count'].transform('sum'))
                          .mul(100))
)


              mean  count    pct
year cat                        
2000 A    0.500000      2  100.0
2001 A    0.000000      1   25.0
     B    0.666667      3   75.0

Note that for clarity, I'd suggest you do this in separate steps.
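A step-by-step version of the same idea might look like the sketch below. It rebuilds the question's sample data so that it runs on its own; the names out and year_total are illustrative and not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001, 2001],
    "cat": ["A", "A", "A", "B", "B", "B"],
    "bin": [0, 1, 0, 1, 0, 1],
})

# Step 1: mean and count of bin per (year, cat).
out = df.groupby(["year", "cat"])["bin"].agg(["mean", "count"])

# Step 2: total count per year, broadcast back onto every (year, cat) row.
year_total = out.groupby(level="year")["count"].transform("sum")

# Step 3: each count as a percentage of its year's total.
out["pct"] = out["count"] / year_total * 100
print(out)
```

Because transform returns a result with the same index as out, the division in step 3 lines up row by row without any extra alignment arguments.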
