Given the following dataframe:
+------+-----+-----+
| Year | Cat | Bin |
+------+-----+-----+
| 2000 | A | 0 |
| 2000 | A | 1 |
| 2001 | A | 0 |
| 2001 | B | 1 |
| 2001 | B | 0 |
| 2001 | B | 1 |
+------+-----+-----+
import pandas as pd

d = {
    'year': [2000, 2000, 2001, 2001, 2001, 2001],
    'cat': ["A", "A", "A", "B", "B", "B"],
    'bin': [0, 1, 0, 1, 0, 1],
}
df = pd.DataFrame(data=d)
I want to create the following table:
+------+-----+------+-------+------+
| year | cat | mean | count | pct |
+------+-----+------+-------+------+
| 2000 | A | 0.5 | 2 | 100% |
| 2001 | A | 0 | 1 | 25% |
| 2001 | B | 0.67 | 3 | 75% |
+------+-----+------+-------+------+
where pct is the count by cat & year expressed as a percentage of the count by year.
I've got as far as the first two columns with:
df["count"] = 1
df_groupby = df.groupby(["year", "cat"]).agg({"bin": "mean", "count": "sum"})
df_groupby.rename(columns={"bin": "mean"}, inplace=True)
But I can't work out how to create the third column (pct).
Group on year and cat to calculate the mean and count, then get the number of rows per year with value_counts and divide the count per year and cat by it to calculate the percentage:
s = df.groupby(['year', 'cat'])['bin'].agg(['mean', 'count'])
s['pct'] = s['count'].div(df['year'].value_counts(), level=0, axis=0).mul(100)
              mean  count    pct
year cat
2000 A    0.500000      2  100.0
2001 A    0.000000      1   25.0
     B    0.666667      3   75.0
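The table requested in the question shows the mean rounded to two decimals and pct as a string such as 25%. A minimal sketch of that last formatting step, assuming the sample data from the question; the names out and table are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001, 2001],
    "cat": ["A", "A", "A", "B", "B", "B"],
    "bin": [0, 1, 0, 1, 0, 1],
})

# Same computation as in the answer above.
out = df.groupby(["year", "cat"])["bin"].agg(["mean", "count"])
out["pct"] = out["count"].div(df["year"].value_counts(), level=0).mul(100)

# Presentation only: flatten the index, round the mean, render pct as text.
table = out.reset_index()
table["mean"] = table["mean"].round(2)
table["pct"] = table["pct"].map("{:.0f}%".format)
print(table)
```

Note that after this step pct is a string column, so it is best done last, once no further arithmetic on pct is needed.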
Use SeriesGroupBy.value_counts and add the new column with DataFrame.join:
s = df.groupby("year")['cat'].value_counts(normalize=True).mul(100)
df1 = (df.groupby(["year", "cat"])
         .agg(mean=("bin", "mean"), count=("bin", "count"))
         .join(s.rename('pct')))
print(df1)
              mean  count    pct
year cat
2000 A    0.500000      2  100.0
2001 A    0.000000      1   25.0
     B    0.666667      3   75.0
Or with DataFrame.assign:
s = df.groupby("year")['cat'].value_counts(normalize=True).mul(100)
df1 = (df.groupby(["year", "cat"])
         .agg(mean=("bin", "mean"), count=("bin", "count"))
         .assign(pct=s))
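Both variants rely on index alignment: s is indexed by (year, cat), just like the aggregated frame, so join and assign match rows by label rather than by position. A small sketch, assuming the question's sample data (the names stats and joined are illustrative), that prints the intermediate Series so the alignment is visible:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001, 2001],
    "cat": ["A", "A", "A", "B", "B", "B"],
    "bin": [0, 1, 0, 1, 0, 1],
})

# Share of each cat within its year, as a percentage.
# value_counts orders by frequency within each year, so the row order
# may differ from that of the grouped frame below.
s = df.groupby("year")["cat"].value_counts(normalize=True).mul(100)
print(s)

stats = df.groupby(["year", "cat"]).agg(mean=("bin", "mean"),
                                        count=("bin", "count"))

# join matches on the (year, cat) index labels, not on row position.
joined = stats.join(s.rename("pct"))
print(joined)
```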
To get the percentage, after getting the mean and count, divide the initial count by the sum of the count grouped by year:
(df.groupby(['year', 'cat'])
   .agg(['mean', 'count'])
   .droplevel(0, axis=1)
   .assign(pct=lambda g: g['count']
                          .div(g.groupby('year')['count'].transform('sum'))
                          .mul(100))
)
              mean  count    pct
year cat
2000 A    0.500000      2  100.0
2001 A    0.000000      1   25.0
     B    0.666667      3   75.0
Note that for clarity, I'd suggest you do this in separate steps.
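One way the separate steps might look, as a sketch assuming the sample data from the question; the intermediate names stats and year_total are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001, 2001],
    "cat": ["A", "A", "A", "B", "B", "B"],
    "bin": [0, 1, 0, 1, 0, 1],
})

# Step 1: mean and count of bin per (year, cat).
stats = df.groupby(["year", "cat"])["bin"].agg(["mean", "count"])

# Step 2: total count per year, broadcast back onto every (year, cat) row.
year_total = stats.groupby(level="year")["count"].transform("sum")

# Step 3: each (year, cat) count as a percentage of its year's total.
stats["pct"] = stats["count"] / year_total * 100

print(stats)
```

Keeping year_total as its own variable makes it easy to inspect the denominator when the percentages look off.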