How to get percent of category by group in pandas
Apologies if something similar has been asked before; I searched around but couldn't find a solution.
My dataset looks like this:
import pandas as pd

data1 = {'Group': ['Winner', 'Winner', 'Winner', 'Loser', 'Loser', 'Loser'],
         'MathStudy': ['Read', 'Read', 'Notes', 'Cheat', 'Cheat', 'Read'],
         'ScienceStudy': ['Notes', 'Read', 'Cheat', 'Cheat', 'Read', 'Notes']}
df1 = pd.DataFrame(data=data1)
I would like to get the percentage of the total for each category within each group, as shown below. In my dataset the number of winners and losers changes, so a flexible solution is appreciated.
Thank you in advance!
Use DataFrame.melt with crosstab and its normalize parameter:
df1 = df1.melt('Group', var_name='Type')
df2 = pd.crosstab([df1['Group'], df1['Type']], df1['value'], normalize=0)
print(df2)
value                   Cheat     Notes      Read
Group  Type
Loser  MathStudy     0.666667  0.000000  0.333333
       ScienceStudy  0.333333  0.333333  0.333333
Winner MathStudy     0.000000  0.333333  0.666667
       ScienceStudy  0.333333  0.333333  0.333333
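As a rough sketch of what normalize=0 is doing here: it divides every row of the count table by that row's total, so each (Group, Type) row sums to 1. The names long_df, counts, manual and normalized are introduced only for this illustration.

```python
import pandas as pd

data1 = {'Group': ['Winner', 'Winner', 'Winner', 'Loser', 'Loser', 'Loser'],
         'MathStudy': ['Read', 'Read', 'Notes', 'Cheat', 'Cheat', 'Read'],
         'ScienceStudy': ['Notes', 'Read', 'Cheat', 'Cheat', 'Read', 'Notes']}
df1 = pd.DataFrame(data=data1)

# Long format: one row per (Group, Type, value) observation
long_df = df1.melt('Group', var_name='Type')

# Raw counts for each (Group, Type) row and each study method column
counts = pd.crosstab([long_df['Group'], long_df['Type']], long_df['value'])

# Dividing each row by its own total gives the row-wise shares by hand
manual = counts.div(counts.sum(axis=1), axis=0)

# normalize=0 (equivalently normalize='index') asks crosstab to do the same
normalized = pd.crosstab([long_df['Group'], long_df['Type']],
                         long_df['value'], normalize=0)

print(manual)
print(normalized.sum(axis=1))
```

Because the division uses each row's own total, the result does not depend on how many winners or losers there are.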
Finally, if you need the MultiIndex converted to columns and the value columns name removed, add DataFrame.rename_axis with DataFrame.reset_index:
df2 = df2.rename_axis(columns=None).reset_index()
print(df2)
    Group          Type     Cheat     Notes      Read
0   Loser     MathStudy  0.666667  0.000000  0.333333
1   Loser  ScienceStudy  0.333333  0.333333  0.333333
2  Winner     MathStudy  0.000000  0.333333  0.666667
3  Winner  ScienceStudy  0.333333  0.333333  0.333333
@jezrael's solution is intuitive and is what I would try first. However, I recently learned that melt usually performs poorly. Here's an alternative if performance is important, e.g. in code that is run repeatedly:
# df1 is the original wide DataFrame from the question (not the melted one)
g = df1.groupby('Group')
cols = ['MathStudy', 'ScienceStudy']
out = (pd.concat({col: g[col].value_counts(normalize=True) for col in cols})
         .unstack(level=-1, fill_value=0)
      )
Run time:
2.9 ms ± 96.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Compared to the melt approach:
9.44 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
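The figures above look like IPython %timeit output and will vary by machine and pandas version. A minimal sketch for re-measuring both approaches with the standard timeit module follows; the function names melt_crosstab and concat_value_counts and the repeat count n are assumptions made for this illustration.

```python
import timeit

import pandas as pd

data1 = {'Group': ['Winner', 'Winner', 'Winner', 'Loser', 'Loser', 'Loser'],
         'MathStudy': ['Read', 'Read', 'Notes', 'Cheat', 'Cheat', 'Read'],
         'ScienceStudy': ['Notes', 'Read', 'Cheat', 'Cheat', 'Read', 'Notes']}
df1 = pd.DataFrame(data=data1)
cols = ['MathStudy', 'ScienceStudy']


def melt_crosstab(df):
    # First answer: reshape to long format, then a row-normalized crosstab
    long_df = df.melt('Group', var_name='Type')
    return pd.crosstab([long_df['Group'], long_df['Type']],
                       long_df['value'], normalize=0)


def concat_value_counts(df):
    # Second answer: per-column normalized value_counts within each group
    g = df.groupby('Group')
    return (pd.concat({col: g[col].value_counts(normalize=True) for col in cols})
              .unstack(level=-1, fill_value=0))


n = 20  # number of calls to average over; raise it for steadier numbers
t_melt = timeit.timeit(lambda: melt_crosstab(df1), number=n) / n
t_concat = timeit.timeit(lambda: concat_value_counts(df1), number=n) / n

print(f"melt + crosstab:       {t_melt * 1e3:.2f} ms per call")
print(f"concat + value_counts: {t_concat * 1e3:.2f} ms per call")
```

On a six-row frame the timings are dominated by fixed overhead, so it is worth repeating the measurement on data of realistic size before drawing conclusions.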
Output:
                        Cheat     Notes      Read
MathStudy    Loser   0.666667  0.000000  0.333333
             Winner  0.000000  0.333333  0.666667
ScienceStudy Loser   0.333333  0.333333  0.333333
             Winner  0.333333  0.333333  0.333333
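This output has the study column as the outer index level and Group as the inner one, the reverse of the crosstab layout. If the (Group, Type) ordering is preferred, one possible sketch is to swap the levels and sort; the names aligned, expected and max_diff are introduced here only for illustration.

```python
import pandas as pd

data1 = {'Group': ['Winner', 'Winner', 'Winner', 'Loser', 'Loser', 'Loser'],
         'MathStudy': ['Read', 'Read', 'Notes', 'Cheat', 'Cheat', 'Read'],
         'ScienceStudy': ['Notes', 'Read', 'Cheat', 'Cheat', 'Read', 'Notes']}
df1 = pd.DataFrame(data=data1)
cols = ['MathStudy', 'ScienceStudy']

g = df1.groupby('Group')
out = (pd.concat({col: g[col].value_counts(normalize=True) for col in cols})
         .unstack(level=-1, fill_value=0))

# Put Group first, then sort rows and columns so the layout matches crosstab
aligned = out.swaplevel().sort_index().sort_index(axis=1)

# Reference result from the melt + crosstab approach
long_df = df1.melt('Group', var_name='Type')
expected = pd.crosstab([long_df['Group'], long_df['Type']],
                       long_df['value'], normalize=0)

# Largest absolute difference between the two tables
max_diff = abs(aligned.to_numpy() - expected.to_numpy()).max()
print(aligned)
print(max_diff)
```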
Note: pd.crosstab is essentially groupby() with some additional bookkeeping, and a groupby on two columns is usually a lot slower.
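To make the relationship concrete, here is a small sketch showing that a plain groupby over all three keys followed by size() and unstack() yields the same count table as crosstab on this data. It only compares results and says nothing about how crosstab is implemented; via_crosstab, via_groupby and same are names made up for this example.

```python
import pandas as pd

data1 = {'Group': ['Winner', 'Winner', 'Winner', 'Loser', 'Loser', 'Loser'],
         'MathStudy': ['Read', 'Read', 'Notes', 'Cheat', 'Cheat', 'Read'],
         'ScienceStudy': ['Notes', 'Read', 'Cheat', 'Cheat', 'Read', 'Notes']}
df1 = pd.DataFrame(data=data1)

long_df = df1.melt('Group', var_name='Type')

# Count table from crosstab (no normalization)
via_crosstab = pd.crosstab([long_df['Group'], long_df['Type']], long_df['value'])

# The same counts from an explicit groupby on all three keys
via_groupby = (long_df.groupby(['Group', 'Type', 'value'])
                      .size()
                      .unstack('value', fill_value=0))

# True when every cell agrees
same = bool((via_groupby.to_numpy() == via_crosstab.to_numpy()).all())
print(via_groupby)
print(same)
```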
Here is another alternative:
# count each study method per (Group, column), then divide by the row totals
g = df1.set_index('Group').stack().str.get_dummies().groupby(level=[0, 1]).sum()
g.div(g.sum(axis=1), axis=0).round(2)