How to get category percentages by group in pandas

Question

Apologies if something similar has been asked before; I searched around but couldn't find a solution.

My dataset looks like this:

import pandas as pd

data1 = {'Group':['Winner','Winner','Winner','Loser','Loser','Loser'],
        'MathStudy': ['Read','Read','Notes','Cheat','Cheat','Read'],
        'ScienceStudy': ['Notes','Read','Cheat','Cheat','Read','Notes']}
df1 = pd.DataFrame(data=data1)

I would like to get the percentage of the total for each category within each group, as shown below. In my dataset the number of winners and losers varies, so a flexible solution would be appreciated.

Thank you in advance!

Answer 1

Use DataFrame.melt with pd.crosstab and its normalize parameter:

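# Reshape to long format: one row per (Group, Type, value)
melted = df1.melt('Group', var_name='Type')

# normalize=0 (same as 'index') makes each row sum to 1
df2 = pd.crosstab([melted['Group'], melted['Type']], melted['value'], normalize=0)
print(df2)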
value                   Cheat     Notes      Read
Group  Type                                      
Loser  MathStudy     0.666667  0.000000  0.333333
       ScienceStudy  0.333333  0.333333  0.333333
Winner MathStudy     0.000000  0.333333  0.666667
       ScienceStudy  0.333333  0.333333  0.333333

Finally, if you need the MultiIndex converted to ordinary columns and the value name removed from the column axis, add DataFrame.rename_axis and DataFrame.reset_index:

df2 = df2.rename_axis(columns=None).reset_index()
print(df2)
    Group          Type     Cheat     Notes      Read
0   Loser     MathStudy  0.666667  0.000000  0.333333
1   Loser  ScienceStudy  0.333333  0.333333  0.333333
2  Winner     MathStudy  0.000000  0.333333  0.666667
3  Winner  ScienceStudy  0.333333  0.333333  0.333333
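Since the question asks for something that adapts to varying group sizes, the melt and crosstab steps above can be wrapped in a small helper. This is only a minimal sketch, not part of the original answer: the function name category_share_by_group and its group_col parameter are hypothetical, and it multiplies by 100 to report percentages rather than fractions.

```python
import pandas as pd


def category_share_by_group(df, group_col="Group"):
    """Percentage of each category per group and per original column.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Long format: one row per (group, original column name, category value)
    long_df = df.melt(group_col, var_name="Type")
    # Row-wise normalisation so that each (group, Type) row sums to 1
    shares = pd.crosstab(
        [long_df[group_col], long_df["Type"]],
        long_df["value"],
        normalize="index",
    )
    # Convert fractions to percentages
    return shares * 100


data1 = {
    "Group": ["Winner", "Winner", "Winner", "Loser", "Loser", "Loser"],
    "MathStudy": ["Read", "Read", "Notes", "Cheat", "Cheat", "Read"],
    "ScienceStudy": ["Notes", "Read", "Cheat", "Cheat", "Read", "Notes"],
}
df1 = pd.DataFrame(data1)

pct = category_share_by_group(df1)
print(pct.round(1))
```

Because the normalisation is done per row, the result does not depend on how many winners or losers there are.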

Answer 2

@jezrael's solution is intuitive and is what I would try first. However, I recently learned that melt usually performs poorly. Here's an alternative if performance matters, e.g. in code that is run repeatedly:

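# df1 is the original wide DataFrame from the question
g = df1.groupby('Group')
cols = ['MathStudy', 'ScienceStudy']
# Normalized value counts per group for each column, keyed by column name
out = (pd.concat({col: g[col].value_counts(normalize=True) for col in cols})
   .unstack(level=-1, fill_value=0)
)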

Run time:

2.9 ms ± 96.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Compared to the melt approach:

9.44 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Output:

                        Cheat     Notes      Read
MathStudy    Loser   0.666667  0.000000  0.333333
             Winner  0.000000  0.333333  0.666667
ScienceStudy Loser   0.333333  0.333333  0.333333
             Winner  0.333333  0.333333  0.333333

Note: pd.crosstab is essentially groupby() with some additional bookkeeping, and a groupby on two columns is usually a lot slower.
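The output here is indexed by (column name, Group), whereas the crosstab result in Answer 1 is indexed by (Group, Type). A rough sketch of how the two could be aligned and compared follows; the variable names by_crosstab, by_counts and aligned are illustrative and do not come from either answer.

```python
import numpy as np
import pandas as pd

data1 = {
    "Group": ["Winner", "Winner", "Winner", "Loser", "Loser", "Loser"],
    "MathStudy": ["Read", "Read", "Notes", "Cheat", "Cheat", "Read"],
    "ScienceStudy": ["Notes", "Read", "Cheat", "Cheat", "Read", "Notes"],
}
df1 = pd.DataFrame(data1)

# Answer 1 style: melt + crosstab, indexed by (Group, Type)
long_df = df1.melt("Group", var_name="Type")
by_crosstab = pd.crosstab(
    [long_df["Group"], long_df["Type"]], long_df["value"], normalize="index"
).sort_index().sort_index(axis=1)

# Answer 2 style: per-column value_counts, indexed by (column name, Group)
g = df1.groupby("Group")
cols = ["MathStudy", "ScienceStudy"]
by_counts = pd.concat(
    {col: g[col].value_counts(normalize=True) for col in cols}
).unstack(level=-1, fill_value=0)

# Swap the index levels and sort rows and columns so both tables line up
aligned = by_counts.swaplevel(0, 1).sort_index().sort_index(axis=1)

# Compare the underlying numbers; index names differ, so compare arrays only
same = np.allclose(aligned.to_numpy(), by_crosstab.to_numpy())
print(same)
```

Only the numeric values are compared because the two results carry different index level names.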

Answer 3

Here is another alternative:

# df1 is the DataFrame from the question
g = df1.set_index('Group').stack().str.get_dummies().groupby(level=[0,1]).sum()
g.div(g.sum(axis=1), axis=0).round(2)
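No output is shown for this approach. The following self-contained sketch runs the same chain on the question's sample data with each step split out; the intermediate names stacked, dummies, counts and shares are illustrative only.

```python
import pandas as pd

data1 = {
    "Group": ["Winner", "Winner", "Winner", "Loser", "Loser", "Loser"],
    "MathStudy": ["Read", "Read", "Notes", "Cheat", "Cheat", "Read"],
    "ScienceStudy": ["Notes", "Read", "Cheat", "Cheat", "Read", "Notes"],
}
df1 = pd.DataFrame(data1)

# Stack the study columns into one Series indexed by (Group, column name)
stacked = df1.set_index("Group").stack()
# One indicator column per category value
dummies = stacked.str.get_dummies()
# Count each category per (Group, column name)
counts = dummies.groupby(level=[0, 1]).sum()
# Divide each row by its total to turn counts into fractions
shares = counts.div(counts.sum(axis=1), axis=0).round(2)
print(shares)
```

Note that round(2) is applied at the end, so rows may no longer sum to exactly 1.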
