
How to get percent of category by group in pandas

Apologies if something similar has been asked before; I searched around but couldn't figure out a solution.

My dataset looks like this:

import pandas as pd

data1 = {'Group': ['Winner', 'Winner', 'Winner', 'Loser', 'Loser', 'Loser'],
         'MathStudy': ['Read', 'Read', 'Notes', 'Cheat', 'Cheat', 'Read'],
         'ScienceStudy': ['Notes', 'Read', 'Cheat', 'Cheat', 'Read', 'Notes']}
df1 = pd.DataFrame(data=data1)


I would like to get the percentage of the total for each category within each group. In my dataset the number of winners and losers changes, so a flexible solution is appreciated.

Thank you in advance!

Use DataFrame.melt with pd.crosstab and its normalize parameter:

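# Reshape to long format: one row per (Group, Type, value)
df1 = df1.melt('Group', var_name='Type')

# Count values per (Group, Type); normalize=0 makes each row sum to 1
df2 = pd.crosstab([df1['Group'], df1['Type']], df1['value'], normalize=0)
print(df2)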
value                   Cheat     Notes      Read
Group  Type                                      
Loser  MathStudy     0.666667  0.000000  0.333333
       ScienceStudy  0.333333  0.333333  0.333333
Winner MathStudy     0.000000  0.333333  0.666667
       ScienceStudy  0.333333  0.333333  0.333333
 

Finally, if you need the MultiIndex levels as regular columns and want to drop the "value" name from the columns axis, add DataFrame.rename_axis with DataFrame.reset_index:

df2 = df2.rename_axis(columns=None).reset_index()
print(df2)
    Group          Type     Cheat     Notes      Read
0   Loser     MathStudy  0.666667  0.000000  0.333333
1   Loser  ScienceStudy  0.333333  0.333333  0.333333
2  Winner     MathStudy  0.000000  0.333333  0.666667
3  Winner  ScienceStudy  0.333333  0.333333  0.333333
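For reuse, the melt and crosstab steps above can be wrapped in one function. This is a minimal sketch and not part of the original answer: the function name `category_share_by_group` and the final scaling to percentages are assumptions added for illustration.

```python
import pandas as pd

data1 = {'Group': ['Winner', 'Winner', 'Winner', 'Loser', 'Loser', 'Loser'],
         'MathStudy': ['Read', 'Read', 'Notes', 'Cheat', 'Cheat', 'Read'],
         'ScienceStudy': ['Notes', 'Read', 'Cheat', 'Cheat', 'Read', 'Notes']}
df1 = pd.DataFrame(data1)


def category_share_by_group(df, group_col='Group'):
    """Hypothetical helper: share of each category per (group, column)."""
    # Long format: one row per (group, original column name, value)
    long_df = df.melt(group_col, var_name='Type')
    # normalize='index' is the named equivalent of normalize=0 (row-wise)
    shares = pd.crosstab([long_df[group_col], long_df['Type']],
                         long_df['value'], normalize='index')
    # Flatten the MultiIndex into columns and drop the 'value' axis name
    return shares.rename_axis(columns=None).reset_index()


result = category_share_by_group(df1)

# Optional: express the fractions as percentages rounded to one decimal
pct = result.copy()
value_cols = [c for c in pct.columns if c not in ('Group', 'Type')]
pct[value_cols] = (pct[value_cols] * 100).round(1)
print(pct)
```

Because the shares are computed per row, the function does not depend on how many winners or losers the data contains.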

@jezrael's solution is intuitive and is what I would reach for first. However, I recently learned that melt often performs poorly. Here's an alternative if performance is important, e.g. in code that runs repeatedly:

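# df1 here is the original wide DataFrame from the question (not the melted one)
g = df1.groupby('Group')
cols = ['MathStudy', 'ScienceStudy']
# Normalized value counts per group for each column, keyed by column name
out = (pd.concat({col: g[col].value_counts(normalize=True) for col in cols})
         .unstack(level=-1, fill_value=0)
)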

Run time:

2.9 ms ± 96.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Compared to the melt approach:

9.44 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Output:

                        Cheat     Notes      Read
MathStudy    Loser   0.666667  0.000000  0.333333
             Winner  0.000000  0.333333  0.666667
ScienceStudy Loser   0.333333  0.333333  0.333333
             Winner  0.333333  0.333333  0.333333

Note: pd.crosstab is essentially groupby() with some additional bookkeeping, and a groupby on two columns is usually a lot slower.
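Before swapping one approach for the other, it may be worth checking that both produce the same numbers. The sketch below is an illustrative check rather than part of the original answer. It rebuilds the question's data, runs both versions, and compares the values after sorting both axes; the names `ref` and `same` are assumptions.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Group': ['Winner', 'Winner', 'Winner', 'Loser', 'Loser', 'Loser'],
    'MathStudy': ['Read', 'Read', 'Notes', 'Cheat', 'Cheat', 'Read'],
    'ScienceStudy': ['Notes', 'Read', 'Cheat', 'Cheat', 'Read', 'Notes'],
})
cols = ['MathStudy', 'ScienceStudy']

# groupby / value_counts version: index is (column name, Group)
g = df1.groupby('Group')
out = (pd.concat({col: g[col].value_counts(normalize=True) for col in cols})
         .unstack(level=-1, fill_value=0))

# melt / crosstab version, with the index levels ordered the same way
long_df = df1.melt('Group', var_name='Type')
ref = pd.crosstab([long_df['Type'], long_df['Group']],
                  long_df['value'], normalize='index')

# Compare the numeric values only, after sorting rows and columns
same = np.allclose(out.sort_index().sort_index(axis=1).to_numpy(),
                   ref.sort_index().sort_index(axis=1).to_numpy())
print(same)
```

The timings quoted above come from the answer's own run on this small frame; results on other data sizes or pandas versions may differ, so measuring on your own data is advisable.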

Here is another alternative:

# df1 is the original wide DataFrame from the question
g = df1.set_index('Group').stack().str.get_dummies().groupby(level=[0, 1]).sum()
g.div(g.sum(axis=1), axis=0).round(2)
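The one-liner above packs several steps together. The following sketch splits it into named intermediate steps on the question's data; the variable names `stacked`, `dummies`, `counts`, and `shares` are assumptions introduced for readability.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Group': ['Winner', 'Winner', 'Winner', 'Loser', 'Loser', 'Loser'],
    'MathStudy': ['Read', 'Read', 'Notes', 'Cheat', 'Cheat', 'Read'],
    'ScienceStudy': ['Notes', 'Read', 'Cheat', 'Cheat', 'Read', 'Notes'],
})

# Series indexed by (Group, original column name) holding the category labels
stacked = df1.set_index('Group').stack()

# One indicator column per category (Cheat, Notes, Read)
dummies = stacked.str.get_dummies()

# Sum the indicators to get counts per (Group, column name)
counts = dummies.groupby(level=[0, 1]).sum()

# Divide each row by its total to get shares, rounded to two decimals
shares = counts.div(counts.sum(axis=1), axis=0).round(2)
print(shares)
```

Note that rounding to two decimals means a row may sum to 0.99 or 1.01 instead of exactly 1.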
