Pandas - count percentage of group size

Say I have data like this:

col1   col2 other columns..
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      0    ...
0      1    ...
0      1    ...
0      1    ...
0      1    ...
0      1    ...
0      1    ...
1      0    ...
1      0    ...
etc...

The data has been grouped by two columns (this is already the result of the grouping):

gr = df.groupby(['col1', 'col2']).size()

col1   col2        
0      0           10
       1           5
1      0           2
       1           16
2      0           10

Now I need to work out what fraction of its col1 group each (col1, col2) subgroup accounts for.

I need to add one more column, or (better) transform the result into a Series, holding the fraction of each col2 value within its col1 group, like this:

       col1        col2
0      0           0.66
       1           0.33
1      0           0.1
       1           0.9
2      0           1

Alternatively, it can be a separate Series for each col2 value: [0.66 0.1 1] and [0.33 0.9]. How can I implement this?

Let me describe what this data means. For example, it could be subjects (0, 1, 2), results (0 or 1), and the number of students per subject per result. The whole idea is to find out what percentage of students failed or passed subjects 0, 1, etc.

One more thing: sometimes a subject has only one result (0 or 1), for example a subject that all students passed. I still need to be able to tell that, for this subject, the fraction for result 0 is 0.0 and for result 1 is 1.

You need to group by the first level of the index and take the sum:

gr = df.groupby(['col1', 'col2']).size()
print (gr)
col1  col2
0     0       10
      1        5
1     0        2
      1       16
2     0       10
dtype: int64

print (gr.groupby(level=0).sum())
col1
0    15
1    18
2    10
dtype: int64

print (gr / gr.groupby(level=0).sum())
col1  col2
0     0       0.666667
      1       0.333333
1     0       0.111111
      1       0.888889
2     0       1.000000
dtype: float64
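As a self-contained variant of the division above, here is a minimal sketch that rebuilds the grouped counts by hand and uses `transform("sum")` instead of relying on index alignment. The names `index` and `share` are introduced only for illustration; the counts are copied from the question.

```python
import pandas as pd

# Rebuild the grouped counts from the question as a Series with a two-level index.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)], names=["col1", "col2"]
)
gr = pd.Series([10, 5, 2, 16, 10], index=index)

# transform("sum") returns the col1 total repeated on every row of that group,
# so both sides of the division carry exactly the same index.
share = gr / gr.groupby(level="col1").transform("sum")
print(share)
```

Both forms should give the same fractions; the `transform` version may be easier to follow because no broadcasting across index levels is involved.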

To store each Series separately, use a dict comprehension:

g1 = gr / gr.groupby(level=0).sum()
dfs = {i: g.reset_index(drop=True) for i, g in g1.groupby(level=1)}

print (dfs[0])
0    0.666667
1    0.111111
2    1.000000
dtype: float64

print (dfs[1])
0    0.333333
1    0.888889
dtype: float64
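The question also asks for a 0 when a subject has only one result. The per-result Series above simply omit such subjects (the Series for result 1 has two entries, not three). One possible way to keep them, sketched here under the assumption that a missing (col1, col2) combination means a count of zero, is to unstack with `fill_value=0` before dividing. The names `counts`, `shares`, and `per_result` are illustrative only.

```python
import pandas as pd

# Grouped counts from the question; subject 2 has no rows with result 1.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)], names=["col1", "col2"]
)
gr = pd.Series([10, 5, 2, 16, 10], index=index)

# Pivot col2 into columns; combinations that never occur become 0 instead of NaN.
counts = gr.unstack(fill_value=0)

# Divide each row by its total so every col1 row sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0)

# One Series per result value, indexed by col1 and including the zero entries.
per_result = {result: shares[result] for result in shares.columns}
print(per_result[1])
```

Here each Series is indexed by col1 rather than renumbered from 0, so the fraction stays attached to its subject.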

You could try this:

import pandas as pd

df = pd.DataFrame({'A':[0,1,0,1,0],'B':[10,5,2,16,10]}, index=[0,1,0,1,0])
# .ix was removed in pandas 1.0; .loc selects rows by index label
df2 = df.loc[0] / df.loc[0].sum()
df3 = df.loc[1] / df.loc[1].sum()

Hope this helps.
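If the ungrouped rows are still available, the per-label division above can likely be replaced by a single `pd.crosstab` call with `normalize="index"`. This is only a sketch: the `rows` list below is hypothetical data constructed to match the counts in the question, not the asker's actual frame.

```python
import pandas as pd

# Hypothetical raw rows whose group sizes match the counts in the question.
rows = [(0, 0)] * 10 + [(0, 1)] * 5 + [(1, 0)] * 2 + [(1, 1)] * 16 + [(2, 0)] * 10
df = pd.DataFrame(rows, columns=["col1", "col2"])

# normalize="index" divides each row of the contingency table by its row total,
# and combinations that never occur show up as 0.0.
shares = pd.crosstab(df["col1"], df["col2"], normalize="index")
print(shares)
```

Each column of `shares` is then the Series for one result value, indexed by col1.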
