简体   繁体   中英

get the percentage of each class pandas dataframe

my dataframe:

df = pd.DataFrame({'label':[0,0,1,1,2,2],"gender":['M','F','M','F','M','F'],'count':[100,200,150,210,300,220]})

I tried:

df['percent'] = 100* df['count'] / df.groupby('label')['count'].sum()

But it gives only for top 4 rows still wrong values.

I need a column with the name percent percentage of count based on label

Output:

    label   gender  count percent
0   0       M       100   33
1   0       F       200   67
2   1       M       150   41
3   1       F       210   59
4   2       M       300   57
5   2       F       220   43

Use GroupBy.transform for repeat aggregate values for Series with same size like original column, so possible divide:

df['percent'] = 100* df['count'] / df.groupby('label')['count'].transform('sum')
print (df)
   label gender  count    percent
0      0      M    100  33.333333
1      0      F    200  66.666667
2      1      M    150  41.666667
3      1      F    210  58.333333
4      2      M    300  57.692308
5      2      F    220  42.307692

Detail :

print (df.groupby('label')['count'].transform('sum'))
0    300
1    300
2    360
3    360
4    520
5    520
Name: count, dtype: int64

Another option, though more verbose than transform , is to use map :

mapping = df.label.map(df.groupby("label")["count"].sum())
df.assign(percent=df["count"].mul(100).div(mapping))

    label   gender  count   percent
0   0   M   100     33.333333
1   0   F   200     66.666667
2   1   M   150     41.666667
3   1   F   210     58.333333
4   2   M   300     57.692308
5   2   F   220     42.307692

One more option is to use set_index / reset_index trick.

d = df.set_index(['label', 'gender'])
d.mul(100).div(d.sum(level=0)).reset_index()

   label gender      count
0      0      M  33.333333
1      0      F  66.666667
2      1      M  41.666667
3      1      F  58.333333
4      2      M  57.692308
5      2      F  42.307692

For string presentation, you can use python's mini string language here.

d = df.set_index(['label', 'gender'])
d.div(d.sum(level=0))['count'].map("{:%}".format).reset_index()

   label gender       count
0      0      M  33.333333%
1      0      F  66.666667%
2      1      M  41.666667%
3      1      F  58.333333%
4      2      M  57.692308%
5      2      F  42.307692%

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM