How to get each value in a row as a percentage of the row total in Python

I have the following data:

id  hours       class
1   67.91       V
1   65.56       V
1   51.14       V
1   41.51       V
1   33.55       V
1   26.45       G
1   26.09       V
1   25.77       G
1   25.50       P
1   25.13       G
1   24.49       P
1   21.88       B
1   18.57       V
1   17.90       B
...

18  92.2        B
18  81.06       V
18  70.48       V
18  67.10       B
18  62.92       B
18  62.88       V
18  54.36       B
18  52.77       V
18  44.55       V
18  40.61       P
18  40.51       P
18  40.06       V
18  37.67       V
18  33.78       B

I essentially need to pivot the data and calculate the total hours within each class as a percentage of the total hours for each household (id) in the data:

Expected output:

id  B       G       P       V       Total
1   8.44%   16.41%  10.60%  64.55%  100.00%
18  39.74%  0.0%    10.39%  49.87%  100.00%

Can someone please help me with this? It has to be done per id, i.e. row-wise. The data is in a pandas DataFrame.

I believe you need groupby + sum + unstack, or pivot_table, for pivoting:

df = df.groupby(['id','class'])['hours'].sum().unstack(fill_value=0)

df = df.pivot_table(index='id', columns='class', values='hours', aggfunc='sum', fill_value=0)
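
To see that the two pivoting options agree, here is a minimal sketch on a small made-up sample in the shape of the question's data. The names sample, by_groupby and by_pivot are only for illustration:

```python
import numpy as np
import pandas as pd

# Small illustrative sample in the shape of the question's data.
sample = pd.DataFrame({
    "id":    [1, 1, 1, 1, 18, 18, 18],
    "hours": [67.91, 26.45, 25.50, 21.88, 92.20, 81.06, 40.61],
    "class": ["V", "G", "P", "B", "B", "V", "P"],
})

# Option 1: aggregate per (id, class), then move 'class' into the columns.
by_groupby = sample.groupby(["id", "class"])["hours"].sum().unstack(fill_value=0)

# Option 2: pivot_table aggregates and reshapes in one call.
by_pivot = sample.pivot_table(index="id", columns="class", values="hours",
                              aggfunc="sum", fill_value=0)

# Both give one row per id and one column per class.
# id 18 has no 'G' rows in this sample, so that cell is filled with 0.
same = np.allclose(by_groupby.to_numpy(), by_pivot.to_numpy())
print(by_groupby)
print("identical values:", same)
```

Either form can be fed into the percentage step; which one reads better is mostly a matter of taste.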

Then divide each row by its row sum with div, multiply by 100, round, and finally add a new Total column with assign to check that each row adds up to 100 (thanks to Paul H for the idea):

df = df.div(df.sum(1), 0).mul(100).round(2).assign(Total=lambda df: df.sum(axis=1))
print (df)
class      B      G      P      V  Total
id                                      
1       8.44  16.41  10.60  64.55  100.0
18     39.74   0.00  10.39  49.87  100.0
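
The division step may be easier to follow with the axis arguments spelled out. This is only a sketch on made-up totals (totals, row_sums, pct and shown are illustrative names). It also computes Total before rounding, because summing already-rounded percentages can drift from 100 by a hundredth or so:

```python
import numpy as np
import pandas as pd

# Illustrative pivoted totals (hours per id and class); the numbers are made up.
totals = pd.DataFrame(
    {"B": [20.0, 90.0], "G": [30.0, 0.0], "P": [10.0, 30.0], "V": [140.0, 180.0]},
    index=pd.Index([1, 18], name="id"),
)

row_sums = totals.sum(axis=1)                 # one total per id
pct = totals.div(row_sums, axis=0).mul(100)   # axis=0 aligns row_sums with the row index

# Add Total from the unrounded shares, then round only for display.
shown = pct.assign(Total=pct.sum(axis=1)).round(2)
print(shown)
```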

To display the values as percentages, convert them to strings and append %:

df1 = df.astype(str) + '%'
print (df1)
class       B       G       P       V   Total
id                                           
1       8.44%  16.41%   10.6%  64.55%  100.0%
18     39.74%    0.0%  10.39%  49.87%  100.0%
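
Note that astype(str) drops trailing zeros (10.6% rather than 10.60%). If a fixed two-decimal display is wanted, as in the expected output, one possible sketch is to apply a format string column by column. The frame below just re-types the rounded values printed above:

```python
import pandas as pd

# Rounded percentages as printed above, re-typed here so the snippet runs on its own.
pct = pd.DataFrame(
    {"B": [8.44, 39.74], "G": [16.41, 0.0], "P": [10.60, 10.39],
     "V": [64.55, 49.87], "Total": [100.0, 100.0]},
    index=pd.Index([1, 18], name="id"),
)

# Format every cell with exactly two decimals and a trailing percent sign.
formatted = pct.apply(lambda col: col.map("{:.2f}%".format))
print(formatted)
```

The result holds strings, so keep the numeric frame around if further calculations are needed.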

Timings:

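import numpy as np
import pandas as pd

np.random.seed(123)
N = 100000
L = list('BGPV')

# 100,000 random rows spread over up to 20,000 ids and 4 classes
df = pd.DataFrame({'class': np.random.choice(L, N),
                   'hours': np.random.rand(N),
                   'id': np.random.randint(20000, size=N)})
print(df)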


def dark1(df):
    ndf = df.groupby('id').apply(lambda x : x.groupby('class')['hours'].sum()/x['hours'].sum())\
                          .reset_index().pivot(columns='class',index='id')*100
    return ndf.assign(Total=ndf.sum(1)).fillna(0)

def dark2(df):
    one =  df.groupby('id')['hours'].sum()
    two = df.pivot_table(index='id',values='hours',columns='class',aggfunc=sum)

    ndf = pd.DataFrame(two.values / one.values[:,None]*100,columns=two.columns)
    return ndf.assign(Total=ndf.sum(1)).fillna(0)

def jez1(df):
    df = df.groupby(['id','class'])['hours'].sum().unstack(fill_value=0)
    return df.div(df.sum(1), 0).mul(100).assign(Total=lambda df: df.sum(axis=1))

def jez2(df):
    df = df.pivot_table(index='id', columns='class', values='hours', aggfunc='sum', fill_value=0)
    return df.div(df.sum(1), 0).mul(100).assign(Total=lambda df: df.sum(axis=1))

print (dark1(df))
print (dark2(df))
print (jez1(df))
print (jez2(df))

In [39]: %timeit (dark1(df))
1 loop, best of 3: 15.4 s per loop

In [40]: %timeit (dark2(df))
10 loops, best of 3: 52.7 ms per loop

In [41]: %timeit (jez1(df))
10 loops, best of 3: 38.8 ms per loop

In [42]: %timeit (jez2(df))
10 loops, best of 3: 44.9 ms per loop

Caveat

These timings do not show how performance changes with the number of groups, which will affect the timings of some of these solutions.
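
One way to probe that caveat is to hold the row count fixed and vary only the number of distinct ids. The following is a rough sketch, not a rigorous benchmark. make_data and pivot_pct are hypothetical helpers written for this illustration, and the row count is kept small so it runs quickly:

```python
import timeit

import numpy as np
import pandas as pd


def make_data(n_rows, n_groups, seed=123):
    """Random data in the same shape as the timing setup (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    return pd.DataFrame({
        "class": rng.choice(list("BGPV"), n_rows),
        "hours": rng.rand(n_rows),
        "id": rng.randint(n_groups, size=n_rows),
    })


def pivot_pct(df):
    """Same approach as jez1: groupby + unstack, then divide by row totals."""
    wide = df.groupby(["id", "class"])["hours"].sum().unstack(fill_value=0)
    return wide.div(wide.sum(axis=1), axis=0).mul(100)


# Keep the row count fixed and vary only the number of distinct ids.
timings = {}
for n_groups in (10, 1000, 20000):
    data = make_data(20000, n_groups)
    timings[n_groups] = timeit.timeit(lambda: pivot_pct(data), number=3) / 3

for n_groups, seconds in timings.items():
    print(f"{n_groups:>6} groups: {seconds * 1000:.1f} ms per call")
```

The same loop could be pointed at any of the other functions to compare how each one scales.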

Another way is to use a nested groupby, i.e.:

ndf = df.groupby('id').apply(lambda x : x.groupby('class')['hours'].sum()/x['hours'].sum())\
                      .reset_index().pivot(columns='class',index='id')*100
ndf = ndf.assign(Total=ndf.sum(1)).fillna(0)

           hours                                  Total
class          B         G          P          V       
id                                                     
1       8.437798  16.40683  10.603457  64.551914  100.0
18     39.741341         0  10.387349  49.871311  100.0

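Or divide the pivoted values by the per-id totals using NumPy broadcasting. Because the result is rebuilt from raw arrays without an index, the rows are labelled 0, 1 instead of the ids 1, 18; passing index=two.index to pd.DataFrame would keep them: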

one =  df.groupby('id')['hours'].sum()
two = df.pivot_table(index='id',values='hours',columns='class',aggfunc=sum)

ndf = pd.DataFrame(two.values / one.values[:,None]*100,columns=two.columns)
ndf = ndf.assign(Total=ndf.sum(1)).fillna(0)

class          B         G          P          V  Total
0       8.437798  16.40683  10.603457  64.551914  100.0
1      39.741341   0.00000  10.387349  49.871311  100.0
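
The output just above is labelled 0 and 1 rather than 1 and 18 because the frame is rebuilt from raw arrays. A possible label-preserving variant, sketched on a small made-up sample, is to let DataFrame.div align on the id index instead of dividing the underlying arrays:

```python
import numpy as np
import pandas as pd

# Small illustrative sample in the shape of the question's data.
sample = pd.DataFrame({
    "id":    [1, 1, 1, 1, 18, 18, 18],
    "hours": [67.91, 26.45, 25.50, 21.88, 92.20, 81.06, 40.61],
    "class": ["V", "G", "P", "B", "B", "V", "P"],
})

one = sample.groupby("id")["hours"].sum()
two = sample.pivot_table(index="id", columns="class", values="hours", aggfunc="sum")

# div with axis=0 matches each row of `two` to its id in `one`, so the id labels survive.
ndf = two.div(one, axis=0).mul(100).fillna(0)
ndf = ndf.assign(Total=ndf.sum(axis=1))
print(ndf)
```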
