How to get each value in a row as a percentage of the row total in Python

I have the below data:

id  hours       class
1   67.91       V
1   65.56       V
1   51.14       V
1   41.51       V
1   33.55       V
1   26.45       G
1   26.09       V
1   25.77       G
1   25.50       P
1   25.13       G
1   24.49       P
1   21.88       B
1   18.57       V
1   17.90       B
...

18  92.2        B
18  81.06       V
18  70.48       V
18  67.10       B
18  62.92       B
18  62.88       V
18  54.36       B
18  52.77       V
18  44.55       V
18  40.61       P
18  40.51       P
18  40.06       V
18  37.67       V
18  33.78       B

I need to pivot the data and calculate the total hours within each class as a percentage of the total hours for each id (household) in the data:

Expected Output:

id  B       G       P       V       Total
1   8.44%   16.41%  10.60%  64.55%  100.00%
18  39.74%  0.0%    10.39%  49.87%  100.00%

Can someone please help me with this? It has to be done per id, i.e. row-wise in the pivoted table. The data is in a pandas DataFrame.

I believe you need either groupby + sum + unstack or pivot_table for pivoting (use one of the two lines below, not both):

df = df.groupby(['id','class'])['hours'].sum().unstack(fill_value=0)

df = df.pivot_table(index='id', columns='class', values='hours', aggfunc='sum', fill_value=0)
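As a quick check that the two pivoting options agree, here is a minimal sketch on a small made-up frame. The sample values and the names by_groupby and by_pivot are illustrative assumptions, not data from the question.

```python
import pandas as pd

# Hypothetical sample data (not the question's data); id 1 has no class B, id 18 has no class G
df = pd.DataFrame({
    "id": [1, 1, 1, 18, 18],
    "hours": [30.5, 10.25, 59.0, 25.5, 74.5],
    "class": ["V", "G", "V", "B", "V"],
})

# Option 1: sum hours per (id, class), then move class into the columns
by_groupby = df.groupby(["id", "class"])["hours"].sum().unstack(fill_value=0)

# Option 2: the same aggregation expressed as a pivot table
by_pivot = df.pivot_table(index="id", columns="class", values="hours",
                          aggfunc="sum", fill_value=0)

print(by_groupby)
print(by_pivot)
```

Both frames have one row per id and one column per class, with 0 where an id has no hours in a class.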

Then divide each row by its row sum with div, multiply by 100, round, and finally add a new column Total with assign to check that each row sums to 100 (thanks to Paul H for the idea):

df = df.div(df.sum(1), 0).mul(100).round(2).assign(Total=lambda df: df.sum(axis=1))
print (df)
class      B      G      P      V  Total
id                                      
1       8.44  16.41  10.60  64.55  100.0
18     39.74   0.00  10.39  49.87  100.0
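The role of the second argument to div may not be obvious: by default pandas aligns a Series with the column labels, while axis=0 aligns it with the row index, which is what row-wise percentages need. A minimal sketch on a hypothetical pivoted frame (the values and the names pivoted, row_totals, pct and result are assumptions for illustration):

```python
import pandas as pd

# Hypothetical pivoted hours: one row per id, one column per class
pivoted = pd.DataFrame(
    {"B": [0.0, 25.5], "G": [10.25, 0.0], "V": [89.5, 74.5]},
    index=pd.Index([1, 18], name="id"),
)

# Total hours per id (one value per row)
row_totals = pivoted.sum(axis=1)

# axis=0 matches row_totals to the row index, so each row is divided by its own total
pct = pivoted.div(row_totals, axis=0).mul(100)

# Round for display, then add a Total column as a sanity check
pct_rounded = pct.round(2)
result = pct_rounded.assign(Total=pct_rounded.sum(axis=1))
print(result)
```

Because Total here is the sum of already rounded values, it can in general differ from 100 by a few hundredths; summing before rounding avoids that if an exact check matters.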

To display percentages, convert the values to strings and append %:

df1 = df.astype(str) + '%'
print (df1)
class       B       G       P       V   Total
id                                           
1       8.44%  16.41%   10.6%  64.55%  100.0%
18     39.74%    0.0%  10.39%  49.87%  100.0%
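Note that astype(str) drops trailing zeros (10.6% above rather than the 10.60% in the expected output). If a fixed number of decimals is wanted, one possible approach is to format each value explicitly. This is a sketch that re-creates the rounded result shown above by hand; the names pct and formatted are assumptions:

```python
import pandas as pd

# Rounded percentages copied from the output above
pct = pd.DataFrame(
    {"B": [8.44, 39.74], "G": [16.41, 0.0], "P": [10.6, 10.39],
     "V": [64.55, 49.87], "Total": [100.0, 100.0]},
    index=pd.Index([1, 18], name="id"),
)

# Format every value with exactly two decimals and a trailing percent sign
formatted = pct.apply(lambda col: col.map("{:.2f}%".format))
print(formatted)
```

The formatted frame holds strings, so it is suitable for display only; keep the numeric frame for any further calculation.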

Timings:

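import numpy as np
import pandas as pd

# Reproducible random test data: 100,000 rows, 4 classes, ids drawn from 0..19999
np.random.seed(123)
N = 100000
L = list('BGPV')

df = pd.DataFrame({'class': np.random.choice(L, N),
                   'hours': np.random.rand(N),
                   'id': np.random.randint(20000, size=N)})
print(df)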


def dark1(df):
    ndf = df.groupby('id').apply(lambda x : x.groupby('class')['hours'].sum()/x['hours'].sum())\
                          .reset_index().pivot(columns='class',index='id')*100
    return ndf.assign(Total=ndf.sum(1)).fillna(0)

def dark2(df):
    one =  df.groupby('id')['hours'].sum()
    two = df.pivot_table(index='id',values='hours',columns='class',aggfunc=sum)

    ndf = pd.DataFrame(two.values / one.values[:,None]*100,columns=two.columns)
    return ndf.assign(Total=ndf.sum(1)).fillna(0)

def jez1(df):
    df = df.groupby(['id','class'])['hours'].sum().unstack(fill_value=0)
    return df.div(df.sum(1), 0).mul(100).assign(Total=lambda df: df.sum(axis=1))

def jez2(df):
    df = df.pivot_table(index='id', columns='class', values='hours', aggfunc='sum', fill_value=0)
    return df.div(df.sum(1), 0).mul(100).assign(Total=lambda df: df.sum(axis=1))

print (dark1(df))
print (dark2(df))
print (jez1(df))
print (jez2(df))

In [39]: %timeit (dark1(df))
1 loop, best of 3: 15.4 s per loop

In [40]: %timeit (dark2(df))
10 loops, best of 3: 52.7 ms per loop

In [41]: %timeit (jez1(df))
10 loops, best of 3: 38.8 ms per loop

In [42]: %timeit (jez2(df))
10 loops, best of 3: 44.9 ms per loop

Caveat

These timings use a single data shape (100,000 rows with ids drawn from 20,000 possible values). The number of groups affects the timings of some of these solutions, so the ranking may differ on other data.
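To see how the number of groups changes the picture, one could repeat the measurement for several id counts. The following is only a sketch using the standard timeit module; make_frame and pct_by_class are hypothetical helper names, the sizes are arbitrary, and the printed timings will depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows, n_ids, seed=123):
    """Build random test data with the same columns as the timing frame."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "class": rng.choice(list("BGPV"), n_rows),
        "hours": rng.random(n_rows),
        "id": rng.integers(n_ids, size=n_rows),
    })


def pct_by_class(df):
    """groupby + unstack, then divide each row by its row total."""
    wide = df.groupby(["id", "class"])["hours"].sum().unstack(fill_value=0)
    return wide.div(wide.sum(axis=1), axis=0).mul(100)


# Same number of rows, different number of groups
for n_ids in (10, 1000):
    frame = make_frame(20_000, n_ids)
    seconds = timeit.timeit(lambda: pct_by_class(frame), number=3) / 3
    print(f"{n_ids:>5} ids: {seconds * 1000:.1f} ms per call")
```

The same loop can wrap any of the four functions above to compare them across group counts.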

Another way is to use a nested groupby, i.e.:

ndf = df.groupby('id').apply(lambda x : x.groupby('class')['hours'].sum()/x['hours'].sum())\
                      .reset_index().pivot(columns='class',index='id')*100
ndf = ndf.assign(Total=ndf.sum(1)).fillna(0)

           hours                                  Total
class          B         G          P          V       
id                                                     
1       8.437798  16.40683  10.603457  64.551914  100.0
18     39.741341         0  10.387349  49.871311  100.0

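Or divide the pivoted sums by the per-id totals using NumPy broadcasting (note that the result below has a default 0, 1 index instead of id):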

one =  df.groupby('id')['hours'].sum()
two = df.pivot_table(index='id',values='hours',columns='class',aggfunc=sum)

ndf = pd.DataFrame(two.values / one.values[:,None]*100,columns=two.columns)
ndf = ndf.assign(Total=ndf.sum(1)).fillna(0)

class          B         G          P          V  Total
0       8.437798  16.40683  10.603457  64.551914  100.0
1      39.741341   0.00000  10.387349  49.871311  100.0
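The NumPy version above loses the id labels because the new DataFrame is built from bare arrays. A possible variation that keeps them is to pass the pivot table's index along; this is a sketch on made-up sample data, with totals, wide and pct as assumed names:

```python
import pandas as pd

# Hypothetical sample data (not the question's data)
df = pd.DataFrame({
    "id": [1, 1, 1, 18, 18],
    "hours": [30.5, 10.25, 59.0, 25.5, 74.5],
    "class": ["V", "G", "V", "B", "V"],
})

# Per-id totals and per-(id, class) sums; both come back sorted by id,
# so their rows line up positionally
totals = df.groupby("id")["hours"].sum()
wide = df.pivot_table(index="id", columns="class", values="hours", aggfunc="sum")

# [:, None] turns the totals into a column vector so each row is divided by its own total
pct = pd.DataFrame(
    wide.to_numpy() / totals.to_numpy()[:, None] * 100,
    index=wide.index,      # keep id as the row labels
    columns=wide.columns,
).fillna(0)
pct = pct.assign(Total=pct.sum(axis=1))
print(pct)
```

Reusing wide.index is safe here only because both intermediate results are sorted by id; with an unsorted groupby the positional division would need more care.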
