I have the below data:
id hours class
1 67.91 V
1 65.56 V
1 51.14 V
1 41.51 V
1 33.55 V
1 26.45 G
1 26.09 V
1 25.77 G
1 25.50 P
1 25.13 G
1 24.49 P
1 21.88 B
1 18.57 V
1 17.90 B
...
18 92.2 B
18 81.06 V
18 70.48 V
18 67.10 B
18 62.92 B
18 62.88 V
18 54.36 B
18 52.77 V
18 44.55 V
18 40.61 P
18 40.51 P
18 40.06 V
18 37.67 V
18 33.78 B
I need to pivot the data and calculate the total hours within each class as a percentage of the total hours for each household (id) in the data:
Expected Output:
id B G P V Total
1 8.44% 16.41% 10.60% 64.55% 100.00%
18 39.74% 0.0% 10.39% 49.87% 100.00%
Can someone please help me with this? This has to be done per id, i.e. row-wise in the pivoted table. The data is in a pandas DataFrame.
I believe you need groupby + sum + unstack, or pivot_table, for pivoting:
df = df.groupby(['id','class'])['hours'].sum().unstack(fill_value=0)
df = df.pivot_table(index='id', columns='class', values='hours', aggfunc='sum', fill_value=0)
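As a quick check that the two pivoting calls above give the same table, here is a minimal sketch on a small toy frame. The frame `toy` and its values are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Hypothetical toy data (not the data from the question).
toy = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2],
    'class': ['B', 'V', 'V', 'B', 'P'],
    'hours': [10.0, 20.0, 10.0, 30.0, 10.0],
})

# Route 1: aggregate per (id, class), then move 'class' into the columns.
by_groupby = toy.groupby(['id', 'class'])['hours'].sum().unstack(fill_value=0)

# Route 2: let pivot_table aggregate and reshape in one call.
by_pivot = toy.pivot_table(index='id', columns='class', values='hours',
                           aggfunc='sum', fill_value=0)

print(by_groupby)
print(by_pivot)
```

In both results, a class that never occurs for an id (here P for id 1 and V for id 2) is filled with 0 rather than NaN.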
Then divide each row by its row sum with div, multiply by 100, round, and finally add a new column Total with assign to check that each row adds up to 100 (thanks to Paul H for the idea):
df = df.div(df.sum(1), 0).mul(100).round(2).assign(Total=lambda df: df.sum(axis=1))
print (df)
class B G P V Total
id
1 8.44 16.41 10.60 64.55 100.0
18 39.74 0.00 10.39 49.87 100.0
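To make the axis arguments in that chain explicit, here is a minimal sketch on a small, made-up pivoted table (`counts` is hypothetical). `sum(axis=1)` adds across the columns of each row, and `div(..., axis=0)` aligns those row totals on the index so every row is divided by its own total.

```python
import pandas as pd

# Hypothetical pivoted hours per id and class (not the question's data).
counts = pd.DataFrame(
    {'B': [10.0, 30.0], 'P': [0.0, 10.0], 'V': [30.0, 0.0]},
    index=pd.Index([1, 2], name='id'),
)

row_totals = counts.sum(axis=1)   # one total per id; same as counts.sum(1)

pct = (counts.div(row_totals, axis=0)   # divide each row by its own total
             .mul(100)
             .round(2)
             .assign(Total=lambda d: d.sum(axis=1)))
print(pct)
```

Because Total is computed after rounding, it can differ slightly from 100 in general (three equal shares round to 33.33 each and add up to 99.99), so it works as a sanity check rather than an exact guarantee.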
To display percentages, convert the values to strings and append %:
df1 = df.astype(str) + '%'
print (df1)
class B G P V Total
id
1 8.44% 16.41% 10.6% 64.55% 100.0%
18 39.74% 0.0% 10.39% 49.87% 100.0%
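Converting with astype(str) drops trailing zeros (10.6% instead of 10.60%). If a fixed two-decimal display is wanted, one possible approach is to format each value explicitly. This is a sketch, with `pct` rebuilt by hand from the rounded output printed above.

```python
import pandas as pd

# Rounded percentages copied from the printed result above.
pct = pd.DataFrame(
    {'B': [8.44, 39.74], 'G': [16.41, 0.0], 'P': [10.6, 10.39],
     'V': [64.55, 49.87], 'Total': [100.0, 100.0]},
    index=pd.Index([1, 18], name='id'),
)

# Format every cell with exactly two decimals and a trailing percent sign.
formatted = pct.apply(lambda col: col.map('{:.2f}%'.format))
print(formatted)
```

The result holds strings, so it is only suitable for display; keep the numeric frame for any further calculation.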
Timings:
import numpy as np
import pandas as pd
np.random.seed(123)
N = 100000
L = list('BGPV')
df = pd.DataFrame({'class': np.random.choice(L, N),
'hours':np.random.rand(N),
'id':np.random.randint(20000, size=N)})
print (df)
def dark1(df):
    ndf = df.groupby('id').apply(lambda x : x.groupby('class')['hours'].sum()/x['hours'].sum())\
            .reset_index().pivot(columns='class',index='id')*100
    return ndf.assign(Total=ndf.sum(1)).fillna(0)
def dark2(df):
    one = df.groupby('id')['hours'].sum()
    two = df.pivot_table(index='id',values='hours',columns='class',aggfunc=sum)
    ndf = pd.DataFrame(two.values / one.values[:,None]*100,columns=two.columns)
    return ndf.assign(Total=ndf.sum(1)).fillna(0)
def jez1(df):
    df = df.groupby(['id','class'])['hours'].sum().unstack(fill_value=0)
    return df.div(df.sum(1), 0).mul(100).assign(Total=lambda df: df.sum(axis=1))
def jez2(df):
    df = df.pivot_table(index='id', columns='class', values='hours', aggfunc='sum', fill_value=0)
    return df.div(df.sum(1), 0).mul(100).assign(Total=lambda df: df.sum(axis=1))
print (dark1(df))
print (dark2(df))
print (jez1(df))
print (jez2(df))
In [39]: %timeit (dark1(df))
1 loop, best of 3: 15.4 s per loop
In [40]: %timeit (dark2(df))
10 loops, best of 3: 52.7 ms per loop
In [41]: %timeit (jez1(df))
10 loops, best of 3: 38.8 ms per loop
In [42]: %timeit (jez2(df))
10 loops, best of 3: 44.9 ms per loop
Caveat:
These timings were taken for a single number of groups only. The number of distinct ids affects the timings of some of these solutions, so results may differ on other data.
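To explore how the number of groups affects the timings, one could rerun a solution on frames with different numbers of distinct ids. The following is only a sketch: the helper names `make_frame` and `row_percentages`, the row count and the group counts are all assumptions chosen for illustration, and the measured times depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def make_frame(n_rows, n_ids, seed=123):
    # Hypothetical helper: random data shaped like the benchmark frame.
    rng = np.random.RandomState(seed)
    return pd.DataFrame({'class': rng.choice(list('BGPV'), n_rows),
                         'hours': rng.rand(n_rows),
                         'id': rng.randint(n_ids, size=n_rows)})

def row_percentages(df):
    # Same idea as the groupby + unstack solution, without the Total column.
    wide = df.groupby(['id', 'class'])['hours'].sum().unstack(fill_value=0)
    return wide.div(wide.sum(axis=1), axis=0).mul(100)

timings = {}
for n_ids in (10, 1000):
    frame = make_frame(20000, n_ids)
    # Best of three single runs, in seconds.
    timings[n_ids] = min(timeit.repeat(lambda: row_percentages(frame),
                                       number=1, repeat=3))
    print(n_ids, 'ids:', timings[n_ids], 's')
```

Swapping `row_percentages` for any of the other functions gives a comparable measurement for that solution.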
Another way is to use a nested groupby, i.e.:
ndf = df.groupby('id').apply(lambda x : x.groupby('class')['hours'].sum()/x['hours'].sum())\
.reset_index().pivot(columns='class',index='id')*100
ndf = ndf.assign(Total=ndf.sum(1)).fillna(0)
hours Total
class B G P V
id
1 8.437798 16.40683 10.603457 64.551914 100.0
18 39.741341 0 10.387349 49.871311 100.0
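Or divide the pivoted values by the per-id totals with NumPy broadcasting (note that the new frame is built without the id index, so the rows are labelled 0, 1, ...):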
one = df.groupby('id')['hours'].sum()
two = df.pivot_table(index='id',values='hours',columns='class',aggfunc=sum)
ndf = pd.DataFrame(two.values / one.values[:,None]*100,columns=two.columns)
ndf = ndf.assign(Total=ndf.sum(1)).fillna(0)
class B G P V Total
0 8.437798 16.40683 10.603457 64.551914 100.0
1 39.741341 0.00000 10.387349 49.871311 100.0
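A further route, not taken from the answers above, is pd.crosstab with normalize='index', which pivots and divides by the row totals in one call. This is a minimal sketch on made-up toy data (`toy` is hypothetical); the trailing fillna(0) is a defensive assumption in case a missing id/class combination is left as NaN.

```python
import pandas as pd

# Hypothetical toy data (not the data from the question).
toy = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2],
    'class': ['B', 'V', 'V', 'B', 'P'],
    'hours': [10.0, 20.0, 10.0, 30.0, 10.0],
})

# Sum hours per (id, class) and normalize each row to fractions of its total.
share = pd.crosstab(toy['id'], toy['class'], values=toy['hours'],
                    aggfunc='sum', normalize='index')

pct = share.fillna(0).mul(100).round(2)
pct = pct.assign(Total=pct.sum(axis=1))
print(pct)
```

The index keeps the id labels here, unlike the NumPy variant above.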