Pandas percentage of total row within multiindex

I have a dataframe that looks as follows:

import pandas as pd

df = pd.DataFrame([['Foo','A','Green',10,20],['Foo','A','Red',20,30],['Foo','A','Total',50,60],['Foo','B','Blue',5,10],['Foo','B','Red',15,25],['Foo','B','Total',40,100],['Foo','C','Orange',25,8],['Foo','C','Total',50,10]],columns = ['Default','Letter','Color','Value','Value2'])
print(df)

  Default Letter   Color  Value  Value2
0     Foo      A   Green     10      20
1     Foo      A     Red     20      30
2     Foo      A   Total     50      60
3     Foo      B    Blue      5      10
4     Foo      B     Red     15      25
5     Foo      B   Total     40     100
6     Foo      C  Orange     25       8
7     Foo      C   Total     50      10

I need to find what fraction of its group's Total row each color makes up.

My first thought was to split the color rows and the Total rows into separate dataframes, index both, and use .div. But in this case I have a MultiIndex (I know the first level is always Foo in my example, but that's not how the real data looks, so roll with it) and I get a NotImplementedError.

df_color = df[df['Color']!='Total'].set_index(['Default','Letter','Color'])
df_tot = df[df['Color']=='Total'].drop(['Color'],axis = 1).set_index(['Default','Letter'])

df_out = df_color.div(df_tot)

NotImplementedError                       Traceback (most recent call last)
<ipython-input-119-0caf0e2959a6> in <module>()
      4 df_tot = df[df['Color']=='Total'].drop(['Color'],axis = 1).set_index(['Default','Letter'])
      5 
----> 6 df_out = df_color.div(df_tot)
      7 #df.set_index(['Default','Letter','Color'],inplace = True)...

Here is my desired output:

df_out = pd.DataFrame([['Foo','A','Green',.2,.333],['Foo','A','Red',.4,.5],['Foo','B','Blue',.125,.1],['Foo','B','Red',.375,.25],['Foo','C','Orange',.5,.8]],columns = ['Default','Letter','Color','Value','Value2'])
print(df_out)

EDIT: note that there are actually many value columns. For simplicity I show only two here, but the solution needs to handle 50-100 numeric value columns.

You can do this with a groupby. Check out the tutorial on using groupby.

Note: this implementation assumes that the Total row of each group is the last row in that group (as in the example), but this is easy to change.

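cols = [x for x in df.columns if x not in ['Default', 'Letter', 'Color']]  # or df.columns[3:]
# group on both key columns, since the real data has more than one value in Default;
# divide every row of a group by that group's last row (assumed to be the Total row)
df[cols] = df.groupby(['Default', 'Letter'], group_keys=False)[cols].apply(lambda g: g / g.iloc[-1])
df[df['Color'] != 'Total']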

returns

  Default Letter   Color  Value    Value2
0     Foo      A   Green  0.200  0.333333
1     Foo      A     Red  0.400  0.500000
3     Foo      B    Blue  0.125  0.100000
4     Foo      B     Red  0.375  0.250000
6     Foo      C  Orange  0.500  0.800000
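
The note above says the ordering assumption is easy to relax. One possible way, shown here only as a sketch and not as part of the answer itself, is to pick the Total rows out by label and broadcast them over each group with transform. The names keys, value_cols, is_total, totals and result are illustrative, and the sketch assumes each (Default, Letter) group contains exactly one Total row.

```python
import pandas as pd

df = pd.DataFrame(
    [['Foo', 'A', 'Green', 10, 20], ['Foo', 'A', 'Red', 20, 30],
     ['Foo', 'A', 'Total', 50, 60], ['Foo', 'B', 'Blue', 5, 10],
     ['Foo', 'B', 'Red', 15, 25], ['Foo', 'B', 'Total', 40, 100],
     ['Foo', 'C', 'Orange', 25, 8], ['Foo', 'C', 'Total', 50, 10]],
    columns=['Default', 'Letter', 'Color', 'Value', 'Value2'])

keys = ['Default', 'Letter']  # assumed grouping columns
value_cols = [c for c in df.columns if c not in keys + ['Color']]
is_total = df['Color'] == 'Total'

# Blank out everything except the Total rows, then copy each group's
# Total values onto every row of that group ('first' skips NaN).
totals = (df[value_cols]
          .where(is_total)
          .groupby([df[k] for k in keys])
          .transform('first'))

# Keep only the color rows and divide them by their group's totals.
result = df.loc[~is_total].copy()
result[value_cols] = result[value_cols] / totals.loc[~is_total]
print(result)
```

Because the totals are located by the Color label rather than by position, the row order inside a group no longer matters.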

I ended up reshaping the dataframes with the melt function, so that the column name became another column in the data. Then I could simply merge and divide, and reshape back at the end.

df = pd.DataFrame([['Foo','A','Green',10,20],['Foo','A','Red',20,30],['Foo','A','Total',50,60],['Foo','B','Blue',5,10],['Foo','B','Red',15,25],['Foo','B','Total',40,100],['Foo','C','Orange',25,8],['Foo','C','Total',50,10]],columns = ['Default','Letter','Color','Value','Value2'])

df_color = df[df['Color']!='Total']
df_tot = df[df['Color']=='Total'].drop(['Color'],axis = 1)

# long format: one row per (Default, Letter, Color, value_field)
df_melt = pd.melt(df_color, id_vars=['Default','Letter','Color'], var_name='value_field')
df_tot_melt = pd.melt(df_tot, id_vars=['Default','Letter'], var_name='value_field', value_name='Total')

# attach the matching total to every color row, then divide
df_melt_pct = pd.merge(df_melt, df_tot_melt, how='outer', on=['Default','Letter','value_field'])
df_melt_pct['Pct'] = df_melt_pct['value'] / df_melt_pct['Total']

# back to wide format: one column per original value column
df_melt_pct = df_melt_pct.drop(['value','Total'], axis=1).set_index(['Default','Letter','Color','value_field']).unstack()
df_melt_pct.columns = df_melt_pct.columns.droplevel(0)

print(df_melt_pct)

value_field            Value    Value2
Default Letter Color                  
Foo     A      Green   0.200  0.333333
               Red     0.400  0.500000
        B      Blue    0.125  0.100000
               Red     0.375  0.250000
        C      Orange  0.500  0.800000
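
The same merge-and-divide idea can probably be applied without melting, by merging the totals onto the color rows in wide format. The following is a minimal sketch under that assumption, not the method used above. The '_total' suffix and the names keys, value_cols, total_cols, merged and df_pct are made up for illustration, and it assumes no value column already ends in '_total'.

```python
import pandas as pd

df = pd.DataFrame(
    [['Foo', 'A', 'Green', 10, 20], ['Foo', 'A', 'Red', 20, 30],
     ['Foo', 'A', 'Total', 50, 60], ['Foo', 'B', 'Blue', 5, 10],
     ['Foo', 'B', 'Red', 15, 25], ['Foo', 'B', 'Total', 40, 100],
     ['Foo', 'C', 'Orange', 25, 8], ['Foo', 'C', 'Total', 50, 10]],
    columns=['Default', 'Letter', 'Color', 'Value', 'Value2'])

keys = ['Default', 'Letter']  # assumed grouping columns
value_cols = [c for c in df.columns if c not in keys + ['Color']]
total_cols = [c + '_total' for c in value_cols]  # hypothetical suffix

df_color = df[df['Color'] != 'Total']
df_tot = df[df['Color'] == 'Total'].drop(columns='Color')

# Attach each group's totals to its color rows; total columns get the suffix.
merged = df_color.merge(df_tot, on=keys, how='left', suffixes=('', '_total'))

# Divide position-wise: value column i by total column i.
pct = pd.DataFrame(
    merged[value_cols].to_numpy() / merged[total_cols].to_numpy(),
    columns=value_cols, index=merged.index)
df_pct = pd.concat([merged[keys + ['Color']], pct], axis=1)
print(df_pct)
```

This keeps one row per color throughout, which may be easier to follow with 50-100 value columns than a melted frame, at the cost of temporarily doubling the number of columns.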
