I have a dataframe that looks as follows:
import pandas as pd
df = pd.DataFrame([['Foo','A','Green',10,20],['Foo','A','Red',20,30],['Foo','A','Total',50,60],['Foo','B','Blue',5,10],['Foo','B','Red',15,25],['Foo','B','Total',40,100],['Foo','C','Orange',25,8],['Foo','C','Total',50,10]],columns = ['Default','Letter','Color','Value','Value2'])
print(df)
Default Letter Color Value Value2
0 Foo A Green 10 20
1 Foo A Red 20 30
2 Foo A Total 50 60
3 Foo B Blue 5 10
4 Foo B Red 15 25
5 Foo B Total 40 100
6 Foo C Orange 25 8
7 Foo C Total 50 10
I need to find the percentage of the Total row that each color makes up within each group.
My first thought was to split them into separate frames with different indexes and use .div, but in this case I have a MultiIndex (I know the first column always says Foo in my example, but that's not how the real data looks, so roll with it), and I get a NotImplementedError.
df_color = df[df['Color']!='Total'].set_index(['Default','Letter','Color'])
df_tot = df[df['Color']=='Total'].drop(['Color'],axis = 1).set_index(['Default','Letter'])
df_out = df_color.div(df_tot)
NotImplementedError Traceback (most recent call last)
<ipython-input-119-0caf0e2959a6> in <module>()
4 df_tot = df[df['Color']=='Total'].drop(['Color'],axis = 1).set_index(['Default','Letter'])
5
----> 6 df_out = df_color.div(df_tot)
7 #df.set_index(['Default','Letter','Color'],inplace = True)...
Here is my desired output:
df_out = pd.DataFrame([['Foo','A','Green',.2,.333],['Foo','A','Red',.4,.5],['Foo','B','Blue',.125,.1],['Foo','B','Red',.375,.25],['Foo','C','Orange',.5,.8]],columns = ['Default','Letter','Color','Value','Value2'])
print(df_out)
EDIT: Note that there are actually many value columns. For simplicity I show only two here, but the solution needs to handle 50-100 numerical value columns.
You can do this with a groupby. Check out the tutorial on using groupby.
Note: this implementation assumes that the Total entry for each group is the last row of that group (as in the example), but this is easy to modify.
cols = [x for x in df.columns if x not in ['Default', 'Letter', 'Color']] # or df.columns[3:]
# Divide every row of each (Default, Letter) group by that group's last row, assumed to be the Total row
df[cols] = df.groupby(['Default', 'Letter'], group_keys=False).apply(lambda g: g[cols] / g[cols].iloc[-1])
df[df['Color'] != 'Total']
returns
Default Letter Color Value Value2
0 Foo A Green 0.200 0.333333
1 Foo A Red 0.400 0.500000
3 Foo B Blue 0.125 0.100000
4 Foo B Red 0.375 0.250000
6 Foo C Orange 0.500 0.800000
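The note above says the last-row assumption is easy to relax. One possible way to do that, shown here only as a sketch, is to look up each group's Total row by its label instead of its position. It assumes exactly one Total row per (Default, Letter) group; the names `keys`, `is_total`, `totals`, and `pct` are illustrative and not part of the original answer. The rows are shuffled on purpose so that the Total rows are no longer guaranteed to come last.

```python
import pandas as pd

df = pd.DataFrame(
    [['Foo', 'A', 'Green', 10, 20], ['Foo', 'A', 'Red', 20, 30],
     ['Foo', 'A', 'Total', 50, 60], ['Foo', 'B', 'Blue', 5, 10],
     ['Foo', 'B', 'Red', 15, 25], ['Foo', 'B', 'Total', 40, 100],
     ['Foo', 'C', 'Orange', 25, 8], ['Foo', 'C', 'Total', 50, 10]],
    columns=['Default', 'Letter', 'Color', 'Value', 'Value2'],
)

# Shuffle so the Total rows are not necessarily last in their group.
df = df.sample(frac=1, random_state=0)

keys = ['Default', 'Letter']
cols = [c for c in df.columns if c not in keys + ['Color']]

is_total = df['Color'] == 'Total'

# Keep values only on the Total rows (NaN elsewhere), then broadcast each
# group's Total to every row of the group; 'first' skips the NaN placeholders.
totals = (
    df[cols]
    .where(is_total)
    .groupby([df[k] for k in keys])
    .transform('first')
)

# Divide, drop the Total rows, and restore the original row order.
pct = df[keys + ['Color']].join(df[cols] / totals)[~is_total].sort_index()
print(pct)
```

Because `cols` is derived from the column list, the same code should carry over to 50-100 value columns without changes.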
I ended up reshaping the DataFrames with the melt function so that the column name became another column in the data. Then I could simply merge, divide, and reshape back at the end.
import pandas as pd
df = pd.DataFrame([['Foo','A','Green',10,20],['Foo','A','Red',20,30],['Foo','A','Total',50,60],['Foo','B','Blue',5,10],['Foo','B','Red',15,25],['Foo','B','Total',40,100],['Foo','C','Orange',25,8],['Foo','C','Total',50,10]],columns = ['Default','Letter','Color','Value','Value2'])
df_color = df[df['Color']!='Total']
df_tot = df[df['Color']=='Total'].drop(['Color'],axis = 1)
# Melt both frames to long form so the value-column name becomes a merge key
df_melt = pd.melt(df_color,id_vars = ['Default','Letter', 'Color'],var_name = 'value_field')
df_tot_melt = pd.melt(df_tot,id_vars = ['Default','Letter'],var_name = 'value_field', value_name = 'Total')
# Attach each group's total to every color row, then divide
df_melt_pct = pd.merge(df_melt, df_tot_melt, how = 'outer', on = ['Default','Letter','value_field'])
df_melt_pct['Pct'] = df_melt_pct['value'] /df_melt_pct['Total']
# Pivot value_field back into columns
df_melt_pct = df_melt_pct.drop(['value','Total'],axis = 1).set_index(['Default','Letter','Color','value_field']).unstack()
df_melt_pct.columns = df_melt_pct.columns.droplevel(0)
print(df_melt_pct)
value_field Value Value2
Default Letter Color
Foo A Green 0.200 0.333333
Red 0.400 0.500000
B Blue 0.125 0.100000
Red 0.375 0.250000
C Orange 0.500 0.800000
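The same merge-and-divide idea might also be applied in wide form, without melting, by giving the total columns a suffix. This is a sketch under that assumption and not the approach used above; the `_total` suffix and the names `keys`, `cols`, `total_cols`, and `pct` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [['Foo', 'A', 'Green', 10, 20], ['Foo', 'A', 'Red', 20, 30],
     ['Foo', 'A', 'Total', 50, 60], ['Foo', 'B', 'Blue', 5, 10],
     ['Foo', 'B', 'Red', 15, 25], ['Foo', 'B', 'Total', 40, 100],
     ['Foo', 'C', 'Orange', 25, 8], ['Foo', 'C', 'Total', 50, 10]],
    columns=['Default', 'Letter', 'Color', 'Value', 'Value2'],
)

keys = ['Default', 'Letter']
cols = [c for c in df.columns if c not in keys + ['Color']]
total_cols = [c + '_total' for c in cols]  # hypothetical naming scheme

df_color = df[df['Color'] != 'Total']
df_tot = df[df['Color'] == 'Total'].drop(columns='Color')

# Each color row picks up its group's totals as extra '<name>_total' columns.
# validate raises if a group has more than one Total row.
merged = df_color.merge(
    df_tot, on=keys, how='left', suffixes=('', '_total'), validate='many_to_one'
)

# Divide position by position; to_numpy() avoids aligning on column names.
pct = merged[keys + ['Color']].join(
    merged[cols].div(merged[total_cols].to_numpy())
)
print(pct)
```

This keeps one row per color throughout, which may be easier to inspect than the long form when there are many value columns.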